Wafer-Scale Fast Fourier Transforms

Marcelo Orenes-Vera, I. Sharapov, R. Schreiber, M. Jacquelin, Philippe Vandermersch, Sharan Chetlur
{"title":"Wafer-Scale Fast Fourier Transforms","authors":"Marcelo Orenes-Vera, I. Sharapov, R. Schreiber, M. Jacquelin, Philippe Vandermersch, Sharan Chetlur","doi":"10.1145/3577193.3593708","DOIUrl":null,"url":null,"abstract":"We have implemented fast Fourier transforms for one, two, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes a n3 problem with up to n2 PEs. At this point, a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the size of the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. We analyze in detail computation and communication time, as well as the weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for 3D FFT of a 5123 complex input array using a 512 × 512 subgrid of the on-wafer PEs. This is the largest ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577193.3593708","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

We have implemented fast Fourier transforms for one, two, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes a n3 problem with up to n2 PEs. At this point, a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the size of the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. We analyze in detail computation and communication time, as well as the weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for 3D FFT of a 5123 complex input array using a 512 × 512 subgrid of the on-wafer PEs. This is the largest ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
晶圆尺度快速傅里叶变换
我们已经在Cerebras CS-2上实现了一维、二维和三维阵列的快速傅立叶变换,该系统的存储和处理元件位于单个硅片上。晶圆级引擎(WSE)包含大约850,000个处理元素(pe)的二维网格,具有快速的本地内存和同样快速的最近邻互连。我们的晶圆级FFT (wsFFT)用最多n2个pe并行处理n3问题。此时,PE每个超步只处理3D域的单个向量(称为铅笔),其中三个超步中的每一个都沿着输入数组的三个轴之一执行FFT。在超级步骤之间,wsFFT重新分配(转置)数据,将每个一维铅笔的所有元素转换到单个PE的存储器中。每次重新分配都会导致沿一个网格维度进行全对全通信。在给定并行度的情况下,pe对之间传输的消息大小可以小到一个单词。理论上,由于网格的二分带宽有限,它不是全对全通信的理想选择。然而,在WSE上连接pe的网格完全位于晶圆上,即使是很小的消息也能达到接近峰值的带宽。在FP16和FP32两种精度下,详细分析了计算时间和通信时间,以及弱尺度和强尺度。使用CS-2上的32位算法,我们使用片上pe的512 × 512子网格,实现了5123复杂输入阵列的959微秒3D FFT。对于这个问题规模,这是有史以来最大的并行化,也是第一个打破毫秒限制的实现。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
FLORIA: A Fast and Featherlight Approach for Predicting Cache Performance FT-topo: Architecture-Driven Folded-Triangle Partitioning for Communication-efficient Graph Processing Using Additive Modifications in LU Factorization Instead of Pivoting GRAP: Group-level Resource Allocation Policy for Reconfigurable Dragonfly Network in HPC Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1