Wafer-Scale Fast Fourier Transforms

Proceedings of the 37th International Conference on Supercomputing. Pub Date: 2022-09-29. DOI: 10.1145/3577193.3593708

Marcelo Orenes-Vera, I. Sharapov, R. Schreiber, M. Jacquelin, Philippe Vandermersch, Sharan Chetlur


Citations: 4

Abstract

We have implemented fast Fourier transforms for one-, two-, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an n³ problem with up to n² PEs. At this limit, a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs FFTs along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the size of the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. We analyze in detail the computation and communication time, as well as the weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for a 3D FFT of a 512³ complex input array using a 512 × 512 subgrid of the on-wafer PEs. This is the largest ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.
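The pencil decomposition described in the abstract can be illustrated with a small serial sketch. This is not the authors' wsFFT code: it only models an n × n grid of PEs as nested Python lists, where `grid[i][j]` stands for the pencil held in the local memory of PE (i, j). All function names are hypothetical, the 1D transform is a plain recursive radix-2 FFT chosen for brevity, and the order in which the axes are transformed is an assumption.

```python
import cmath


def fft1d(v):
    """Recursive radix-2 Cooley-Tukey FFT; len(v) must be a power of two."""
    n = len(v)
    if n == 1:
        return list(v)
    even = fft1d(v[0::2])
    odd = fft1d(v[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def local_ffts(grid):
    """One superstep: every PE transforms the pencil in its own memory."""
    return [[fft1d(pencil) for pencil in row] for row in grid]


def exchange_within_rows(grid):
    """All-to-all inside each mesh row: PE (i, j) sends element k to PE (i, k).

    With n elements per pencil and n PEs per row, each pair of PEs
    exchanges exactly one element.
    """
    n = len(grid)
    return [[[grid[i][j][k] for j in range(n)] for k in range(n)]
            for i in range(n)]


def exchange_within_columns(grid):
    """All-to-all inside each mesh column: PE (i, k) sends element j to PE (j, k)."""
    n = len(grid)
    return [[[grid[i][k][j] for i in range(n)] for k in range(n)]
            for j in range(n)]


def pencil_fft3d(x):
    """3D FFT of an n x n x n nested list via three supersteps and two exchanges."""
    n = len(x)
    g = local_ffts(x)               # superstep 1: FFT along axis 2
    g = exchange_within_rows(g)     # pencils now run along axis 1
    g = local_ffts(g)               # superstep 2: FFT along axis 1
    g = exchange_within_columns(g)  # pencils now run along axis 0
    g = local_ffts(g)               # superstep 3: FFT along axis 0
    # g[b][c][a] holds frequency (a, b, c); reorder to natural indexing.
    return [[[g[b][c][a] for c in range(n)] for b in range(n)]
            for a in range(n)]
```

In this model each PE holds n elements and each exchange involves the n PEs of one row or column, so every pair of PEs trades a single element, which matches the abstract's remark that messages can be as small as a single word. For the reported 512³ case on a 512 × 512 subgrid, that is 512 complex values per PE, or 4096 bytes if an FP32 complex value is stored as two 32-bit floats.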
