Fast Stencil-Code Computation on a Wafer-Scale Processor

SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. Pub Date: 2020-10-07. DOI: 10.1109/SC41405.2020.00062

K. Rocki, D. V. Essendelft, I. Sharapov, R. Schreiber, Michael Morrison, V. Kibardin, Andrey Portnoy, J. Dietiker, M. Syamlal, Michael James


Citations: 45

Abstract



The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems of equations on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single wafer-scale system for the solution by BiCGStab of a linear system arising from a 7-point finite difference stencil on a $600\times 595\times 1536$ mesh, achieving about one third of the machine’s peak performance. We explain the system, its architecture and programming, and its performance on this problem and related problems. We discuss issues of memory capacity and floating point precision. We outline plans to extend this work towards full applications.
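The kind of computation the abstract refers to, BiCGStab applied to a linear system from a 7-point finite difference stencil, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the stencil coefficients (`diag=6.5`, `off=-1.0`), zero values outside the grid, the tiny 12 x 10 x 16 mesh, double precision, the absence of a preconditioner, and the function names `apply_stencil` and `bicgstab`. It does not represent the CS-1 implementation or its data layout.

```python
import numpy as np


def apply_stencil(x, diag=6.5, off=-1.0):
    """Matrix-free 7-point stencil on a 3D grid.

    y = diag * x + off * (sum of the six face neighbours).
    Neighbours outside the grid are treated as zero (assumed boundary).
    The coefficients are hypothetical, chosen to be diagonally dominant.
    """
    y = diag * x  # new array, so the in-place updates below do not touch x
    y[1:, :, :] += off * x[:-1, :, :]
    y[:-1, :, :] += off * x[1:, :, :]
    y[:, 1:, :] += off * x[:, :-1, :]
    y[:, :-1, :] += off * x[:, 1:, :]
    y[:, :, 1:] += off * x[:, :, :-1]
    y[:, :, :-1] += off * x[:, :, 1:]
    return y


def bicgstab(apply_a, b, tol=1e-10, max_iter=500):
    """Unpreconditioned BiCGStab for A x = b, with A given as a function."""
    x = np.zeros_like(b)
    r = b - apply_a(x)
    r_hat = r.copy()  # fixed shadow residual
    rho = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    b_norm = np.linalg.norm(b)
    for k in range(1, max_iter + 1):
        rho_new = np.vdot(r_hat, r)  # vdot flattens the 3D arrays
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = apply_a(p)  # first stencil application of the iteration
        alpha = rho_new / np.vdot(r_hat, v)
        s = r - alpha * v
        if np.linalg.norm(s) <= tol * b_norm:
            return x + alpha * p, k  # converged after the half step
        t = apply_a(s)  # second stencil application of the iteration
        omega = np.vdot(t, s) / np.vdot(t, t)
        x = x + alpha * p + omega * s
        r = s - omega * t
        rho = rho_new
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
    return x, max_iter


# Small demonstration on an assumed 12 x 10 x 16 mesh with a random right-hand side.
rng = np.random.default_rng(0)
b = rng.standard_normal((12, 10, 16))
x, iters = bicgstab(apply_stencil, b)
residual = np.linalg.norm(b - apply_stencil(x)) / np.linalg.norm(b)
print(f"iterations: {iters}, relative residual: {residual:.2e}")
```

Each iteration performs two stencil applications, a handful of vector updates, and several global dot products. The stencil applications touch only nearest neighbours while the dot products require a reduction over the whole mesh, which is why the abstract stresses both memory bandwidth and communication latency.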
