sputniPIC: An Implicit Particle-in-Cell Code for Multi-GPU Systems
Steven W. D. Chien, Jonas Nylund, Gabriel Bengtsson, I. Peng, Artur Podobas, S. Markidis
2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), published 2020-08-10
DOI: 10.1109/SBAC-PAD49847.2020.00030 (https://doi.org/10.1109/SBAC-PAD49847.2020.00030)
Citations: 7
Abstract
Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have been used successfully to simulate numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. Exploiting such accelerated platforms requires new algorithm design and implementation for PIC codes. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. In contrast to the domain decomposition used in CPU-based implementations, we introduce a particle decomposition data layout that processes particles in batches so that communication and computation overlap on the GPUs. sputniPIC also natively supports different precision representations to achieve a speedup on hardware that supports reduced precision. We validate sputniPIC with the well-known GEM challenge and provide a performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement over the CPU OpenMP version of sputniPIC. We show that reduced precision can further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, sputniPIC enables, on a single node with multiple GPUs, large-scale three-dimensional PIC simulations that were previously possible only on clusters.
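To make the two ideas in the abstract concrete (particle decomposition into batches, and a selectable precision), here is a minimal host-side C++ sketch. It is not taken from sputniPIC: the names `Particles`, `push_batch`, and `push_in_batches` are hypothetical, the mover is a simplified explicit push in a uniform electric field rather than the implicit mover of the paper, and the loop runs serially on the CPU. On a GPU, each batch would typically correspond to an asynchronous host-to-device copy plus a kernel launch on its own stream, which is what lets transfers and computation overlap.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays particle container.
// The template parameter Real selects the precision (e.g. float or double).
template <typename Real>
struct Particles {
    std::vector<Real> x;  // positions
    std::vector<Real> v;  // velocities
};

// Simplified explicit push in a uniform field E for particles [begin, end).
// Assumption: this stands in for the real (implicit) particle mover.
template <typename Real>
void push_batch(Particles<Real>& p, std::size_t begin, std::size_t end,
                Real qom, Real E, Real dt) {
    for (std::size_t i = begin; i < end; ++i) {
        p.v[i] += qom * E * dt;  // accelerate by charge-to-mass ratio times field
        p.x[i] += p.v[i] * dt;   // advance position with the updated velocity
    }
}

// Particle decomposition: work is split by particle index range,
// not by spatial subdomain. Returns the number of batches processed.
template <typename Real>
std::size_t push_in_batches(Particles<Real>& p, std::size_t batch_size,
                            Real qom, Real E, Real dt) {
    if (batch_size == 0) return 0;  // avoid an endless loop on a zero batch size
    std::size_t n_batches = 0;
    for (std::size_t begin = 0; begin < p.x.size(); begin += batch_size) {
        const std::size_t end = std::min(begin + batch_size, p.x.size());
        push_batch(p, begin, end, qom, E, dt);
        ++n_batches;
    }
    return n_batches;
}
```

Switching between `Particles<float>` and `Particles<double>` changes the precision without touching the mover, which is one simple way a code can expose reduced precision where the hardware benefits from it.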