GPU-enabled extreme-scale turbulence simulations: Fourier pseudo-spectral algorithms at the exascale using OpenMP offloading

Computer Physics Communications | IF 7.2 | JCR Q1, Computer Science, Interdisciplinary Applications | CAS Tier 2, Physics and Astronomy | Published: 2024-09-05 | DOI: 10.1016/j.cpc.2024.109364
P.K. Yeung, Kiran Ravikumar, Stephen Nichols, Rohini Uma-Vaideswaran
{"title":"GPU-enabled extreme-scale turbulence simulations: Fourier pseudo-spectral algorithms at the exascale using OpenMP offloading","authors":"P.K. Yeung ,&nbsp;Kiran Ravikumar ,&nbsp;Stephen Nichols ,&nbsp;Rohini Uma-Vaideswaran","doi":"10.1016/j.cpc.2024.109364","DOIUrl":null,"url":null,"abstract":"<div><p>Fourier pseudo-spectral methods for nonlinear partial differential equations are of wide interest in many areas of advanced computational science, including direct numerical simulation of three-dimensional (3-D) turbulence governed by the Navier-Stokes equations in fluid dynamics. This paper presents a new capability for simulating turbulence at a new record resolution up to 35 trillion grid points, on the world's first exascale computer, <em>Frontier</em>, comprising AMD MI250x GPUs with HPE's Slingshot interconnect and operated by the US Department of Energy's Oak Ridge Leadership Computing Facility (OLCF). Key programming strategies designed to take maximum advantage of the machine architecture involve performing almost all computations on the GPU which has the same memory capacity as the CPU, performing all-to-all communication among sets of parallel processes directly on the GPU, and targeting GPUs efficiently using OpenMP offloading for intensive number-crunching including 1-D Fast Fourier Transforms (FFT) performed using AMD ROCm library calls. With 99% of computing power on Frontier being on the GPU, leaving the CPU idle leads to a net performance gain via avoiding the overhead of data movement between host and device except when needed for some I/O purposes. Memory footprint including the size of communication buffers for MPI_ALLTOALL is managed carefully to maximize the largest problem size possible for a given node count.</p><p>Detailed performance data including separate contributions from different categories of operations to the elapsed wall time per step are reported for five grid resolutions, from 2048<sup>3</sup> on a single node to 32768<sup>3</sup> on 4096 or 8192 nodes out of 9408 on the system. Both 1D and 2D domain decompositions which divide a 3D periodic domain into slabs and pencils respectively are implemented. The present code suite (labeled by the acronym GESTS, GPUs for Extreme Scale Turbulence Simulations) achieves a figure of merit (in grid points per second) exceeding goals set in the Center for Accelerated Application Readiness (CAAR) program for Frontier. The performance attained is highly favorable in both weak scaling and strong scaling, with notable departures only for 2048<sup>3</sup> where communication is entirely intra-node, and for 32768<sup>3</sup>, where a challenge due to small message sizes does arise. Communication performance is addressed further using a lightweight test code that performs all-to-all communication in a manner matching the full turbulence simulation code. Performance at large problem sizes is affected by both small message size due to high node counts as well as dragonfly network topology features on the machine, but is consistent with official expectations of sustained performance on Frontier. Overall, although not perfect, the scalability achieved at the extreme problem size of 32768<sup>3</sup> (and up to 8192 nodes — which corresponds to hardware rated at just under 1 exaflop/sec of theoretical peak computational performance) is arguably better than the scalability observed using prior state-of-the-art algorithms on Frontier's predecessor machine (<em>Summit</em>) at OLCF. 
New science results for the study of intermittency in turbulence enabled by this code and its extensions are to be reported separately in the near future.</p></div>","PeriodicalId":285,"journal":{"name":"Computer Physics Communications","volume":"306 ","pages":"Article 109364"},"PeriodicalIF":7.2000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Physics Communications","FirstCategoryId":"101","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S001046552400287X","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

Fourier pseudo-spectral methods for nonlinear partial differential equations are of wide interest in many areas of advanced computational science, including direct numerical simulation of three-dimensional (3-D) turbulence governed by the Navier-Stokes equations in fluid dynamics. This paper presents a new capability for simulating turbulence at a record resolution of up to 35 trillion grid points on the world's first exascale computer, Frontier, which comprises AMD MI250X GPUs connected by HPE's Slingshot interconnect and is operated by the US Department of Energy's Oak Ridge Leadership Computing Facility (OLCF). Key programming strategies designed to take maximum advantage of the machine architecture include performing almost all computations on the GPU, which has the same memory capacity as the CPU; performing all-to-all communication among sets of parallel processes directly on the GPU; and targeting the GPUs efficiently via OpenMP offloading for intensive number-crunching, including 1-D Fast Fourier Transforms (FFTs) performed using AMD ROCm library calls. With 99% of the computing power on Frontier residing on the GPUs, leaving the CPU idle yields a net performance gain by avoiding the overhead of data movement between host and device, except when needed for some I/O purposes. The memory footprint, including the size of communication buffers for MPI_ALLTOALL, is managed carefully to allow the largest possible problem size for a given node count.
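To make the GPU-resident strategy concrete, the following is a minimal C sketch, not taken from the GESTS code: it assumes an OpenMP 4.5+ offload compiler and a GPU-aware MPI library that accepts device buffers, and the function name transform_step, the array sizes, and the placeholder arithmetic are all hypothetical. It illustrates the pattern described above: map the data to the device once, do the number-crunching in offloaded loops, and hand device pointers straight to MPI_Alltoall so no host staging copy is needed.

```c
#include <mpi.h>
#include <complex.h>
#include <stddef.h>

/* Sketch only: n_local complex values live on this rank, and nprocs is
 * assumed to divide n_local evenly. */
void transform_step(double complex *u, double complex *work,
                    size_t n_local, int nprocs, MPI_Comm comm)
{
    /* Keep both arrays resident on the GPU for the whole step, so the CPU
     * never touches the data between kernels and communication. */
    #pragma omp target data map(tofrom: u[0:n_local]) map(alloc: work[0:n_local])
    {
        /* Intensive number-crunching offloaded with OpenMP target directives. */
        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n_local; i++)
            u[i] *= 2.0;   /* placeholder for the actual pseudo-spectral arithmetic */

        /* (A batched 1-D FFT library call operating on the device buffer,
         *  e.g. from the vendor FFT library, would go here.) */

        /* All-to-all directly between device buffers: with GPU-aware MPI the
         * device addresses of u and work are passed straight to MPI_Alltoall. */
        #pragma omp target data use_device_ptr(u, work)
        {
            int count = (int)(n_local / nprocs);
            MPI_Alltoall(u,    count, MPI_C_DOUBLE_COMPLEX,
                         work, count, MPI_C_DOUBLE_COMPLEX, comm);
        }
    }
}
```

The key design point mirrored here is that the communication buffers never leave device memory, which is what makes it feasible to leave the CPU idle without paying for host-device transfers on every transpose.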

Detailed performance data, including separate contributions from different categories of operations to the elapsed wall time per step, are reported for five grid resolutions, from 2048³ on a single node to 32768³ on 4096 or 8192 nodes out of 9408 on the system. Both 1-D and 2-D domain decompositions, which divide the 3-D periodic domain into slabs and pencils respectively, are implemented. The present code suite (labeled by the acronym GESTS, GPUs for Extreme Scale Turbulence Simulations) achieves a figure of merit (in grid points per second) exceeding the goals set in the Center for Accelerated Application Readiness (CAAR) program for Frontier. The performance attained is highly favorable in both weak scaling and strong scaling, with notable departures only for 2048³, where communication is entirely intra-node, and for 32768³, where a challenge due to small message sizes does arise. Communication performance is addressed further using a lightweight test code that performs all-to-all communication in a manner matching the full turbulence simulation code. Performance at large problem sizes is affected both by the small message sizes arising at high node counts and by features of the machine's dragonfly network topology, but is consistent with official expectations of sustained performance on Frontier. Overall, although not perfect, the scalability achieved at the extreme problem size of 32768³ (and up to 8192 nodes, corresponding to hardware rated at just under 1 exaflop/s of theoretical peak computational performance) is arguably better than the scalability observed using prior state-of-the-art algorithms on Frontier's predecessor machine (Summit) at OLCF. New science results for the study of intermittency in turbulence enabled by this code and its extensions are to be reported separately in the near future.
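The abstract does not list the lightweight all-to-all test code, so the sketch below is only a plausible shape for such a harness under stated assumptions: a 2-D process grid (the shape nrow = 4 and the per-partner message size are arbitrary choices here) whose row and column sub-communicators mirror the transposes of a pencil decomposition, with MPI_Alltoall timed over repeated calls.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Assumed 2-D process grid: nrow * ncol == nproc (caller's responsibility). */
    int nrow = 4;
    int ncol = nproc / nrow;

    /* Row and column sub-communicators, as a pencil decomposition would use;
     * only the row all-to-all is timed here for brevity. */
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank / ncol, rank, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, rank % ncol, rank, &col_comm);

    int count = 1 << 16;   /* doubles sent to each partner (assumed message size) */
    double *sendbuf = malloc((size_t)count * ncol * sizeof(double));
    double *recvbuf = malloc((size_t)count * ncol * sizeof(double));
    for (int i = 0; i < count * ncol; i++) sendbuf[i] = (double)rank;

    const int niter = 10;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int iter = 0; iter < niter; iter++)
        MPI_Alltoall(sendbuf, count, MPI_DOUBLE,
                     recvbuf, count, MPI_DOUBLE, row_comm);
    double t = (MPI_Wtime() - t0) / niter;

    if (rank == 0)
        printf("row all-to-all over %d ranks: %.3e s per call\n", ncol, t);

    free(sendbuf);
    free(recvbuf);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```

In such a harness, holding the global grid fixed while increasing the node count shrinks the per-partner message, which is the small-message regime the abstract identifies as the main challenge at 32768³ on thousands of nodes.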

Source journal:
Computer Physics Communications (Physics / Computer Science: Interdisciplinary Applications)
CiteScore: 12.10
Self-citation rate: 3.20%
Articles published: 287
Review time: 5.3 months
Journal description:
The focus of CPC is on contemporary computational methods and techniques and their implementation, the effectiveness of which will normally be evidenced by the author(s) within the context of a substantive problem in physics. Within this setting CPC publishes two types of paper.

Computer Programs in Physics (CPiP): These papers describe significant computer programs to be archived in the CPC Program Library, which is held in the Mendeley Data repository. The submitted software must be covered by an approved open source licence. Papers and associated computer programs that address a problem of contemporary interest in physics that cannot be solved by current software are particularly encouraged.

Computational Physics Papers (CP): These are research papers in, but not limited to, the following themes across computational physics and related disciplines: mathematical and numerical methods and algorithms; computational models, including those associated with the design, control and analysis of experiments; and algebraic computation. Each will normally include software implementation and performance details. The software implementation should, ideally, be available via GitHub, Zenodo or an institutional repository. In addition, research papers on the impact of advanced computer architecture and special purpose computers on computing in the physical sciences, and software topics related to, and of importance in, the physical sciences, may be considered.
Latest articles from this journal:
An improved version of PyWolf with multithread-based parallelism support
A new way to use nonlocal symmetries to determine first integrals of second-order nonlinear ordinary differential equations
An algorithm for the incorporation of relevant FVM boundary conditions in the Eulerian SPH framework
On-the-fly clustering for exascale molecular dynamics simulations
Implementation of magnetic compressional effects at arbitrary wavelength in the global version of GENE