An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX
Romain Pereira, A. Roussel, Miwako Tsuji, Patrick Carribault, Mitsuhisa Sato, Hitoshi Murai, Thierry Gautier
The adoption of ARM processor architectures is on the rise in the HPC ecosystem. The Fugaku supercomputer is a homogeneous ARM-based machine and one of the most powerful in the world. On the programming side, dependent task-based programming models are gaining traction due to their many advantages: dynamic load balancing, implicit expression of communication/computation overlap, early-bird communication posting, and more. MPI and OpenMP are two widespread programming standards that make task-based programming possible at the distributed-memory level. Despite its many advantages, mixed use of these standards with dependent tasks is still under-evaluated on large-scale machines. In this paper, we provide an overview of mixing the OpenMP dependent tasking model with MPI using state-of-the-art software stacks (GCC 13, Clang 17, MPC-OMP). We show the level of performance to expect when porting applications to such a mixed use of the standards on the Fugaku supercomputer, using two benchmarks (Cholesky, HPCCG) and a proxy application (LULESH). We show that the software stack, resource binding, and communication progression mechanisms all have a significant impact on performance. On distributed applications, performance reaches up to 80% efficiency for task-based applications like HPCCG. We also point out a few areas of improvement in OpenMP runtimes.
{"title":"An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX","authors":"Romain Pereira, A. Roussel, Miwako Tsuji, Patrick Carribault, Mitsuhisa Sato, Hitoshi Murai, Thierry Gautier","doi":"10.1145/3636480.3637094","DOIUrl":"https://doi.org/10.1145/3636480.3637094","url":null,"abstract":"The adoption of ARM processor architectures is on the rise in the HPC ecosystem. Fugaku supercomputer is a homogeneous ARM-based machine, and is one among the most powerful machine in the world. In the programming world, dependent task-based programming models are gaining tractions due to their many advantages: dynamic load balancing, implicit expression of communication/computation overlap, early-bird communication posting,...MPI and OpenMP are two widespreads programming standards that make possible task-based programming at a distributed memory level. Despite its many advantages, mixed-use of the standard programming models using dependent tasks is still under-evaluated on large-scale machines. In this paper, we provide an overview on mixing OpenMP dependent tasking model with MPI with the state-of-the-art software stack (GCC-13, Clang17, MPC-OMP). We provide the level of performances to expect by porting applications to such mixed-use of the standard on the Fugaku supercomputers, using two benchmarks (Cholesky, HPCCG) and a proxy-application (LULESH). We show that software stack, resource binding and communication progression mechanisms are factors that have a significant impact on performance. On distributed applications, performances reaches up to 80% of effiency for task-based applications like HPCCG. We also point-out a few areas of improvements in OpenMP runtimes.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"4 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Multi-Physics Coupled Simulation of a Midrex Blast Furnace
Xavier Besseron, P. Adhav, Bernhard Peters
Traditional steelmaking is a major source of carbon dioxide emissions, but green steel production offers a sustainable alternative. Green steel is produced using hydrogen as a reducing agent instead of carbon monoxide, which results in only water vapour as a by-product. Midrex is a well-established technology that plays a crucial role in the green steel supply chain by producing direct reduced iron (DRI), a more environmentally friendly alternative to traditional iron production methods. In this work, we model a Midrex blast furnace and propose a parallel multi-physics simulation tool based on the coupling between the Discrete Element Method (DEM) and Computational Fluid Dynamics (CFD). The particulate phase is simulated with XDEM (parallelized with MPI+OpenMP), the fluid phase is solved by OpenFOAM (parallelized with MPI), and the two solvers are coupled together using the preCICE library. We perform a careful performance analysis that focuses first on each solver individually and then on the coupled application. Our results highlight the difficulty of distributing the computing resources appropriately between the solvers in order to achieve the best performance. Finally, our multi-physics coupled implementation runs in parallel on 1024 cores and can simulate 500 seconds of the Midrex blast furnace in 1 hour and 45 minutes. This work identifies the challenge of load balancing coupled solvers and takes a step toward the simulation of a complete 3D blast furnace on High-Performance Computing platforms.
{"title":"Parallel Multi-Physics Coupled Simulation of a Midrex Blast Furnace","authors":"Xavier Besseron, P. Adhav, Bernhard Peters","doi":"10.1145/3636480.3636484","DOIUrl":"https://doi.org/10.1145/3636480.3636484","url":null,"abstract":"Traditional steelmaking is a major source of carbon dioxide emissions, but green steel production offers a sustainable alternative. Green steel is produced using hydrogen as a reducing agent instead of carbon monoxide, which results in only water vapour as a by-product. Midrex is a well-established technology that plays a crucial role in the green steel supply chain by producing direct reduced iron (DRI), a more environmentally friendly alternative to traditional iron production methods. In this work, we model a Midrex blast furnace and propose a parallel multi-physics simulation tool based on the coupling between Discrete Element Method (DEM) and Computational Fluid Dynamics (CFD). The particulate phase is simulated with XDEM (parallelized with MPI+OpenMP), the fluid phase is solved by OpenFOAM (parallelized with MPI), and the two solvers are coupled together using the preCICE library. We perform a careful performance analysis that focuses first on each solver individually and then on the coupled application. Our results highlight the difficulty of distributing the computing resources appropriately between the solvers in order to achieve the best performance. Finally, our multi-physics coupled implementation runs in parallel on 1024 cores and can simulate 500 seconds of the Midrex blast furnace in 1 hour and 45 minutes. This work identifies the challenge related to the load balancing of coupled solvers and makes a step forward towards the simulation of a complete 3D blast furnace on High-Performance Computing platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"25 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads
N. Simakov, Matthew D. Jones, T. Furlani, E. Siegmann, Robert Harrison
Engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip were tested using different benchmarks and scientific applications. The benchmarks include HPCC and HPCG. The real-application-based benchmarks include AI-Benchmark-Alpha (a TensorFlow benchmark), Gromacs, OpenFOAM, and ROMS. The performance was compared to multiple Intel, AMD, and ARM CPUs and to several x86 systems with NVIDIA GPUs. A brief energy-efficiency estimate was performed based on TDP values. We found that in the HPCC benchmark tests, the per-core performance of Grace is similar to or faster than AMD Milan cores, and the high core count often allows the NVIDIA Grace CPU Superchip to match the per-node performance of Intel Sapphire Rapids with High Bandwidth Memory: slower in matrix multiplication (by 17%) and FFT (by 6%), faster in Linpack (by 9%). In scientific applications, the NVIDIA Grace CPU Superchip is slower by 6% to 18% in Gromacs, faster by 7% in OpenFOAM, and falls between the HBM and DDR modes of Intel Sapphire Rapids in ROMS. The combined CPU-GPU performance in Gromacs is significantly faster (by 20% to 117%) than any tested x86 system with NVIDIA GPUs. Overall, the new NVIDIA Grace Hopper Superchip and NVIDIA Grace CPU Superchip are high-performance and most likely energy-efficient solutions for HPC centers.
{"title":"First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads","authors":"N. Simakov, Matthew D. Jones, T. Furlani, E. Siegmann, Robert Harrison","doi":"10.1145/3636480.3637097","DOIUrl":"https://doi.org/10.1145/3636480.3637097","url":null,"abstract":"The engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchips were tested using different benchmarks and scientific applications. The benchmarks include HPCC and HPCG. The real application-based benchmark includes AI-Benchmark-Alpha (a TensorFlow benchmark), Gromacs, OpenFOAM, and ROMS. The performance was compared to multiple Intel, AMD, ARM CPUs and several x86 with NVIDIA GPU systems. A brief energy efficiency estimate was performed based on TDP values. We found that in HPCC benchmark tests, the per-core performance of Grace is similar to or faster than AMD Milan cores, and the high core count often allows NVIDIA Grace CPU Superchip to have per-node performance similar to Intel Sapphire Rapids with High Bandwidth Memory: slower in matrix multiplication (by 17%) and FFT (by 6%), faster in Linpack (by 9%)). In scientific applications, the NVIDIA Grace CPU Superchip performance is slower by 6% to 18% in Gromacs, faster by 7% in OpenFOAM, and right between HBM and DDR modes of Intel Sapphire Rapids in ROMS. The combined CPU-GPU performance in Gromacs is significantly faster (by 20% to 117% faster) than any tested x86-NVIDIA GPU system. Overall, the new NVIDIA Grace Hopper Superchip and NVIDIA Grace CPU Superchip Superchip are high-performance and most likely energy-efficient solutions for HPC centers.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"13 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimize Efficiency of Utilizing Systems by Dynamic Core Binding
Masatoshi Kawai, Akihiro Ida, Toshihiro Hanawa, Tetsuya Hoshino
Load balancing at both the process and thread levels is imperative for minimizing application computation time in the context of MPI/OpenMP hybrid parallelization. This necessity arises from the constraint that, in a typical hybrid parallel environment, an identical number of cores is bound to each process. Dynamic Core Binding (DCB), by contrast, adjusts the core binding based on each process's workload, thereby realizing load balancing at the core level. In prior research, we implemented the DCB library, which provides two policies: one for computation-time reduction and one for power reduction. In this paper, we show that the two policies provided by the DCB library can be used together to achieve both computation-time reduction and power-consumption reduction.
{"title":"Optimize Efficiency of Utilizing Systems by Dynamic Core Binding","authors":"Masatoshi Kawai, Akihiro Ida, Toshihiro Hanawa, Tetsuya Hoshino","doi":"10.1145/3636480.3637221","DOIUrl":"https://doi.org/10.1145/3636480.3637221","url":null,"abstract":"Load balancing at both the process and thread levels is imperative for minimizing application computation time in the context of MPI/OpenMP hybrid parallelization. This necessity arises from the constraint that, within a typical hybrid parallel environment, an identical number of cores is bound to each process. Dynamic Core Binding, however, adjusts the core binding based on the process’s workload, thereby realizing load balancing at the core level. In prior research, we have implemented the DCB library, which has two policies for computation time reduction or power reduction. In this paper, we show that the two policies provided by the DCB library can be used together to achieve both computation time reduction and power consumption reduction.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"8 13","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introducing software pipelining for the A64FX processor into LLVM
Masaki Arai, Naoto Fukumoto, Hitoshi Murai
Software pipelining is an essential optimization for accelerating High-Performance Computing (HPC) applications on CPUs. Modern CPUs achieve high performance through many cores and wide SIMD instructions. Software pipelining promotes further performance improvement of HPC applications by cooperating with these features. Although open-source compilers such as GCC and LLVM have implemented software pipelining, it is underutilized for the AArch64 architecture. We have implemented software pipelining for the A64FX processor in LLVM to improve this situation. This paper describes the details of this implementation. We also confirmed that our implementation improves the performance of several benchmark programs.
{"title":"Introducing software pipelining for the A64FX processor into LLVM","authors":"Masaki Arai, Naoto Fukumoto, Hitoshi Murai","doi":"10.1145/3636480.3637093","DOIUrl":"https://doi.org/10.1145/3636480.3637093","url":null,"abstract":"Software pipelining is an essential optimization for accelerating High-Performance Computing(HPC) applications on CPUs. Modern CPUs achieve high performance through many-core and wide SIMD instructions. Software pipelining is an optimization that promotes further performance improvement of HPC applications by cooperating with these functions. Although open source compilers such as GCC and LLVM have implemented software pipelining, it is underutilized for the AArch64 architecture. We have implemented software pipelining for the A64FX processor on LLVM to improve this situation. This paper describes the details of this implementation. We also confirmed that our implementation improves the performance of several benchmark programs.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"9 35","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139437643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-throughput drug discovery on the Fujitsu A64FX architecture
Filippo Barbari, F. Ficarelli, Daniele Cesarini
High-performance computational kernels that optimally exploit modern vector-capable processors are critical for running large-scale drug discovery campaigns efficiently and promptly, within the constraints posed by urgent computing needs. Yet state-of-the-art virtual screening workflows focus either on the breadth of features provided to the drug researcher or on performance on high-throughput accelerators, leaving the task of deploying efficient CPU kernels to the compiler. We ported the key parts of the LiGen drug discovery pipeline, based on molecular docking, to the Fujitsu A64FX platform and leveraged its vector processing capabilities via an industry-proven retargetable SIMD programming model. By rethinking and optimizing key geometric docking algorithms to leverage SVE instructions, we are able to provide efficient, high-throughput execution on SVE-capable platforms.
{"title":"High-throughput drug discovery on the Fujitsu A64FX architecture","authors":"Filippo Barbari, F. Ficarelli, Daniele Cesarini","doi":"10.1145/3636480.3637095","DOIUrl":"https://doi.org/10.1145/3636480.3637095","url":null,"abstract":"High-performance computational kernels that optimally exploit modern vector-capable processors are critical in running large-scale drug discovery campaigns efficiently and promptly compatible with the constraints posed by urgent computing needs. Yet, state-of-the-art virtual screening workflows focus either on the broadness of features provided to the drug researcher or performance on high-throughput accelerators, leaving the task of deploying efficient CPU kernels to the compiler. We ported the key parts of the LiGen drug discovery pipeline, based on molecular docking, to the Fujitsu A64FX platform and leveraged its vector processing capabilities via an industry-proven retargetable SIMD programming model. By rethinking and optimizing key geometrical docking algorithms to leverage SVE instructions, we are able to provide efficient, high throughput execution on SVE-capable platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"4 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Implementation of Gas-liquid Two-phase Flow Simulations with Surfactant Transport Based on GPU Computing and Adaptive Mesh Refinement
Tongda Lian, Shintaro Matsushita, Takayuki Aoki
We propose an implementation for surfactant transport simulations in gas-liquid two-phase flows. It employs a tree-based, interface-adapted adaptive mesh refinement (AMR) method that assigns a high-resolution mesh around the interface region, significantly reducing computational resources such as memory and execution time. We developed GPU code in the CUDA programming language for the AMR method to further enhance performance through GPU parallel computing. The piecewise linear interface calculation (PLIC) method, an interface-capturing approach for two-phase flows, is implemented on top of the tree-based AMR method and GPU computing. We adopted the height function (HF) method, likewise implemented on the AMR mesh, to calculate interface curvature for the surface tension evaluation and to suppress spurious currents. We incorporated the Langmuir model to describe surfactant transport, as well as surfactant adsorption and desorption at the gas-liquid interface. Our implementation was applied to simulate a two-dimensional process in which a bubble rises freely to the liquid surface, forms a thin liquid film, and eventually causes the film's rupture. This simulation confirmed a reduction in the number of mesh grids required with our proposed implementation.
{"title":"The Implementation of Gas-liquid Two-phase Flow Simulations with Surfactant Transport Based on GPU Computing and Adaptive Mesh Refinement","authors":"Tongda Lian, Shintaro Matsushita, Takayuki Aoki","doi":"10.1145/3636480.3636485","DOIUrl":"https://doi.org/10.1145/3636480.3636485","url":null,"abstract":"We proposed an implementation for surfactant transport simulations in gas-liquid two-phase flows. This implementation employs a tree-based interface-adapted adaptive mesh refinement (AMR) method, assigning a high-resolution mesh around the interface region, significantly reducing computational resources, such as memory and execution time. We developed GPU code by CUDA programming language for the AMR method to further enhance performance through GPU parallel computing. The piece-wise linear interface calculation (PLIC) method, an interface-capturing approach for two-phase flows, is implemented based on the tree-based AMR method and GPU computing. We adopted the height function (HF) method to calculate interface curvature for surface tension assessment to suppress the spurious currents, and implemented it on the AMR mesh as well. We incorporated the Langmuir model to describe surfactant transport, as well as surfactant adsorption and desorption at the gas-liquid interface. Our implementation was applied to simulate a two-dimensional process where a bubble freely rises to the liquid surface, forms a thin liquid film, and eventually results in the film’s rupture. This simulation confirmed a reduction in the number of mesh grids required with our proposed implementations.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Evaluation of the Fourth-Generation Xeon with Different Memory Characteristics
Keiichiro Fukazawa, Riki Takahashi
The supercomputer system of the Academic Center for Computing and Media Studies, Kyoto University employs the fourth-generation Xeon (code-named Sapphire Rapids). The system consists of two subsystems: one equipped solely with high-bandwidth memory (HBM2e) and the other with a large DDR5 memory capacity. Using benchmark applications, we conducted a performance evaluation of the systems with each type of memory. Additionally, the study employed a real application, an electromagnetic fluid code, to investigate how application performance varies with differences in memory characteristics. The results confirm the performance improvement due to the high bandwidth of HBM2e. However, it was also observed that efficiency is lower when using HBM2e, and the effects of cache memory optimization are relatively minimal.
{"title":"Performance Evaluation of the Fourth-Generation Xeon with Different Memory Characteristics","authors":"Keiichiro Fukazawa, Riki Takahashi","doi":"10.1145/3636480.3637218","DOIUrl":"https://doi.org/10.1145/3636480.3637218","url":null,"abstract":"At the Supercomputer System of Academic Center for Computing and Media Studies Kyoto University, the fourth-generation Xeon (code-named Sapphire Rapids) is employed. The system consists of two subsystems—one equipped solely with high-bandwidth memory, HBM2e, and the other with a large DDR5 memory capacity. Using benchmark applications, a performance evaluation of systems with each type of memory was conducted. Additionally, the study employed a real application, the electromagnetic fluid code, to investigate how application performance varies based on differences in memory characteristics. The results confirm the performance improvement due to the high bandwidth of HBM2e. However, it was also observed that the efficiency is lower when using HBM2e, and the effects of cache memory optimization are relatively minimal.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"10 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Impact of Write-Allocate Elimination on Fujitsu A64FX
Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez
ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. Deployment of the 48-core Fujitsu A64FX ARM-architecture-based processor in the RIKEN "Fugaku" supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM into mainstream HPC. A key design criterion of the Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point-intensive science workloads. One of the mechanisms to enhance throughput concerns write-allocate operations (e.g., streaming write operations), which are quite common in science applications. In particular, eliminating write allocation (allocating a cache line on a write miss) through a special "zero fill" instruction available on ARM CPU architectures can improve overall memory bandwidth by avoiding the memory read into a cache line, which is unnecessary since the cache line will subsequently be overwritten. While the bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to consider irregular memory-access-driven scenarios such as graph analytics and to analyze the impact of write-allocate elimination on diverse data-driven applications. In this paper, we examine the impact of "zero fill" on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth-First Search, the GAP benchmark suite, and Louvain graph clustering) and five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system at Stony Brook University. Our results indicate that facilitating "zero fill" through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across compilers and input data, and note end-to-end improvements of 5-20% for the benchmarks and a diverse spectrum of application scenarios owing to "zero fill"-related adaptations.
{"title":"Impact of Write-Allocate Elimination on Fujitsu A64FX","authors":"Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez","doi":"10.1145/3636480.3637283","DOIUrl":"https://doi.org/10.1145/3636480.3637283","url":null,"abstract":"ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. Deployment of the 48-core Fujitsu A64FX ARM architecture based processor in RIKEN “Fugaku” supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM to mainstream HPC. A key design criteria of Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, which happens to be a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point intensive science workloads. One of the mechanisms to enhance the throughput concerns write-allocate operations (e.g., streaming write operations), which are quite common in science applications. In particular, eliminating write-allocate operations (allocate cache line on a write miss) through a special “zero fill” instruction available on the ARM CPU architectures can improve the overall memory bandwidth, by avoiding the memory read into a cache line, which is unnecessary since the cache line will be written consequently. While bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to consider irregular memory-access driven scenarios such as graph analytics, and analyze the impact of write-allocate elimination on diverse data-driven applications. In this paper, we examine the impact of “zero fill” on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth First Search, GAP benchmark suite, and Louvain Graph Clustering) and five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system from Stony Brook University. Our results indicate that facilitating “zero fill” through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across the compilers and input data, and note end-to-end improvements between 5–20% for the benchmarks and diverse spectrum of application scenarios owing to “zero fill” related adaptations.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"5 12","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HPCnix: make HPC Apps more easier like shell script
Minoru Kanatsu, Hiroshi Yamada
In the area of high-performance computing (HPC), extreme computing performance is expected to be extracted through highly optimized frameworks, without even the common OS APIs and frameworks available on personal desktops. However, this makes development costs higher than for normal application development and is challenging for beginners. The demand for large-scale computation is increasing due to the growth of cloud computing environments and the AI boom resulting from deep learning and large language models. Therefore, a framework that makes HPC application programming easier to handle is needed. This study presents a concept model that makes it possible to write HPC applications using semantics like the Unix shell command pipeline, and proposes a simple application framework for HPC beginners called HPCnix.
{"title":"HPCnix: make HPC Apps more easier like shell script","authors":"Minoru Kanatsu, Hiroshi Yamada","doi":"10.1145/3636480.3637231","DOIUrl":"https://doi.org/10.1145/3636480.3637231","url":null,"abstract":"In the area of high-performance computing (HPC), it is expected to extract extreme computing performance using a highly optimized framework without even common OS APIs and frameworks for personal desktops. However, this makes the development cost higher than normal application development and challenging for beginners. The demand for large-scale computation is increasing due to the growth of cloud computing environments and the AI boom resulting from deep learning and large-scale language models. Therefore, a framework that makes it easier to handle HPC application programming is needed. This study shows a concept model that makes it possible to write HPC applications using semantics like the shell command pipeline in Unix. It proposes a simple application framework for beginners in HPC called HPCnix.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"1 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}