An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX
Romain Pereira, A. Roussel, Miwako Tsuji, Patrick Carribault, Mitsuhisa Sato, Hitoshi Murai, Thierry Gautier
The adoption of ARM processor architectures is on the rise in the HPC ecosystem. The Fugaku supercomputer is a homogeneous ARM-based machine and one of the most powerful in the world. On the programming side, dependent task-based programming models are gaining traction due to their many advantages: dynamic load balancing, implicit expression of communication/computation overlap, early-bird communication posting, and more. MPI and OpenMP are two widespread programming standards that make task-based programming possible at the distributed-memory level. Despite its many advantages, mixed use of these standards with dependent tasks is still under-evaluated on large-scale machines. In this paper, we provide an overview of mixing the OpenMP dependent tasking model with MPI using state-of-the-art software stacks (GCC 13, Clang 17, MPC-OMP). We show the level of performance to expect when porting applications to such a mixed use of the standards on the Fugaku supercomputer, using two benchmarks (Cholesky, HPCCG) and a proxy application (LULESH). We show that the software stack, resource binding, and communication progression mechanisms all have a significant impact on performance. On distributed applications, performance reaches up to 80% efficiency for task-based applications like HPCCG. We also point out a few areas of improvement in OpenMP runtimes.
{"title":"An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX","authors":"Romain Pereira, A. Roussel, Miwako Tsuji, Patrick Carribault, Mitsuhisa Sato, Hitoshi Murai, Thierry Gautier","doi":"10.1145/3636480.3637094","DOIUrl":"https://doi.org/10.1145/3636480.3637094","url":null,"abstract":"The adoption of ARM processor architectures is on the rise in the HPC ecosystem. Fugaku supercomputer is a homogeneous ARM-based machine, and is one among the most powerful machine in the world. In the programming world, dependent task-based programming models are gaining tractions due to their many advantages: dynamic load balancing, implicit expression of communication/computation overlap, early-bird communication posting,...MPI and OpenMP are two widespreads programming standards that make possible task-based programming at a distributed memory level. Despite its many advantages, mixed-use of the standard programming models using dependent tasks is still under-evaluated on large-scale machines. In this paper, we provide an overview on mixing OpenMP dependent tasking model with MPI with the state-of-the-art software stack (GCC-13, Clang17, MPC-OMP). We provide the level of performances to expect by porting applications to such mixed-use of the standard on the Fugaku supercomputers, using two benchmarks (Cholesky, HPCCG) and a proxy-application (LULESH). We show that software stack, resource binding and communication progression mechanisms are factors that have a significant impact on performance. On distributed applications, performances reaches up to 80% of effiency for task-based applications like HPCCG. We also point-out a few areas of improvements in OpenMP runtimes.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"4 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Multi-Physics Coupled Simulation of a Midrex Blast Furnace
Xavier Besseron, P. Adhav, Bernhard Peters
Traditional steelmaking is a major source of carbon dioxide emissions, but green steel production offers a sustainable alternative. Green steel is produced using hydrogen as a reducing agent instead of carbon monoxide, which results in only water vapour as a by-product. Midrex is a well-established technology that plays a crucial role in the green steel supply chain by producing direct reduced iron (DRI), a more environmentally friendly alternative to traditional iron production methods. In this work, we model a Midrex blast furnace and propose a parallel multi-physics simulation tool based on the coupling between the Discrete Element Method (DEM) and Computational Fluid Dynamics (CFD). The particulate phase is simulated with XDEM (parallelized with MPI+OpenMP), the fluid phase is solved by OpenFOAM (parallelized with MPI), and the two solvers are coupled together using the preCICE library. We perform a careful performance analysis that focuses first on each solver individually and then on the coupled application. Our results highlight the difficulty of distributing the computing resources appropriately between the solvers in order to achieve the best performance. Finally, our multi-physics coupled implementation runs in parallel on 1024 cores and can simulate 500 seconds of the Midrex blast furnace in 1 hour and 45 minutes. This work identifies the challenge of load balancing coupled solvers and takes a step toward the simulation of a complete 3D blast furnace on High-Performance Computing platforms.
{"title":"Parallel Multi-Physics Coupled Simulation of a Midrex Blast Furnace","authors":"Xavier Besseron, P. Adhav, Bernhard Peters","doi":"10.1145/3636480.3636484","DOIUrl":"https://doi.org/10.1145/3636480.3636484","url":null,"abstract":"Traditional steelmaking is a major source of carbon dioxide emissions, but green steel production offers a sustainable alternative. Green steel is produced using hydrogen as a reducing agent instead of carbon monoxide, which results in only water vapour as a by-product. Midrex is a well-established technology that plays a crucial role in the green steel supply chain by producing direct reduced iron (DRI), a more environmentally friendly alternative to traditional iron production methods. In this work, we model a Midrex blast furnace and propose a parallel multi-physics simulation tool based on the coupling between Discrete Element Method (DEM) and Computational Fluid Dynamics (CFD). The particulate phase is simulated with XDEM (parallelized with MPI+OpenMP), the fluid phase is solved by OpenFOAM (parallelized with MPI), and the two solvers are coupled together using the preCICE library. We perform a careful performance analysis that focuses first on each solver individually and then on the coupled application. Our results highlight the difficulty of distributing the computing resources appropriately between the solvers in order to achieve the best performance. Finally, our multi-physics coupled implementation runs in parallel on 1024 cores and can simulate 500 seconds of the Midrex blast furnace in 1 hour and 45 minutes. This work identifies the challenge related to the load balancing of coupled solvers and makes a step forward towards the simulation of a complete 3D blast furnace on High-Performance Computing platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"25 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads
N. Simakov, Matthew D. Jones, T. Furlani, E. Siegmann, Robert Harrison
Engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip were tested using different benchmarks and scientific applications. The benchmarks include HPCC and HPCG. The real-application-based benchmarks include AI-Benchmark-Alpha (a TensorFlow benchmark), Gromacs, OpenFOAM, and ROMS. The performance was compared to multiple Intel, AMD, and ARM CPUs and to several x86 systems with NVIDIA GPUs. A brief energy-efficiency estimate was performed based on TDP values. We found that in the HPCC benchmark tests, the per-core performance of Grace is similar to or faster than AMD Milan cores, and the high core count often allows the NVIDIA Grace CPU Superchip to match the per-node performance of Intel Sapphire Rapids with High Bandwidth Memory: slower in matrix multiplication (by 17%) and FFT (by 6%), faster in Linpack (by 9%). In scientific applications, the NVIDIA Grace CPU Superchip is slower by 6% to 18% in Gromacs, faster by 7% in OpenFOAM, and falls between the HBM and DDR modes of Intel Sapphire Rapids in ROMS. The combined CPU-GPU performance in Gromacs is significantly faster (by 20% to 117%) than any tested x86 system with NVIDIA GPUs. Overall, the new NVIDIA Grace Hopper Superchip and NVIDIA Grace CPU Superchip are high-performance and most likely energy-efficient solutions for HPC centers.
{"title":"First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads","authors":"N. Simakov, Matthew D. Jones, T. Furlani, E. Siegmann, Robert Harrison","doi":"10.1145/3636480.3637097","DOIUrl":"https://doi.org/10.1145/3636480.3637097","url":null,"abstract":"The engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchips were tested using different benchmarks and scientific applications. The benchmarks include HPCC and HPCG. The real application-based benchmark includes AI-Benchmark-Alpha (a TensorFlow benchmark), Gromacs, OpenFOAM, and ROMS. The performance was compared to multiple Intel, AMD, ARM CPUs and several x86 with NVIDIA GPU systems. A brief energy efficiency estimate was performed based on TDP values. We found that in HPCC benchmark tests, the per-core performance of Grace is similar to or faster than AMD Milan cores, and the high core count often allows NVIDIA Grace CPU Superchip to have per-node performance similar to Intel Sapphire Rapids with High Bandwidth Memory: slower in matrix multiplication (by 17%) and FFT (by 6%), faster in Linpack (by 9%)). In scientific applications, the NVIDIA Grace CPU Superchip performance is slower by 6% to 18% in Gromacs, faster by 7% in OpenFOAM, and right between HBM and DDR modes of Intel Sapphire Rapids in ROMS. The combined CPU-GPU performance in Gromacs is significantly faster (by 20% to 117% faster) than any tested x86-NVIDIA GPU system. Overall, the new NVIDIA Grace Hopper Superchip and NVIDIA Grace CPU Superchip Superchip are high-performance and most likely energy-efficient solutions for HPC centers.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"13 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimize Efficiency of Utilizing Systems by Dynamic Core Binding
Masatoshi Kawai, Akihiro Ida, Toshihiro Hanawa, Tetsuya Hoshino
Load balancing at both the process and thread levels is imperative for minimizing application computation time in the context of MPI/OpenMP hybrid parallelization. This necessity arises from the constraint that, in a typical hybrid parallel environment, an identical number of cores is bound to each process. Dynamic Core Binding (DCB), by contrast, adjusts the core binding based on each process's workload, thereby realizing load balancing at the core level. In prior research, we implemented the DCB library, which provides two policies: one for computation-time reduction and one for power reduction. In this paper, we show that the two policies provided by the DCB library can be used together to achieve both computation-time reduction and power-consumption reduction.
{"title":"Optimize Efficiency of Utilizing Systems by Dynamic Core Binding","authors":"Masatoshi Kawai, Akihiro Ida, Toshihiro Hanawa, Tetsuya Hoshino","doi":"10.1145/3636480.3637221","DOIUrl":"https://doi.org/10.1145/3636480.3637221","url":null,"abstract":"Load balancing at both the process and thread levels is imperative for minimizing application computation time in the context of MPI/OpenMP hybrid parallelization. This necessity arises from the constraint that, within a typical hybrid parallel environment, an identical number of cores is bound to each process. Dynamic Core Binding, however, adjusts the core binding based on the process’s workload, thereby realizing load balancing at the core level. In prior research, we have implemented the DCB library, which has two policies for computation time reduction or power reduction. In this paper, we show that the two policies provided by the DCB library can be used together to achieve both computation time reduction and power consumption reduction.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"8 13","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introducing software pipelining for the A64FX processor into LLVM
Masaki Arai, Naoto Fukumoto, Hitoshi Murai
Software pipelining is an essential optimization for accelerating High-Performance Computing (HPC) applications on CPUs. Modern CPUs achieve high performance through many cores and wide SIMD instructions. Software pipelining promotes further performance improvement of HPC applications by cooperating with these features. Although open-source compilers such as GCC and LLVM have implemented software pipelining, it is underutilized for the AArch64 architecture. We have implemented software pipelining for the A64FX processor in LLVM to improve this situation. This paper describes the details of this implementation. We also confirmed that our implementation improves the performance of several benchmark programs.
{"title":"Introducing software pipelining for the A64FX processor into LLVM","authors":"Masaki Arai, Naoto Fukumoto, Hitoshi Murai","doi":"10.1145/3636480.3637093","DOIUrl":"https://doi.org/10.1145/3636480.3637093","url":null,"abstract":"Software pipelining is an essential optimization for accelerating High-Performance Computing(HPC) applications on CPUs. Modern CPUs achieve high performance through many-core and wide SIMD instructions. Software pipelining is an optimization that promotes further performance improvement of HPC applications by cooperating with these functions. Although open source compilers such as GCC and LLVM have implemented software pipelining, it is underutilized for the AArch64 architecture. We have implemented software pipelining for the A64FX processor on LLVM to improve this situation. This paper describes the details of this implementation. We also confirmed that our implementation improves the performance of several benchmark programs.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"9 35","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139437643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-throughput drug discovery on the Fujitsu A64FX architecture
Filippo Barbari, F. Ficarelli, Daniele Cesarini
High-performance computational kernels that optimally exploit modern vector-capable processors are critical for running large-scale drug discovery campaigns efficiently and promptly, within the constraints posed by urgent computing needs. Yet state-of-the-art virtual screening workflows focus either on the breadth of features provided to the drug researcher or on performance on high-throughput accelerators, leaving the task of deploying efficient CPU kernels to the compiler. We ported the key parts of the LiGen drug discovery pipeline, based on molecular docking, to the Fujitsu A64FX platform and leveraged its vector processing capabilities via an industry-proven retargetable SIMD programming model. By rethinking and optimizing key geometric docking algorithms to leverage SVE instructions, we are able to provide efficient, high-throughput execution on SVE-capable platforms.
{"title":"High-throughput drug discovery on the Fujitsu A64FX architecture","authors":"Filippo Barbari, F. Ficarelli, Daniele Cesarini","doi":"10.1145/3636480.3637095","DOIUrl":"https://doi.org/10.1145/3636480.3637095","url":null,"abstract":"High-performance computational kernels that optimally exploit modern vector-capable processors are critical in running large-scale drug discovery campaigns efficiently and promptly compatible with the constraints posed by urgent computing needs. Yet, state-of-the-art virtual screening workflows focus either on the broadness of features provided to the drug researcher or performance on high-throughput accelerators, leaving the task of deploying efficient CPU kernels to the compiler. We ported the key parts of the LiGen drug discovery pipeline, based on molecular docking, to the Fujitsu A64FX platform and leveraged its vector processing capabilities via an industry-proven retargetable SIMD programming model. By rethinking and optimizing key geometrical docking algorithms to leverage SVE instructions, we are able to provide efficient, high throughput execution on SVE-capable platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"4 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Implementation of Gas-liquid Two-phase Flow Simulations with Surfactant Transport Based on GPU Computing and Adaptive Mesh Refinement
Tongda Lian, Shintaro Matsushita, Takayuki Aoki
We propose an implementation for surfactant transport simulations in gas-liquid two-phase flows. It employs a tree-based, interface-adapted adaptive mesh refinement (AMR) method that assigns a high-resolution mesh around the interface region, significantly reducing computational resources such as memory and execution time. We developed GPU code in the CUDA programming language for the AMR method to further enhance performance through GPU parallel computing. The piecewise linear interface calculation (PLIC) method, an interface-capturing approach for two-phase flows, is implemented on top of the tree-based AMR method and GPU computing. We adopted the height function (HF) method, likewise implemented on the AMR mesh, to calculate interface curvature for the surface tension evaluation and to suppress spurious currents. We incorporated the Langmuir model to describe surfactant transport, as well as surfactant adsorption and desorption at the gas-liquid interface. Our implementation was applied to simulate a two-dimensional process in which a bubble rises freely to the liquid surface, forms a thin liquid film, and eventually causes the film's rupture. This simulation confirmed a reduction in the number of mesh grids required with our proposed implementation.
{"title":"The Implementation of Gas-liquid Two-phase Flow Simulations with Surfactant Transport Based on GPU Computing and Adaptive Mesh Refinement","authors":"Tongda Lian, Shintaro Matsushita, Takayuki Aoki","doi":"10.1145/3636480.3636485","DOIUrl":"https://doi.org/10.1145/3636480.3636485","url":null,"abstract":"We proposed an implementation for surfactant transport simulations in gas-liquid two-phase flows. This implementation employs a tree-based interface-adapted adaptive mesh refinement (AMR) method, assigning a high-resolution mesh around the interface region, significantly reducing computational resources, such as memory and execution time. We developed GPU code by CUDA programming language for the AMR method to further enhance performance through GPU parallel computing. The piece-wise linear interface calculation (PLIC) method, an interface-capturing approach for two-phase flows, is implemented based on the tree-based AMR method and GPU computing. We adopted the height function (HF) method to calculate interface curvature for surface tension assessment to suppress the spurious currents, and implemented it on the AMR mesh as well. We incorporated the Langmuir model to describe surfactant transport, as well as surfactant adsorption and desorption at the gas-liquid interface. Our implementation was applied to simulate a two-dimensional process where a bubble freely rises to the liquid surface, forms a thin liquid film, and eventually results in the film’s rupture. This simulation confirmed a reduction in the number of mesh grids required with our proposed implementations.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Evaluation of the Fourth-Generation Xeon with Different Memory Characteristics
Keiichiro Fukazawa, Riki Takahashi
The supercomputer system of the Academic Center for Computing and Media Studies, Kyoto University employs the fourth-generation Xeon (code-named Sapphire Rapids). The system consists of two subsystems: one equipped solely with high-bandwidth memory (HBM2e) and the other with a large DDR5 memory capacity. Using benchmark applications, we conducted a performance evaluation of the systems with each type of memory. Additionally, the study employed a real application, an electromagnetic fluid code, to investigate how application performance varies with differences in memory characteristics. The results confirm the performance improvement due to the high bandwidth of HBM2e. However, it was also observed that efficiency is lower when using HBM2e, and the effects of cache memory optimization are relatively minimal.
{"title":"Performance Evaluation of the Fourth-Generation Xeon with Different Memory Characteristics","authors":"Keiichiro Fukazawa, Riki Takahashi","doi":"10.1145/3636480.3637218","DOIUrl":"https://doi.org/10.1145/3636480.3637218","url":null,"abstract":"At the Supercomputer System of Academic Center for Computing and Media Studies Kyoto University, the fourth-generation Xeon (code-named Sapphire Rapids) is employed. The system consists of two subsystems—one equipped solely with high-bandwidth memory, HBM2e, and the other with a large DDR5 memory capacity. Using benchmark applications, a performance evaluation of systems with each type of memory was conducted. Additionally, the study employed a real application, the electromagnetic fluid code, to investigate how application performance varies based on differences in memory characteristics. The results confirm the performance improvement due to the high bandwidth of HBM2e. However, it was also observed that the efficiency is lower when using HBM2e, and the effects of cache memory optimization are relatively minimal.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"10 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Impact of Write-Allocate Elimination on Fujitsu A64FX
Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez
ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. Deployment of the 48-core Fujitsu A64FX ARM-architecture-based processor in the RIKEN "Fugaku" supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM into mainstream HPC. A key design criterion of the Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point-intensive science workloads. One of the mechanisms to enhance throughput concerns write-allocate operations (e.g., streaming write operations), which are quite common in science applications. In particular, eliminating write allocation (allocating a cache line on a write miss) through a special "zero fill" instruction available on ARM CPU architectures can improve overall memory bandwidth by avoiding the memory read into a cache line, which is unnecessary since the cache line will subsequently be overwritten. While the bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to consider irregular memory-access-driven scenarios such as graph analytics and to analyze the impact of write-allocate elimination on diverse data-driven applications. In this paper, we examine the impact of "zero fill" on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth-First Search, the GAP benchmark suite, and Louvain graph clustering) and five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system at Stony Brook University. Our results indicate that facilitating "zero fill" through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across compilers and input data, and note end-to-end improvements of 5-20% for the benchmarks and a diverse spectrum of application scenarios owing to "zero fill"-related adaptations.
{"title":"Impact of Write-Allocate Elimination on Fujitsu A64FX","authors":"Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez","doi":"10.1145/3636480.3637283","DOIUrl":"https://doi.org/10.1145/3636480.3637283","url":null,"abstract":"ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. Deployment of the 48-core Fujitsu A64FX ARM architecture based processor in RIKEN “Fugaku” supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM to mainstream HPC. A key design criteria of Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, which happens to be a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point intensive science workloads. One of the mechanisms to enhance the throughput concerns write-allocate operations (e.g., streaming write operations), which are quite common in science applications. In particular, eliminating write-allocate operations (allocate cache line on a write miss) through a special “zero fill” instruction available on the ARM CPU architectures can improve the overall memory bandwidth, by avoiding the memory read into a cache line, which is unnecessary since the cache line will be written consequently. While bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to consider irregular memory-access driven scenarios such as graph analytics, and analyze the impact of write-allocate elimination on diverse data-driven applications. In this paper, we examine the impact of “zero fill” on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth First Search, GAP benchmark suite, and Louvain Graph Clustering) and five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system from Stony Brook University. Our results indicate that facilitating “zero fill” through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across the compilers and input data, and note end-to-end improvements between 5–20% for the benchmarks and diverse spectrum of application scenarios owing to “zero fill” related adaptations.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"5 12","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HPCnix: make HPC Apps more easier like shell script
Minoru Kanatsu, Hiroshi Yamada
In the area of high-performance computing (HPC), extreme computing performance is expected to be extracted through highly optimized frameworks, without even the common OS APIs and frameworks available on personal desktops. However, this makes development costs higher than for normal application development and is challenging for beginners. The demand for large-scale computation is increasing due to the growth of cloud computing environments and the AI boom resulting from deep learning and large language models. Therefore, a framework that makes HPC application programming easier to handle is needed. This study presents a concept model that makes it possible to write HPC applications using semantics like the Unix shell command pipeline, and proposes a simple application framework for HPC beginners called HPCnix.
{"title":"HPCnix: make HPC Apps more easier like shell script","authors":"Minoru Kanatsu, Hiroshi Yamada","doi":"10.1145/3636480.3637231","DOIUrl":"https://doi.org/10.1145/3636480.3637231","url":null,"abstract":"In the area of high-performance computing (HPC), it is expected to extract extreme computing performance using a highly optimized framework without even common OS APIs and frameworks for personal desktops. However, this makes the development cost higher than normal application development and challenging for beginners. The demand for large-scale computation is increasing due to the growth of cloud computing environments and the AI boom resulting from deep learning and large-scale language models. Therefore, a framework that makes it easier to handle HPC application programming is needed. This study shows a concept model that makes it possible to write HPC applications using semantics like the shell command pipeline in Unix. It proposes a simple application framework for beginners in HPC called HPCnix.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"1 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}