Performance evaluation of the LBM simulations in fluid dynamics on SX-Aurora TSUBASA vector engine

Computer Physics Communications | IF 3.4 | Zone 2 (Physics and Astrophysics) | JCR Q1 (Computer Science, Interdisciplinary Applications) | Pub Date: 2025-02-01 | Epub Date: 2024-10-28 | DOI: 10.1016/j.cpc.2024.109411

Xiangcheng Sun, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa, Xian Wang

Computer Physics Communications, Volume 307, Article 109411. Journal article; not open access. Full text: https://www.sciencedirect.com/science/article/pii/S0010465524003345

Citations: 0



Performance evaluation of the LBM simulations in fluid dynamics on SX-Aurora TSUBASA vector engine

The lattice Boltzmann method (LBM), combined with high-performance computing (HPC) technologies such as graphics processing units (GPUs), has been widely adopted to solve complex problems in fluid dynamics. In addition to GPUs, the vector engine (VE) developed by NEC Corporation has emerged as an effective platform for memory-intensive numerical simulations such as LBM, which makes it important to evaluate the performance of VE-accelerated LBM simulations. This study discusses our self-developed LBM code in both classical and fused implementations on the VE. Using numerical simulations of 2D and 3D lid-driven cavity flows, the performance of the new VE Type 30A (VE30) on large-scale grids is evaluated and analyzed, and compared against results obtained with the VE Type 20B (VE20), the NVIDIA A100 GPU (A100), and the NVIDIA H100 GPU (H100). The results indicate that H100 achieves the highest performance regardless of the LBM implementation. Owing to substantial enhancements in VE30's memory hierarchy, the streaming kernel in the classical implementation performs significantly better on VE30 than on VE20 and A100, approaching the performance of H100. In the fused implementation, however, which requires fewer memory accesses, VE30 performs worse than H100. In addition, for specific physical problems and requirements, VE30 is expected to show clear performance potential in LBM simulations with large grid sizes.
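To make the distinction between the classical and fused implementations concrete, the following is a minimal sketch, not the authors' code. It assumes a D2Q9 lattice with BGK collision and, for brevity, periodic boundaries instead of the lid-driven cavity used in the study. All function names (`step_classical`, `step_fused`, `collide_cell`, and so on) are hypothetical, and plain Python lists stand in for the vectorized arrays a VE or GPU code would use. The classical step runs collision and streaming as two separate passes over the grid, while the fused step collides each cell and immediately pushes the result to its neighbours in a single pass.

```python
# D2Q9 lattice: 9 discrete velocities (CX[k], CY[k]) with weights W[k].
CX = [0, 1, 0, -1, 0, 1, -1, -1, 1]
CY = [0, 0, 1, 0, -1, 1, 1, -1, -1]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def equilibrium(rho, ux, uy):
    """Second-order equilibrium distribution for one cell."""
    usq = ux * ux + uy * uy
    feq = []
    for k in range(9):
        cu = CX[k] * ux + CY[k] * uy
        feq.append(W[k] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def collide_cell(fc, omega):
    """BGK relaxation of the 9 populations of one cell toward equilibrium."""
    rho = sum(fc)
    ux = sum(CX[k] * fc[k] for k in range(9)) / rho
    uy = sum(CY[k] * fc[k] for k in range(9)) / rho
    feq = equilibrium(rho, ux, uy)
    return [fc[k] + omega * (feq[k] - fc[k]) for k in range(9)]


def empty_field(nx, ny):
    """Allocate an ny-by-nx grid with 9 populations per cell."""
    return [[[0.0] * 9 for _ in range(nx)] for _ in range(ny)]


def init_field(nx, ny):
    """Fluid at rest with a small density perturbation (illustrative only)."""
    return [[equilibrium(1.0 + 0.01 * ((x * y) % 3), 0.0, 0.0)
             for x in range(nx)] for y in range(ny)]


def total_mass(f):
    """Sum of all populations over the whole grid."""
    return sum(sum(cell) for row in f for cell in row)


def step_classical(f, nx, ny, omega):
    """Two passes: a collision kernel, then a separate streaming kernel."""
    # Pass 1 (collision): read every cell, store post-collision populations.
    post = [[collide_cell(f[y][x], omega) for x in range(nx)]
            for y in range(ny)]
    # Pass 2 (streaming): read the stored populations again and move each one
    # to the neighbouring cell in its direction (periodic wrap assumed).
    new = empty_field(nx, ny)
    for y in range(ny):
        for x in range(nx):
            for k in range(9):
                new[(y + CY[k]) % ny][(x + CX[k]) % nx][k] = post[y][x][k]
    return new


def step_fused(f, nx, ny, omega):
    """One pass: collide a cell and push its populations immediately."""
    new = empty_field(nx, ny)
    for y in range(ny):
        for x in range(nx):
            post = collide_cell(f[y][x], omega)  # stays local to this cell
            for k in range(9):
                new[(y + CY[k]) % ny][(x + CX[k]) % nx][k] = post[k]
    return new


if __name__ == "__main__":
    nx, ny, omega = 8, 6, 1.2
    a = b = init_field(nx, ny)
    for _ in range(5):
        a = step_classical(a, nx, ny, omega)
        b = step_fused(b, nx, ny, omega)
    print("classical and fused agree:", a == b)
    print("total mass:", total_mass(a))
```

In this sketch the fused step never stores the intermediate post-collision grid, so each time step reads and writes the distribution data once instead of twice. This is the sense in which the fused implementation requires fewer memory accesses.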


Source journal

Computer Physics Communications (Physics; Computer Science, Interdisciplinary Applications)

CiteScore: 12.10
Self-citation rate: 3.20%
Articles published: 287
Review time: 5.3 months

About the journal: The focus of CPC is on contemporary computational methods and techniques and their implementation, the effectiveness of which will normally be evidenced by the author(s) within the context of a substantive problem in physics. Within this setting CPC publishes two types of paper.

Computer Programs in Physics (CPiP): These papers describe significant computer programs to be archived in the CPC Program Library, which is held in the Mendeley Data repository. The submitted software must be covered by an approved open source licence. Papers and associated computer programs that address a problem of contemporary interest in physics that cannot be solved by current software are particularly encouraged.

Computational Physics Papers (CP): These are research papers in themes across computational physics and related disciplines, including but not limited to: mathematical and numerical methods and algorithms; computational models, including those associated with the design, control and analysis of experiments; and algebraic computation. Each will normally include software implementation and performance details. The software implementation should, ideally, be available via GitHub, Zenodo or an institutional repository.

In addition, research papers on the impact of advanced computer architecture and special purpose computers on computing in the physical sciences, and on software topics related to, and of importance in, the physical sciences, may be considered.