Multiple-GPU accelerated high-order gas-kinetic scheme on three-dimensional unstructured meshes

Impact Factor: 3.4 · CAS Zone 2 (Physics and Astrophysics) · JCR Q1 (Computer Science, Interdisciplinary Applications) · Computer Physics Communications · Pub Date: 2025-05-01 · Epub Date: 2025-01-22 · DOI: 10.1016/j.cpc.2025.109513

Yuhang Wang, Waixiang Cao, Liang Pan

Computer Physics Communications, Volume 310, Article 109513. Full text: https://www.sciencedirect.com/science/article/pii/S0010465525000165

Citation count: 0

Abstract

Recently, the high-order gas-kinetic scheme (HGKS) has been applied successfully to compressible flows on unstructured meshes. In this paper, to accelerate the computation, HGKS is implemented on the graphics processing unit (GPU) using the compute unified device architecture (CUDA). HGKS on unstructured meshes is a fully explicit scheme, so the acceleration framework can be built on cell-level parallelism. For single-GPU computation, the connectivity of the geometric information is generated to meet the requirements of data localization and independence. Based on this data structure, the CUDA kernels and the corresponding grids are set. With a one-to-one mapping between cell indices and CUDA threads, the single-GPU computation can be implemented for HGKS. For multiple-GPU computation, the domain decomposition and data exchange need to be taken into account. The domain is decomposed into subdomains by METIS, and MPI processes are created to control each process and the communication among GPUs. By reconstructing the connectivity and adding ghost cells, each GPU can inherit the main CUDA configuration of the single-GPU implementation. Benchmark cases for compressible flows, including an accuracy test and the flow past a sphere, are presented to assess the numerical performance of HGKS on Nvidia RTX A5000 and Tesla V100 GPUs. For single-GPU computation, compared with the parallel central processing unit (CPU) code running on an Intel Xeon Gold 5120 CPU with open multi-processing (OpenMP) directives, a 5x speedup is achieved by the RTX A5000 and a 9x speedup by the Tesla V100. For multiple-GPU computation, the HGKS code scales properly with the increasing number of GPUs. Numerical results confirm the excellent performance of the multiple-GPU accelerated HGKS on unstructured meshes.
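The cell-level parallelism and ghost-cell idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-cell update is a toy neighbor-averaging step standing in for the HGKS flux evaluation, the partition array `part` is hard-coded where METIS would normally supply it, the "exchange" is a plain list gather where MPI would be used, and a Python loop stands in for one CUDA thread per cell. All function and variable names are hypothetical.

```python
# Toy sketch of cell-level parallelism with ghost cells (assumptions only,
# not HGKS): each cell is updated from old data of itself and its neighbors.

def update_cell(i, u_old, neighbors):
    """Work of one 'thread': reads old values only, returns new value of cell i."""
    return u_old[i] + 0.25 * sum(u_old[j] - u_old[i] for j in neighbors[i])

def step_global(u, neighbors):
    # One cell per "thread"; iterations are independent because u is read-only.
    return [update_cell(i, u, neighbors) for i in range(len(u))]

def build_subdomain(rank, part, neighbors):
    """Local numbering for one subdomain: owned cells first, then ghost cells."""
    owned = [c for c, p in enumerate(part) if p == rank]
    # Ghost cells: neighbors of owned cells that belong to another subdomain.
    ghosts = sorted({j for c in owned for j in neighbors[c] if part[j] != rank})
    local_ids = owned + ghosts
    g2l = {g: l for l, g in enumerate(local_ids)}
    # Connectivity rebuilt in local indices, stored for owned cells only.
    local_nbrs = [[g2l[j] for j in neighbors[c]] for c in owned]
    return {"owned": owned, "ghosts": ghosts,
            "local_ids": local_ids, "nbrs": local_nbrs}

def step_decomposed(u, part, neighbors):
    subs = [build_subdomain(r, part, neighbors) for r in range(max(part) + 1)]
    u_new = list(u)
    for sub in subs:
        # Stand-in for the data exchange: gather owned and ghost values.
        u_local = [u[g] for g in sub["local_ids"]]
        # Same per-cell routine as the single-domain case, on local indices.
        for l, g in enumerate(sub["owned"]):
            u_new[g] = update_cell(l, u_local, sub["nbrs"])
    return u_new

# Small unstructured connectivity: neighbors[c] lists the cells adjacent to c.
neighbors = [[1], [0, 2], [1, 3], [2, 4, 5], [3, 5], [3, 4]]
part = [0, 0, 0, 1, 1, 1]          # hypothetical partition into 2 subdomains
u0 = [1.0, 0.0, 0.0, 2.0, 0.0, 4.0]

u_global = step_global(u0, neighbors)
u_split = step_decomposed(u0, part, neighbors)
print(u_global == u_split)
```

Because each subdomain stores its ghost cells after its owned cells and rewrites the connectivity in local indices, the per-cell routine is reused unchanged, which mirrors the abstract's statement that each GPU inherits the single-GPU configuration.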

Source journal

Computer Physics Communications (Physics; Computer Science, Interdisciplinary Applications)

CiteScore: 12.10
Self-citation rate: 3.20%
Articles published: 287
Review time: 5.3 months

Journal description: The focus of CPC is on contemporary computational methods and techniques and their implementation, the effectiveness of which will normally be evidenced by the author(s) within the context of a substantive problem in physics. Within this setting CPC publishes two types of paper.

Computer Programs in Physics (CPiP): These papers describe significant computer programs to be archived in the CPC Program Library, which is held in the Mendeley Data repository. The submitted software must be covered by an approved open source licence. Papers and associated computer programs that address a problem of contemporary interest in physics that cannot be solved by current software are particularly encouraged.

Computational Physics Papers (CP): These are research papers in, but not limited to, the following themes across computational physics and related disciplines: mathematical and numerical methods and algorithms; computational models, including those associated with the design, control and analysis of experiments; and algebraic computation. Each will normally include software implementation and performance details. The software implementation should, ideally, be available via GitHub, Zenodo or an institutional repository. In addition, research papers on the impact of advanced computer architecture and special purpose computers on computing in the physical sciences, and on software topics related to, and of importance in, the physical sciences, may be considered.