An efficient sparse stiffness matrix vector multiplication using compressed sparse row storage format on AMD GPU

Concurrency and Computation: Practice and Experience, Pub Date: 2022-07-20, DOI: 10.1002/cpe.7186

Longyue Xing, Zhaoshun Wang, Zhezhao Ding, Genshen Chu, Lingyu Dong, Nan Xiao


Citations: 2

Abstract

The performance of sparse stiffness matrix-vector multiplication is essential for large-scale numerical simulation in structural mechanics. Compressed sparse row (CSR) is the most common format for storing sparse stiffness matrices. However, because stiffness matrices are highly sparse, the number of nonzero elements per row is very small. As a result, the CSR-scalar, light, and HOLA algorithms leave some GPU threads idle during the computation, which both degrades computing performance and wastes computing resources. This article proposes a new algorithm, CSR-vector row, for fine-grained computing optimization based on the AMD GPU architecture on heterogeneous supercomputers. The algorithm assigns a vector to compute each row, configured according to the number of nonzero elements in the stiffness matrix. CSR-vector row features efficient reduce operations, deep memory access optimization, and a kernel function configuration scheme that better overlaps memory access with calculation. The memory access bandwidth of the algorithm on AMD GPU exceeds 700 GB/s. Compared with the CSR-scalar algorithm, the parallel efficiency of CSR-vector row is improved by 7.2 times, and its floating-point computing performance is 41%–95% higher than that of the light and HOLA algorithms. In addition, when CSR-vector row is applied to examples from CFD, electromagnetics, quantum chemistry, power networks, and semiconductor processes, its memory access bandwidth and double-precision floating-point performance also improve over rocSPARSE-CSR-vector.
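To make the contrast between one thread per row and one vector per row concrete, the following is a minimal CPU-side sketch in Python. It is not the kernel from the article: the function names, the width heuristic in `pick_vector_width`, and the strided lane layout are all assumptions chosen for illustration, and the "lanes" are simulated with ordinary loops rather than GPU threads.

```python
"""CPU-side simulation of CSR SpMV with one "vector" (group of lanes) per row.

Illustrative only: the names and the width heuristic are assumptions,
not the implementation described in the article.
"""


def csr_scalar_spmv(row_ptr, col_idx, vals, x):
    """CSR-scalar: one (simulated) thread walks each row sequentially."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y


def pick_vector_width(row_ptr, max_width=64):
    """Assumed heuristic: smallest power of two >= mean nonzeros per row,
    capped at max_width (64 matches the wavefront size of AMD GCN/CDNA GPUs)."""
    n_rows = len(row_ptr) - 1
    mean_nnz = (row_ptr[-1] - row_ptr[0]) / max(n_rows, 1)
    width = 1
    while width < mean_nnz and width < max_width:
        width *= 2
    return width


def csr_vector_spmv(row_ptr, col_idx, vals, x, width):
    """One group of `width` lanes per row (width must be a power of two).

    Each lane strides over the row's nonzeros; the per-lane partial sums
    are then combined with a tree reduction.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        partial = [0.0] * width
        for lane in range(width):
            for k in range(start + lane, end, width):
                partial[lane] += vals[k] * x[col_idx[k]]
        # Tree reduction: halve the number of active lanes each step.
        offset = width // 2
        while offset > 0:
            for lane in range(offset):
                partial[lane] += partial[lane + offset]
            offset //= 2
        y.append(partial[0])
    return y


# Small tridiagonal example (hypothetical data, 4 x 4, 10 nonzeros).
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0]

width = pick_vector_width(row_ptr)
y_scalar = csr_scalar_spmv(row_ptr, col_idx, vals, x)
y_vector = csr_vector_spmv(row_ptr, col_idx, vals, x, width)
```

Choosing a width close to the typical row length keeps most simulated lanes busy: a width far larger than the row length leaves lanes idle, which is the inefficiency the abstract attributes to rows with very few nonzero elements.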



