An efficient sparse stiffness matrix vector multiplication using compressed sparse row storage format on AMD GPU

Longyue Xing, Zhaoshun Wang, Zhezhao Ding, Genshen Chu, Lingyu Dong, Nan Xiao
Concurrency and Computation: Practice and Experience, published 2022-07-20
DOI: 10.1002/cpe.7186 (https://doi.org/10.1002/cpe.7186)
Citations: 2

Abstract

The performance of sparse stiffness matrix-vector multiplication is essential for large-scale numerical simulation in structural mechanics. Compressed sparse row (CSR) is the most common format for storing sparse stiffness matrices. However, because stiffness matrices are highly sparse, the number of nonzero elements per row is very small. As a result, the CSR-scalar, light, and HOLA algorithms leave some GPU threads idle during the calculation, which both degrades computing performance and wastes computing resources. This article proposes a new algorithm, CSR-vector row, for fine-grained computing optimization based on the AMD GPU architecture on heterogeneous supercomputers. The algorithm sets up a vector to compute each row according to the number of nonzero elements of the stiffness matrix. CSR-vector row features efficient reduce operations, deep memory access optimization, and a kernel function configuration scheme that better overlaps memory access with calculation. The memory access bandwidth of the algorithm on the AMD GPU exceeds 700 GB/s. Compared with the CSR-scalar algorithm, the parallel efficiency of CSR-vector row is improved by 7.2 times, and its floating-point computing performance is 41%–95% higher than that of the light and HOLA algorithms. In addition, when CSR-vector row is applied to examples from CFD, electromagnetics, quantum chemistry, power networks, and semiconductor processes, its memory access bandwidth and double-precision floating-point performance also improve on those of rocSPARSE-CSR-vector.
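
The abstract does not show the kernel itself, so the following is only a minimal sketch of the general idea it describes: CSR storage, a CSR-scalar baseline with one thread per row, and a vector-per-row variant that splits each row across several lanes and then reduces the partial sums. It is a sequential Python model, not the authors' GPU implementation. The function names, the width heuristic (smallest power of two at or above the mean nonzeros per row, capped at 64), and the example matrix are all assumptions made for illustration.

```python
# Sequential model of two CSR SpMV thread mappings (illustrative only).
# All names here are hypothetical; this is not the paper's kernel.

def csr_scalar_spmv(row_ptr, col_idx, vals, x):
    """CSR-scalar model: one simulated thread walks each row serially."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y


def pick_vector_width(row_ptr, max_width=64):
    """Assumed heuristic: smallest power of two >= mean nonzeros per row,
    capped at max_width (64 is used here as an assumed wavefront size)."""
    n_rows = len(row_ptr) - 1
    mean_nnz = row_ptr[-1] / max(n_rows, 1)
    width = 1
    while width < mean_nnz and width < max_width:
        width *= 2
    return width


def csr_vector_spmv(row_ptr, col_idx, vals, x, width):
    """Vector-per-row model: `width` simulated lanes share one row.
    `width` must be a power of two for the tree reduction below."""
    y = []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        partial = [0.0] * width
        # Each lane handles elements start+lane, start+lane+width, ...
        for lane in range(width):
            for k in range(start + lane, end, width):
                partial[lane] += vals[k] * x[col_idx[k]]
        # Tree reduction of the per-lane partial sums into lane 0.
        stride = width // 2
        while stride > 0:
            for lane in range(stride):
                partial[lane] += partial[lane + stride]
            stride //= 2
        y.append(partial[0])
    return y


# Small made-up 4x4 tridiagonal matrix in CSR form (not data from the paper).
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 3.0]
x = [1.0, 2.0, 3.0, 4.0]

width = pick_vector_width(row_ptr)
y_scalar = csr_scalar_spmv(row_ptr, col_idx, vals, x)
y_vector = csr_vector_spmv(row_ptr, col_idx, vals, x, width)
```

In this model, a row with fewer nonzeros than `width` leaves the extra lanes with zero partial sums, which mirrors the idle-thread problem the abstract describes and suggests why matching the vector width to the nonzeros per row matters.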