LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs

Yongchao Liu, B. Schmidt
{"title":"LightSpMV:在支持cuda的gpu上更快的基于csr的稀疏矩阵向量乘法","authors":"Yongchao Liu, B. Schmidt","doi":"10.1109/ASAP.2015.7245713","DOIUrl":null,"url":null,"abstract":"Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, the state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency. This has motivated the development of some alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, thus incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed by benefiting from the fine-grained dynamic distribution of matrix rows over warps/vectors. In LightSpMV, two dynamic row distribution approaches have been investigated at the vector and warp levels with atomic operations and warp shuffle functions as the fundamental building blocks. We have evaluated LightSpMV using various sparse matrices and further compared it to the CSR-based SpMV subprograms in the state-of-the-art CUSP and cuSPARSE libraries. Performance evaluation reveals that on the same Tesla K40c GPU, LightSpMV is superior to both CUSP and cuSPARSE, with a speedup of up to 2.60 and 2.63 over CUSP, and up to 1.93 and 1.79 over cuSPARSE for single and double precision, respectively. LightSpMV is available at http://lightspmv.sourceforge.net.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"12 1","pages":"82-89"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"41","resultStr":"{\"title\":\"LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs\",\"authors\":\"Yongchao Liu, B. Schmidt\",\"doi\":\"10.1109/ASAP.2015.7245713\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, the state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency. This has motivated the development of some alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, thus incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed by benefiting from the fine-grained dynamic distribution of matrix rows over warps/vectors. In LightSpMV, two dynamic row distribution approaches have been investigated at the vector and warp levels with atomic operations and warp shuffle functions as the fundamental building blocks. We have evaluated LightSpMV using various sparse matrices and further compared it to the CSR-based SpMV subprograms in the state-of-the-art CUSP and cuSPARSE libraries. Performance evaluation reveals that on the same Tesla K40c GPU, LightSpMV is superior to both CUSP and cuSPARSE, with a speedup of up to 2.60 and 2.63 over CUSP, and up to 1.93 and 1.79 over cuSPARSE for single and double precision, respectively. 
LightSpMV is available at http://lightspmv.sourceforge.net.\",\"PeriodicalId\":6642,\"journal\":{\"name\":\"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)\",\"volume\":\"12 1\",\"pages\":\"82-89\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-07-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"41\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASAP.2015.7245713\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASAP.2015.7245713","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 41

Abstract

Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, the state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency. This has motivated the development of some alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, thus incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed by benefiting from the fine-grained dynamic distribution of matrix rows over warps/vectors. In LightSpMV, two dynamic row distribution approaches have been investigated at the vector and warp levels with atomic operations and warp shuffle functions as the fundamental building blocks. We have evaluated LightSpMV using various sparse matrices and further compared it to the CSR-based SpMV subprograms in the state-of-the-art CUSP and cuSPARSE libraries. Performance evaluation reveals that on the same Tesla K40c GPU, LightSpMV is superior to both CUSP and cuSPARSE, with a speedup of up to 2.60 and 2.63 over CUSP, and up to 1.93 and 1.79 over cuSPARSE for single and double precision, respectively. LightSpMV is available at http://lightspmv.sourceforge.net.
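To make the idea in the abstract concrete, the following CUDA kernel is a minimal sketch (not the authors' implementation, which is available at the URL above) of CSR-based SpMV with vector-level dynamic row distribution: a global row counter advanced with atomicAdd lets each sub-warp "vector" of threads fetch rows on demand, and warp shuffle intrinsics broadcast the fetched row index and reduce the partial sums. The kernel name, the VECTOR_SIZE value, and the single-precision types are illustrative assumptions.

```cuda
// csr_spmv_sketch.cu -- illustrative only, not the LightSpMV source.
#include <cuda_runtime.h>

// Example: the 3x4 matrix
//   [ 1 0 2 0 ]
//   [ 0 3 0 0 ]
//   [ 4 0 5 6 ]
// is stored in CSR as
//   row_ptr = {0, 2, 3, 6}   (offsets into col_idx/values, length rows + 1)
//   col_idx = {0, 2, 1, 0, 2, 3}
//   values  = {1, 2, 3, 4, 5, 6}

// Number of threads (a sub-warp "vector") cooperating on one matrix row.
// 4 is an arbitrary illustrative choice; in practice it would be tuned to
// the matrix's average number of non-zeros per row. Power of two <= 32.
constexpr int VECTOR_SIZE = 4;

__global__ void csr_spmv_vector_dynamic(int num_rows,
                                        const int*   __restrict__ row_ptr,  // CSR row offsets
                                        const int*   __restrict__ col_idx,  // CSR column indices
                                        const float* __restrict__ values,   // CSR non-zero values
                                        const float* __restrict__ x,        // dense input vector
                                        float*       __restrict__ y,        // output, y = A * x
                                        int* row_counter)                   // global counter, zeroed before launch
{
    const int lane = threadIdx.x % VECTOR_SIZE;   // lane within this vector
    // Mask covering only this vector's lanes inside its warp.
    const unsigned seg_mask = (0xffffffffu >> (32 - VECTOR_SIZE))
                              << (((threadIdx.x & 31) / VECTOR_SIZE) * VECTOR_SIZE);

    int row = 0;
    for (;;) {
        // Dynamic scheduling: lane 0 of the vector grabs the next unprocessed row.
        if (lane == 0)
            row = atomicAdd(row_counter, 1);
        // Broadcast the fetched row index to the rest of the vector.
        row = __shfl_sync(seg_mask, row, 0, VECTOR_SIZE);
        if (row >= num_rows)
            break;

        // Each lane accumulates a strided slice of the row's non-zeros.
        float sum = 0.0f;
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += VECTOR_SIZE)
            sum += values[j] * x[col_idx[j]];

        // Intra-vector reduction with warp shuffles; lane 0 ends up with the row sum.
        for (int offset = VECTOR_SIZE / 2; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(seg_mask, sum, offset, VECTOR_SIZE);

        if (lane == 0)
            y[row] = sum;
    }
}
```

Before each launch the host would zero row_counter (e.g. with cudaMemset) and can size the grid independently of the number of rows, since rows are consumed dynamically rather than being statically assigned to blocks; this is what allows the fine-grained load balancing across irregular row lengths described above. LightSpMV additionally provides a warp-level variant in which a full warp, rather than a smaller vector, processes each row.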