Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication
Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, K. Yelick, A. Buluç
ICS: Proceedings of the ACM International Conference on Supercomputing
DOI: 10.1145/3447818.3461472 · Published: 2021-06-03
Citations: 9
Abstract
Sparse times dense matrix multiplication (SpMM) finds applications in well-established fields such as computational linear algebra as well as emerging fields such as graph neural networks. In this study, we evaluate the performance of various techniques for performing SpMM as a distributed computation across many nodes, focusing on GPU accelerators. We examine how the local computational performance of state-of-the-art SpMM implementations affects overall efficiency as matrix dimensions change when scaling to large numbers of nodes, which proves to be an unexpectedly important bottleneck. We also consider various distribution strategies, including A-Stationary, B-Stationary, and C-Stationary algorithms, 1.5D and 2D algorithms, and RDMA-based and bulk-synchronous methods of data transfer. Our results show that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of the local SpMM operations. Our evaluations reveal that with the involvement of GPU accelerators, the best design choices for SpMM differ from the conventional algorithms known to perform well for dense-dense or sparse-sparse matrix multiplication.
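To make the setting concrete, the kernel under study is C = A · B, where A is a large sparse matrix and B is a tall-skinny dense matrix (many rows, few columns). The sketch below shows the local SpMM kernel using SciPy; it is illustrative only, not code from the paper, and the matrix sizes and density are placeholder assumptions. In a distributed A-, B-, or C-Stationary scheme, each process would execute a multiply like this on its local blocks of A and B.

```python
# Minimal local SpMM sketch (illustrative, not from the paper):
# C = A @ B with A sparse (CSR) and B a tall-skinny dense matrix.
import numpy as np
import scipy.sparse as sp

n, k, density = 100_000, 64, 1e-4   # tall-skinny: k << n (placeholder sizes)
A = sp.random(n, n, density=density, format="csr", dtype=np.float64)
B = np.random.rand(n, k)            # dense tall-skinny operand

# Local SpMM kernel: CSR-times-dense multiply; in a distributed run each
# process would apply this to the row/column block of A and B it owns.
C = A @ B
print(C.shape)                      # (n, k)
```

In the GPU setting the paper studies, the local multiply would instead call a GPU SpMM routine (e.g., a cuSPARSE-backed kernel), but the data layout question is the same: which of A, B, or C stays stationary on each node, and how the other operands are communicated.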