Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs

ACM Transactions on Architecture and Code Optimization | Pub Date: 2023-12-28 | DOI: 10.1145/3633462 | IF 1.8 | Zone 3 (Computer Science) | JCR Q4 (Computer Science, Hardware & Architecture)

Khalid Ahmad, Cris Cecka, Michael Garland, Mary Hall


Citations: 0

Abstract



An important sparse tensor computation is sparse-tensor-dense-matrix multiplication (SpTM), which is used in tensor decomposition and applications. SpTM is a multi-dimensional analog to sparse-matrix-dense-matrix multiplication (SpMM). In this paper, we employ a hierarchical tensor data layout that can unfold a multidimensional tensor to derive a 2D matrix, making it possible to compute SpTM using SpMM kernel implementations for GPUs. We compare two SpMM implementations to the state-of-the-art PASTA sparse tensor contraction implementation using: (1) SpMM with hierarchical tensor data layout; and, (2) unfolding followed by an invocation of cuSPARSE’s SpMM. Results show that SpMM can outperform PASTA 70.9% of the time, but none of the three approaches is best overall. Therefore, we use a decision tree classifier to identify the best performing sparse tensor contraction kernel based on precomputed properties of the sparse tensor.
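The unfolding idea in the abstract can be illustrated with a minimal sketch. It is not the paper's hierarchical layout, PASTA, or cuSPARSE: it assumes a plain coordinate (COO) tensor, assumes the contraction runs over the last mode, and uses hypothetical function names (`unfold_last_mode`, `spmm_coo`, `sptm_last_mode`) chosen only for illustration. All modes except the last are linearized into a row index, so the tensor becomes a sparse 2D matrix that an ordinary SpMM routine can multiply by the dense matrix.

```python
from collections import defaultdict


def unfold_last_mode(coords, vals, shape):
    """Flatten a COO sparse tensor into a COO sparse matrix.

    All modes except the last are linearized (row-major) into the row
    index; the last mode, which is contracted, becomes the column index.
    """
    rows, cols = [], []
    for idx in coords:
        row = 0
        for i, dim in zip(idx[:-1], shape[:-1]):
            row = row * dim + i
        rows.append(row)
        cols.append(idx[-1])
    return rows, cols, list(vals)


def spmm_coo(rows, cols, vals, dense):
    """Multiply a COO sparse matrix by a dense matrix (a list of rows).

    Only output rows that receive at least one nonzero are stored.
    """
    rank = len(dense[0])
    out = defaultdict(lambda: [0.0] * rank)
    for r, c, v in zip(rows, cols, vals):
        acc = out[r]
        for j in range(rank):
            acc[j] += v * dense[c][j]
    return dict(out)


def sptm_last_mode(coords, vals, shape, dense):
    """SpTM along the last mode, computed as unfold followed by SpMM."""
    rows, cols, flat_vals = unfold_last_mode(coords, vals, shape)
    return spmm_coo(rows, cols, flat_vals, dense)


# A 2 x 3 x 4 tensor with three nonzeros, times a dense 4 x 2 matrix.
shape = (2, 3, 4)
coords = [(0, 1, 2), (0, 1, 3), (1, 2, 0)]
vals = [1.0, 2.0, 3.0]
dense = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

result = sptm_last_mode(coords, vals, shape, dense)
print(result)  # keys are linearized (i, j) indices: i * 3 + j
```

In this toy layout the two nonzeros that share the fiber (0, 1, :) land in the same unfolded row, so their contributions accumulate into one dense output row. On a GPU the same unfolded matrix would be handed to an SpMM kernel instead of the Python loop.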


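Source journal

ACM Transactions on Architecture and Code Optimization (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore: 3.60

Self-citation rate: 6.20%

Articles published: (not given)

Review time: 6-12 weeks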

About the journal: ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.
