OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). Pub Date: 2018-03-27. DOI: 10.1109/HPCA.2018.00067

S. Pal, Jonathan Beaumont, Dong-hyeon Park, Aporva Amarnath, Siying Feng, C. Chakrabarti, Hun-Seok Kim, D. Blaauw, T. Mudge, R. Dreslinski


Citations: 179

Abstract

Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².
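
The decoupling of multiplication from accumulation mentioned in the abstract can be illustrated with a small Python sketch. It relies on the standard identity that a product C = A·B equals the sum over k of the outer product of column k of A with row k of B. This is only an illustration under stated assumptions, not the authors' implementation or the OuterSPACE hardware dataflow: the function names (`columns_of`, `rows_of`, `multiply_phase`, `merge_phase`, `outer_product_spgemm`) and the list-of-tuples storage are hypothetical choices made here for readability.

```python
from collections import defaultdict


def columns_of(dense):
    """Sparse columns of a dense matrix: cols[k] = [(row, value), ...]."""
    n_cols = len(dense[0])
    return [[(i, row[k]) for i, row in enumerate(dense) if row[k] != 0]
            for k in range(n_cols)]


def rows_of(dense):
    """Sparse rows of a dense matrix: rows[k] = [(col, value), ...]."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]


def multiply_phase(a_cols, b_rows):
    """Form one outer product per index k; this phase multiplies but never adds.

    Column k of A and row k of B are each traversed once per index k.
    Returns partials[i] = list of (col, product) pairs landing in row i of C.
    """
    partials = defaultdict(list)
    for a_col, b_row in zip(a_cols, b_rows):
        for i, a_val in a_col:
            for j, b_val in b_row:
                partials[i].append((j, a_val * b_val))
    return partials


def merge_phase(partials):
    """Accumulate partial products that share the same (row, col) position."""
    result = {}
    for i, products in partials.items():
        row_acc = defaultdict(float)
        for j, value in products:
            row_acc[j] += value
        result[i] = dict(sorted(row_acc.items()))
    return result


def outer_product_spgemm(a_dense, b_dense):
    """Multiply two matrices given as dense lists; returns {row: {col: value}}."""
    return merge_phase(multiply_phase(columns_of(a_dense), rows_of(b_dense)))


if __name__ == "__main__":
    A = [[1, 0, 2],
         [0, 0, 3],
         [4, 0, 0]]
    B = [[0, 5, 0],
         [6, 0, 0],
         [0, 0, 7]]
    print(outer_product_spgemm(A, B))
```

In this sketch the multiply phase only emits partial products and the merge phase only sums them, so the two steps could in principle be scheduled independently. Index k = 1 contributes nothing because column 1 of `A` is empty, which shows how an outer-product formulation skips work when either operand slice has no non-zeros.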
