SparseP

Christina Giannoula, Ivan Fernandez, Juan Gómez-Luna, Nectarios Koziris, Georgios Goumas, Onur Mutlu
{"title":"SparseP","authors":"Christina Giannoula, Ivan Fernandez, Juan Gómez-Luna, N. Koziris, G. Goumas, O. Mutlu","doi":"10.1145/3508041","DOIUrl":null,"url":null,"abstract":"Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been characterized as one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the input matrices given. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies on SpMV for a multithreaded PIM core, including (1) various compressed matrix formats, (2) load balancing schemes across parallel threads and (3) synchronization approaches, and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels to perform the complete SpMV computation only using PIM cores, and (2) 2D-partitioned kernels to strive a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to an Intel Xeon CPU and an NVIDIA Tesla V100 GPU to study the performance and energy efficiency of various devices, i.e., both memory-centric PIM systems and conventional processor-centric CPU/GPU systems, for the SpMV kernel. SparseP software package provides 25 SpMV kernels for real PIM systems supporting the four most widely used compressed matrix formats, i.e., CSR, COO, BCSR and BCOO, and a wide range of data types. SparseP is publicly and freely available at https://github.com/CMU-SAFARI/SparseP. Our extensive evaluation using 26 matrices with various sparsity patterns provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3508041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, making them a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been characterized as one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due to its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the input matrices given. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies on SpMV for a multithreaded PIM core, including (1) various compressed matrix formats, (2) load balancing schemes across parallel threads and (3) synchronization approaches, and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels to perform the complete SpMV computation using only PIM cores, and (2) 2D-partitioned kernels to strike a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to an Intel Xeon CPU and an NVIDIA Tesla V100 GPU to study the performance and energy efficiency of various devices, i.e., both memory-centric PIM systems and conventional processor-centric CPU/GPU systems, for the SpMV kernel. The SparseP software package provides 25 SpMV kernels for real PIM systems, supporting the four most widely used compressed matrix formats, i.e., CSR, COO, BCSR and BCOO, and a wide range of data types. SparseP is publicly and freely available at https://github.com/CMU-SAFARI/SparseP. Our extensive evaluation using 26 matrices with various sparsity patterns provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems.
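To make the two core ideas of the abstract concrete, the sketch below shows (1) SpMV over a CSR-compressed matrix and (2) a simple 1D row partitioning that balances nonzeros across cores. This is a minimal illustration in plain C, not SparseP's actual API: the names `CSRMatrix`, `csr_spmv`, and `partition_rows_by_nnz` are hypothetical, and the load-balancing rule (equal nonzeros per contiguous row block) is only one of the schemes the paper evaluates.

```c
/* Minimal sketch (hypothetical names, not the SparseP API):
 * CSR-based SpMV and nonzero-balanced 1D row partitioning. */
#include <stdio.h>

typedef struct {
    int nrows, ncols, nnz;
    int *row_ptr;   /* length nrows + 1: start of each row in col_idx/vals */
    int *col_idx;   /* length nnz: column index of each nonzero */
    double *vals;   /* length nnz: value of each nonzero */
} CSRMatrix;

/* y = A * x for a CSR matrix: each output row accumulates its own nonzeros. */
static void csr_spmv(const CSRMatrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->nrows; i++) {
        double acc = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            acc += A->vals[k] * x[A->col_idx[k]];
        y[i] = acc;
    }
}

/* 1D partitioning: assign contiguous row blocks to n_cores cores so that each
 * block holds roughly nnz / n_cores nonzeros. row_start has n_cores + 1 slots;
 * core c owns rows [row_start[c], row_start[c+1]). */
static void partition_rows_by_nnz(const CSRMatrix *A, int n_cores, int *row_start)
{
    int target = (A->nnz + n_cores - 1) / n_cores;
    int core = 0, count = 0;
    row_start[0] = 0;
    for (int i = 0; i < A->nrows && core < n_cores - 1; i++) {
        count += A->row_ptr[i + 1] - A->row_ptr[i];
        if (count >= target) {
            row_start[++core] = i + 1;
            count = 0;
        }
    }
    while (core < n_cores - 1)
        row_start[++core] = A->nrows;   /* leftover cores get empty blocks */
    row_start[n_cores] = A->nrows;
}

int main(void)
{
    /* 4x4 example matrix with 6 nonzeros, stored in CSR form. */
    int row_ptr[] = {0, 2, 3, 5, 6};
    int col_idx[] = {0, 2, 1, 0, 3, 2};
    double vals[]  = {1, 2, 3, 4, 5, 6};
    CSRMatrix A = {4, 4, 6, row_ptr, col_idx, vals};
    double x[] = {1, 1, 1, 1}, y[4];

    csr_spmv(&A, x, y);
    for (int i = 0; i < A.nrows; i++)
        printf("y[%d] = %g\n", i, y[i]);

    int row_start[3];                   /* 2 cores -> 3 boundaries */
    partition_rows_by_nnz(&A, 2, row_start);
    printf("core 0: rows [%d, %d), core 1: rows [%d, %d)\n",
           row_start[0], row_start[1], row_start[1], row_start[2]);
    return 0;
}
```

In this toy example the two row blocks each receive three nonzeros, so both cores do equal work even though the matrix's nonzeros are unevenly spread across rows; a 2D-partitioned kernel would additionally split columns to trade off this balance against the cost of transferring the input vector to PIM-enabled memory.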