GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD) Pub Date: 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643582

Yuwei Hu, Yixiao Du, Ecenur Ustun, Zhiru Zhang

Citations: 36

Abstract

Graph processing is typically memory bound because of its low compute-to-memory-access ratio and irregular data access patterns. Emerging high-bandwidth memory (HBM) delivers exceptional bandwidth by providing multiple channels that can service memory requests concurrently, which has the potential to significantly boost graph processing performance. This paper proposes GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming interface, which formulates graph algorithms as sparse linear algebra operations. GraphLily provides efficient, memory-optimized accelerators for the two widely used kernels in GraphBLAS: sparse-matrix dense-vector multiplication (SpMV) and sparse-matrix sparse-vector multiplication (SpMSpV). The SpMV accelerator uses a sparse matrix storage format tailored to HBM that enables streaming, vectorized accesses to each channel and concurrent accesses to multiple channels. In addition, the SpMV accelerator exploits data reuse in accesses to the dense vector through a scalable on-chip buffer design. The SpMSpV accelerator complements the SpMV accelerator by handling cases where the input vector is highly sparse. GraphLily also includes a middleware that provides runtime support. With this middleware, existing GraphBLAS programs can be ported to FPGAs with slight modifications to the original code intended for CPU/GPU execution. Evaluation shows that, compared with state-of-the-art graph processing frameworks on CPUs and GPUs, GraphLily achieves up to 2.5× and 1.1× higher throughput, respectively, while reducing energy consumption by 8.1× and 2.4×; compared with prior single-purpose graph accelerators on FPGAs, GraphLily achieves 1.2×–1.9× higher throughput.
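
To illustrate how a graph algorithm can be formulated as sparse linear algebra, the following is a minimal sketch of breadth-first search over the Boolean (OR, AND) semiring in plain Python. It is not GraphLily's API or hardware design: the function names (`spmv_or_and`, `spmspv_or_and`, `bfs_levels`), the adjacency-list representation, and the `dense_threshold` switching rule are all assumptions made for this example. It only aims to show why a dense-vector kernel and a sparse-vector kernel complement each other.

```python
from typing import List, Set


def spmv_or_and(in_nbrs: List[List[int]], x: List[bool]) -> List[bool]:
    """Matrix times dense vector over the Boolean (OR, AND) semiring.

    Row v lists the in-neighbours of v, so y[v] is True when any
    in-neighbour of v is set in x. Every row is visited.
    """
    return [any(x[u] for u in row) for row in in_nbrs]


def spmspv_or_and(out_nbrs: List[List[int]], x_nz: Set[int]) -> Set[int]:
    """Matrix times sparse vector over the same semiring.

    Only the columns selected by the nonzeros of x are touched, so the
    work scales with the frontier size, not with the whole matrix.
    """
    y_nz: Set[int] = set()
    for u in x_nz:
        y_nz.update(out_nbrs[u])
    return y_nz


def bfs_levels(out_nbrs: List[List[int]], source: int,
               dense_threshold: float = 0.25) -> List[int]:
    """BFS levels from `source`; unreachable vertices keep level -1.

    `dense_threshold` is a hypothetical heuristic for this sketch: the
    dense kernel is used when the frontier holds more than that
    fraction of the vertices, the sparse kernel otherwise.
    """
    n = len(out_nbrs)
    # Build the in-neighbour lists (the row view used by the dense kernel).
    in_nbrs: List[List[int]] = [[] for _ in range(n)]
    for u, nbrs in enumerate(out_nbrs):
        for v in nbrs:
            in_nbrs[v].append(u)

    level = [-1] * n
    level[source] = 0
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        if len(frontier) > dense_threshold * n:
            # Dense frontier: expand it into a dense Boolean vector.
            x = [False] * n
            for u in frontier:
                x[u] = True
            y = spmv_or_and(in_nbrs, x)
            reached = {v for v, hit in enumerate(y) if hit}
        else:
            # Sparse frontier: only follow edges leaving frontier vertices.
            reached = spmspv_or_and(out_nbrs, frontier)
        # Mask out vertices that already have a level assigned.
        frontier = {v for v in reached if level[v] == -1}
        for v in frontier:
            level[v] = depth
    return level
```

For the five-vertex graph `[[1, 2], [3], [3], [4], []]` with source 0, the sketch returns `[0, 1, 1, 2, 3]` regardless of which kernel handles each step; only the amount of data touched per step differs.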
