ScalaBFS2:基于 HBM 增强型 FPGA 芯片的高性能 BFS 加速器

IF 3.1 4区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2024-02-29 DOI:10.1145/3650037
Kexin Li, Shaoxian Xu, Zhiyuan Shao, Ran Zheng, Xiaofei Liao, Hai Jin
{"title":"ScalaBFS2:基于 HBM 增强型 FPGA 芯片的高性能 BFS 加速器","authors":"Kexin Li, Shaoxian Xu, Zhiyuan Shao, Ran Zheng, Xiaofei Liao, Hai Jin","doi":"10.1145/3650037","DOIUrl":null,"url":null,"abstract":"<p>The introduction of High Bandwidth Memory (HBM) to the FPGA chip makes it possible for an FPGA-based accelerator to leverage the huge memory bandwidth of HBM to improve its performance when implementing a specific algorithm, which is especially true for the Breadth-First Search (BFS) algorithm that demands a high bandwidth on accessing the graph data stored in memory. Different from traditional FPGA-DRAM platforms where memory bandwidth is the precious resource due to the limited DRAM channels, FPGA chips equipped with HBM have much higher memory bandwidths provided by the large quantities of HBM channels, but still limited amount of logic (LUT, FF, and BRAM/URAM) resources. Therefore, the key to design a high performance BFS accelerator on an HBM-enhanced FPGA chip is to efficiently use the logic resources to build as many as possible Processing Elements (PEs), and configure them flexibly to obtain as high as possible <i>effective memory bandwidth</i> that is useful to the algorithm from the HBM, rather than partially emphasizing the absolute memory bandwidth. To exploit as high as possible effective bandwidth from the HBM, ScalaBFS2 conducts BFS in graphs with the vertex-centric manner, and proposes designs, including the independent module (HBM Reader) for memory accessing, multi-layer crossbar, and PEs that implement hybrid mode (i.e., capable of working in both push and pull modes) algorithm processing, to utilize the FPGA logic resources efficiently. Consequently, ScalaBFS2 is able to build up to 128 PEs on the XCU280 FPGA chip (produced with the 16nm process and configured with two HBM2 stacks) of a Xilinx Alveo U280 board, and achieves the performance of 56.92 GTEPS (Giga Traversed Edges Per Second) by fully using its 32 HBM memory channels. Compared with the state-of-the-art graph processing system (i.e., ReGraph) built on top of the same board, ScalaBFS2 achieves 2.52x ∼ 4.40x performance speedups. Moreover, when compared with Gunrock running on an Nvidia A100 GPU that is produced with the 7nm process and configured with five HBM2e stacks, ScalaBFS2 achieves 1.34x ∼ 2.40x speedups on absolute performance, and 7.35x ∼ 13.18x speedups on power efficiency.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":3.1000,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ScalaBFS2: A High Performance BFS Accelerator on an HBM-enhanced FPGA Chip\",\"authors\":\"Kexin Li, Shaoxian Xu, Zhiyuan Shao, Ran Zheng, Xiaofei Liao, Hai Jin\",\"doi\":\"10.1145/3650037\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The introduction of High Bandwidth Memory (HBM) to the FPGA chip makes it possible for an FPGA-based accelerator to leverage the huge memory bandwidth of HBM to improve its performance when implementing a specific algorithm, which is especially true for the Breadth-First Search (BFS) algorithm that demands a high bandwidth on accessing the graph data stored in memory. Different from traditional FPGA-DRAM platforms where memory bandwidth is the precious resource due to the limited DRAM channels, FPGA chips equipped with HBM have much higher memory bandwidths provided by the large quantities of HBM channels, but still limited amount of logic (LUT, FF, and BRAM/URAM) resources. Therefore, the key to design a high performance BFS accelerator on an HBM-enhanced FPGA chip is to efficiently use the logic resources to build as many as possible Processing Elements (PEs), and configure them flexibly to obtain as high as possible <i>effective memory bandwidth</i> that is useful to the algorithm from the HBM, rather than partially emphasizing the absolute memory bandwidth. To exploit as high as possible effective bandwidth from the HBM, ScalaBFS2 conducts BFS in graphs with the vertex-centric manner, and proposes designs, including the independent module (HBM Reader) for memory accessing, multi-layer crossbar, and PEs that implement hybrid mode (i.e., capable of working in both push and pull modes) algorithm processing, to utilize the FPGA logic resources efficiently. Consequently, ScalaBFS2 is able to build up to 128 PEs on the XCU280 FPGA chip (produced with the 16nm process and configured with two HBM2 stacks) of a Xilinx Alveo U280 board, and achieves the performance of 56.92 GTEPS (Giga Traversed Edges Per Second) by fully using its 32 HBM memory channels. Compared with the state-of-the-art graph processing system (i.e., ReGraph) built on top of the same board, ScalaBFS2 achieves 2.52x ∼ 4.40x performance speedups. Moreover, when compared with Gunrock running on an Nvidia A100 GPU that is produced with the 7nm process and configured with five HBM2e stacks, ScalaBFS2 achieves 1.34x ∼ 2.40x speedups on absolute performance, and 7.35x ∼ 13.18x speedups on power efficiency.</p>\",\"PeriodicalId\":49248,\"journal\":{\"name\":\"ACM Transactions on Reconfigurable Technology and Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2024-02-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Reconfigurable Technology and Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3650037\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Reconfigurable Technology and Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3650037","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0

摘要

在 FPGA 芯片中引入高带宽内存 (HBM),使得基于 FPGA 的加速器在执行特定算法时可以利用 HBM 的巨大内存带宽来提高性能,这对于访问存储在内存中的图形数据时需要高带宽的广度优先搜索 (BFS) 算法来说尤其如此。与传统的 FPGA-DRAM 平台不同,传统的 FPGA-DRAM 平台由于 DRAM 通道有限,因此内存带宽是宝贵的资源,而配备 HBM 的 FPGA 芯片由于拥有大量的 HBM 通道,因此内存带宽要高得多,但逻辑(LUT、FF 和 BRAM/URAM)资源仍然有限。因此,在 HBM 增强型 FPGA 芯片上设计高性能 BFS 加速器的关键是有效利用逻辑资源,构建尽可能多的处理单元 (PE),并灵活配置这些处理单元,以便从 HBM 中获得对算法有用的尽可能高的有效内存带宽,而不是片面强调绝对内存带宽。为了尽可能利用 HBM 的有效带宽,ScalaBFS2 以顶点为中心在图中进行 BFS,并提出了包括用于内存访问的独立模块(HBM 阅读器)、多层交叉条和实现混合模式(即能够在推模式和拉模式下工作)算法处理的 PE 等设计,以有效利用 FPGA 逻辑资源。因此,ScalaBFS2 能够在 Xilinx Alveo U280 板的 XCU280 FPGA 芯片(采用 16nm 工艺生产,配置了两个 HBM2 堆栈)上构建多达 128 个 PE,并通过充分利用其 32 个 HBM 内存通道实现了 56.92 GTEPS(每秒千兆遍历边)的性能。与基于同一板卡的最先进图处理系统(即 ReGraph)相比,ScalaBFS2 的性能提升了 2.52 倍~4.40 倍。此外,与运行在采用 7nm 工艺生产并配置了五个 HBM2e 堆栈的 Nvidia A100 GPU 上的 Gunrock 相比,ScalaBFS2 的绝对性能提高了 1.34 倍 ∼ 2.40 倍,能效提高了 7.35 倍 ∼ 13.18 倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
ScalaBFS2: A High Performance BFS Accelerator on an HBM-enhanced FPGA Chip

The introduction of High Bandwidth Memory (HBM) to the FPGA chip makes it possible for an FPGA-based accelerator to leverage the huge memory bandwidth of HBM to improve its performance when implementing a specific algorithm, which is especially true for the Breadth-First Search (BFS) algorithm that demands a high bandwidth on accessing the graph data stored in memory. Different from traditional FPGA-DRAM platforms where memory bandwidth is the precious resource due to the limited DRAM channels, FPGA chips equipped with HBM have much higher memory bandwidths provided by the large quantities of HBM channels, but still limited amount of logic (LUT, FF, and BRAM/URAM) resources. Therefore, the key to design a high performance BFS accelerator on an HBM-enhanced FPGA chip is to efficiently use the logic resources to build as many as possible Processing Elements (PEs), and configure them flexibly to obtain as high as possible effective memory bandwidth that is useful to the algorithm from the HBM, rather than partially emphasizing the absolute memory bandwidth. To exploit as high as possible effective bandwidth from the HBM, ScalaBFS2 conducts BFS in graphs with the vertex-centric manner, and proposes designs, including the independent module (HBM Reader) for memory accessing, multi-layer crossbar, and PEs that implement hybrid mode (i.e., capable of working in both push and pull modes) algorithm processing, to utilize the FPGA logic resources efficiently. Consequently, ScalaBFS2 is able to build up to 128 PEs on the XCU280 FPGA chip (produced with the 16nm process and configured with two HBM2 stacks) of a Xilinx Alveo U280 board, and achieves the performance of 56.92 GTEPS (Giga Traversed Edges Per Second) by fully using its 32 HBM memory channels. Compared with the state-of-the-art graph processing system (i.e., ReGraph) built on top of the same board, ScalaBFS2 achieves 2.52x ∼ 4.40x performance speedups. Moreover, when compared with Gunrock running on an Nvidia A100 GPU that is produced with the 7nm process and configured with five HBM2e stacks, ScalaBFS2 achieves 1.34x ∼ 2.40x speedups on absolute performance, and 7.35x ∼ 13.18x speedups on power efficiency.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
ACM Transactions on Reconfigurable Technology and Systems
ACM Transactions on Reconfigurable Technology and Systems COMPUTER SCIENCE, HARDWARE & ARCHITECTURE-
CiteScore
4.90
自引率
8.70%
发文量
79
审稿时长
>12 weeks
期刊介绍: TRETS is the top journal focusing on research in, on, and with reconfigurable systems and on their underlying technology. The scope, rationale, and coverage by other journals are often limited to particular aspects of reconfigurable technology or reconfigurable systems. TRETS is a journal that covers reconfigurability in its own right. Topics that would be appropriate for TRETS would include all levels of reconfigurable system abstractions and all aspects of reconfigurable technology including platforms, programming environments and application successes that support these systems for computing or other applications. -The board and systems architectures of a reconfigurable platform. -Programming environments of reconfigurable systems, especially those designed for use with reconfigurable systems that will lead to increased programmer productivity. -Languages and compilers for reconfigurable systems. -Logic synthesis and related tools, as they relate to reconfigurable systems. -Applications on which success can be demonstrated. The underlying technology from which reconfigurable systems are developed. (Currently this technology is that of FPGAs, but research on the nature and use of follow-on technologies is appropriate for TRETS.) In considering whether a paper is suitable for TRETS, the foremost question should be whether reconfigurability has been essential to success. Topics such as architecture, programming languages, compilers, and environments, logic synthesis, and high performance applications are all suitable if the context is appropriate. For example, an architecture for an embedded application that happens to use FPGAs is not necessarily suitable for TRETS, but an architecture using FPGAs for which the reconfigurability of the FPGAs is an inherent part of the specifications (perhaps due to a need for re-use on multiple applications) would be appropriate for TRETS.
期刊最新文献
Codesign of reactor-oriented hardware and software for cyber-physical systems Turn on, Tune in, Listen up: Maximizing Side-Channel Recovery in Cross-Platform Time-to-Digital Converters Efficient SpMM Accelerator for Deep Learning: Sparkle and Its Automated Generator End-to-end codesign of Hessian-aware quantized neural networks for FPGAs DyRecMul: Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1