ScalaBFS2: A High Performance BFS Accelerator on an HBM-enhanced FPGA Chip

IF 2.8 | Region 4, Computer Science | Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | ACM Transactions on Reconfigurable Technology and Systems | Pub Date: 2024-02-29 | DOI: 10.1145/3650037

Kexin Li, Shaoxian Xu, Zhiyuan Shao, Ran Zheng, Xiaofei Liao, Hai Jin


Citations: 0

Abstract



ScalaBFS2: A High Performance BFS Accelerator on an HBM-enhanced FPGA Chip

The introduction of High Bandwidth Memory (HBM) to FPGA chips lets an FPGA-based accelerator exploit HBM's large memory bandwidth to improve its performance. This is especially true for the Breadth-First Search (BFS) algorithm, which demands high bandwidth when accessing graph data stored in memory. Unlike traditional FPGA-DRAM platforms, where memory bandwidth is the scarce resource because DRAM channels are few, FPGA chips equipped with HBM get much higher memory bandwidth from their many HBM channels but still have a limited amount of logic (LUT, FF, and BRAM/URAM) resources. Therefore, the key to designing a high-performance BFS accelerator on an HBM-enhanced FPGA chip is to use the logic resources efficiently to build as many Processing Elements (PEs) as possible, and to configure them flexibly to obtain the highest possible effective memory bandwidth (the bandwidth that is actually useful to the algorithm) from the HBM, rather than one-sidedly emphasizing absolute memory bandwidth. To that end, ScalaBFS2 conducts BFS in a vertex-centric manner and proposes several designs that use the FPGA logic resources efficiently: an independent module (HBM Reader) for memory access, a multi-layer crossbar, and PEs that implement hybrid-mode algorithm processing (i.e., they can work in both push and pull modes). Consequently, ScalaBFS2 builds up to 128 PEs on the XCU280 FPGA chip (produced with a 16nm process and configured with two HBM2 stacks) of a Xilinx Alveo U280 board, and achieves 56.92 GTEPS (Giga Traversed Edges Per Second) by fully using its 32 HBM memory channels. Compared with the state-of-the-art graph processing system (i.e., ReGraph) built on the same board, ScalaBFS2 achieves speedups of 2.52x to 4.40x. Moreover, compared with Gunrock running on an Nvidia A100 GPU (produced with a 7nm process and configured with five HBM2e stacks), ScalaBFS2 achieves speedups of 1.34x to 2.40x in absolute performance and 7.35x to 13.18x in power efficiency.
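The hybrid (push and pull) processing mentioned in the abstract can be illustrated in software. The following is a minimal Python sketch of a level-synchronous BFS that chooses a mode per level; it is not the ScalaBFS2 hardware design. The function name `hybrid_bfs`, the `pull_threshold` parameter, and the frontier-size switching rule are assumptions made for illustration, since the abstract does not say how the accelerator decides between modes.

```python
from typing import Dict, List, Set, Tuple


def hybrid_bfs(out_adj: Dict[int, List[int]],
               in_adj: Dict[int, List[int]],
               root: int,
               pull_threshold: float = 0.25) -> Tuple[Dict[int, int], List[str]]:
    """Level-synchronous BFS that runs each level in push or pull mode.

    pull_threshold is a hypothetical knob (not from the paper): a level runs
    in pull mode when the frontier holds more than this fraction of vertices.
    """
    num_vertices = len(out_adj)
    level: Dict[int, int] = {root: 0}
    frontier: Set[int] = {root}
    modes: List[str] = []
    depth = 0
    while frontier:
        depth += 1
        next_frontier: Set[int] = set()
        if len(frontier) > pull_threshold * num_vertices:
            # Pull: every unvisited vertex looks for a parent in the frontier
            # and stops scanning its in-neighbors at the first hit.
            modes.append("pull")
            for v in out_adj:
                if v in level:
                    continue
                for u in in_adj.get(v, []):
                    if u in frontier:
                        level[v] = depth
                        next_frontier.add(v)
                        break
        else:
            # Push: every frontier vertex visits its out-neighbors.
            modes.append("push")
            for u in frontier:
                for v in out_adj[u]:
                    if v not in level:
                        level[v] = depth
                        next_frontier.add(v)
        frontier = next_frontier
    return level, modes


# Small undirected example graph, so out- and in-adjacency are the same.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
levels, modes = hybrid_bfs(adj, adj, root=0)
print(levels)
print(modes)
```

Push mode does work proportional to the edges leaving the frontier, while pull mode lets each unvisited vertex stop at its first frontier neighbor, which tends to pay off when the frontier is large. Both modes produce the same BFS levels; only the memory access pattern differs.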


Source journal

ACM Transactions on Reconfigurable Technology and Systems (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

CiteScore: 4.90

Self-citation rate: 8.70%

Articles published: (not listed)

Review time: >12 weeks

Journal description: TRETS is the top journal focusing on research in, on, and with reconfigurable systems and on their underlying technology. The scope, rationale, and coverage by other journals are often limited to particular aspects of reconfigurable technology or reconfigurable systems. TRETS is a journal that covers reconfigurability in its own right. Appropriate topics for TRETS include all levels of reconfigurable system abstractions and all aspects of reconfigurable technology, including platforms, programming environments, and application successes that support these systems for computing or other applications:

- The board and systems architectures of a reconfigurable platform.
- Programming environments of reconfigurable systems, especially those designed for use with reconfigurable systems that will lead to increased programmer productivity.
- Languages and compilers for reconfigurable systems.
- Logic synthesis and related tools, as they relate to reconfigurable systems.
- Applications on which success can be demonstrated.
- The underlying technology from which reconfigurable systems are developed. (Currently this technology is that of FPGAs, but research on the nature and use of follow-on technologies is appropriate for TRETS.)

In considering whether a paper is suitable for TRETS, the foremost question should be whether reconfigurability has been essential to success. Topics such as architecture, programming languages, compilers and environments, logic synthesis, and high performance applications are all suitable if the context is appropriate. For example, an architecture for an embedded application that happens to use FPGAs is not necessarily suitable for TRETS, but an architecture using FPGAs for which the reconfigurability of the FPGAs is an inherent part of the specifications (perhaps due to a need for re-use on multiple applications) would be appropriate for TRETS.