DONGLE 2.0: Direct FPGA-Orchestrated NVMe Storage for HLS
Linus Y. Wong, Jialiang Zhang, Jing (Jane) Li
DOI: 10.1145/3650038

Rapid growth in data size poses significant computational and memory challenges for data processing. FPGA accelerators and near-storage processing have emerged as compelling solutions to these growing computational and memory demands. Many FPGA-based accelerators have proven effective at processing large data sets by leveraging the capacity of either host-attached or FPGA-attached storage devices. However, current HLS development environments do not allow direct access to host- or FPGA-attached NVMe storage from HLS code. As a result, users must frequently hand off between HLS and host code to access data in storage, a process that requires tedious programming to ensure functional correctness. Moreover, because HLS code accesses storage through radically different mechanisms than it uses for DRAM, an HLS codebase targeting DRAM-based platforms cannot easily be ported to NVMe-based platforms, limiting code portability and reusability. Furthermore, frequent suspension of the HLS kernel and synchronization between the CPU and FPGA introduce significant latency overhead and require sophisticated scheduling mechanisms to hide that latency.
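To make the hand-off problem concrete, the sketch below shows a conventional Vitis-HLS-style kernel of the kind this paragraph describes: the kernel can only dereference DRAM buffers exposed as m_axi interfaces, so the host must stage each chunk of NVMe-resident data into DRAM and relaunch the kernel. The kernel name and pragmas are illustrative, not taken from the paper.

```cpp
// Minimal Vitis HLS-style kernel: it can read only what the host has already
// staged into device DRAM (the m_axi buffer), never NVMe storage directly.
extern "C" void sum_chunk(const int* buf, int n, long* out) {
#pragma HLS INTERFACE m_axi port=buf offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return
    long acc = 0;
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        acc += buf[i];
    }
    // Host reads this partial result, stages the next chunk from NVMe into
    // DRAM, and relaunches the kernel -- the hand-off loop the paper targets.
    *out = acc;
}
```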
To address these challenges, we propose a new HLS storage interface named DONGLE 2.0 that enables direct FPGA-orchestrated NVMe storage access. By providing a unified interface for storage and memory access, DONGLE 2.0 allows a single-source HLS program to target multiple memory/storage devices, making the codebase cleaner, more portable, and more efficient. DONGLE 2.0 extends DONGLE 1.0 [1] with support for host-attached storage. While its primary focus remains FPGA NVMe access in near-storage configurations, the added host-storage support ensures compatibility with platforms that lack native support for FPGA-attached NVMe storage. We implemented a prototype of DONGLE 2.0 using an AMD/Xilinx Alveo U200 FPGA and a Solidigm DC-P4610 SSD. On various workloads, DONGLE 2.0 achieved a geometric-mean speedup of 2.3× and a 2.4× reduction in lines of code over the state-of-the-art commercial platform when using FPGA-attached NVMe storage, and a geometric-mean speedup of 1.5× with the same 2.4× code reduction when using host-attached NVMe storage.
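The abstract does not spell out DONGLE 2.0's actual API, so the following is a purely hypothetical sketch of what "a unified interface for storage and memory access" could look like from the HLS programmer's side: the same single-source kernel body compiles unchanged whether the buffer is bound to DRAM, FPGA-attached NVMe, or host-attached NVMe.

```cpp
// Hypothetical sketch only -- DONGLE 2.0's real interface is not given in
// this abstract. The point is single-source portability: the kernel body is
// identical regardless of where 'data' physically lives; only the host-side
// binding (DRAM buffer vs. NVMe namespace) would differ.
extern "C" void sum_all(const int* data, long n_elems, long* out) {
    long acc = 0;
    for (long i = 0; i < n_elems; ++i) {
        // With a unified interface, sequential reads here would be translated
        // by the interface layer into DRAM bursts or NVMe commands as needed.
        acc += data[i];
    }
    *out = acc;
}
```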
{"title":"DONGLE 2.0: Direct FPGA-Orchestrated NVMe Storage for HLS","authors":"Linus Y. Wong, Jialiang Zhang, Jing (Jane) Li","doi":"10.1145/3650038","DOIUrl":"https://doi.org/10.1145/3650038","url":null,"abstract":"<p>Rapid growth in data size poses significant computational and memory challenges to data processing. FPGA accelerators and near-storage processing have emerged as compelling solutions for tackling the growing computational and memory requirements. Many FPGA-based accelerators have shown to be effective in processing large data sets by leveraging the storage capability of either host-attached or FPGA-attached storage devices. However, the current HLS development environment does not allow direct access to host- or FPGA-attached NVMe storage from the HLS code. As such, users must frequently hand off between HLS and host code to access data in storage, and such a process requires tedious programming to ensure functional correctness. Moreover, since the HLS code uses radically different methods to access storage compared to DRAM, the HLS codebase targeting DRAM-based platforms cannot be easily ported to NVMe-based platforms, resulting in limited code portability and reusability. Furthermore, frequent suspension of HLS kernel and synchronization between CPU and FPGA introduce significant latency overhead and require sophisticated scheduling mechanisms to hide latency. </p><p>To address these challenges, we propose a new HLS storage interface named DONGLE 2.0 that enables direct FPGA-orchestrated NVMe storage access. By providing a unified interface for storage and memory access, DONGLE 2.0 allows a single-source HLS program to target multiple memory/storage devices, thus making the codebase cleaner, portable, and more efficient. DONGLE 2.0 is an extension to DONGLE 1.0 [1] but adds support for host-attached storage. While its primary focus is still on FPGA NVMe access in near-storage configurations, the added host storage support ensures its compatibility with platforms that lack native support for FPGA-attached NVMe storage. We implemented a prototype of DONGLE 2.0 using an AMD/Xilinx Alveo U200 FPGA and Solidigm DC-P4610 SSD. Our evaluation on various workloads showed a geometric mean speed-up of 2.3 × and a reduction in lines of code by 2.4 × compared to the state-of-the-art commercial platform when using FPGA-attached NVMe storage. Moreover, DONGLE 2.0 demonstrated a geometric mean speed-up of 1.5 × and a reduction in lines of code by 2.4 × compared to the state-of-the-art commercial platform when using host-attached NVMe storage.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140047928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ScalaBFS2: A High Performance BFS Accelerator on an HBM-enhanced FPGA Chip
Kexin Li, Shaoxian Xu, Zhiyuan Shao, Ran Zheng, Xiaofei Liao, Hai Jin
DOI: 10.1145/3650037
The introduction of High Bandwidth Memory (HBM) on FPGA chips makes it possible for an FPGA-based accelerator to leverage HBM's huge memory bandwidth, which is especially valuable for the Breadth-First Search (BFS) algorithm, whose performance hinges on high-bandwidth access to graph data stored in memory. Unlike traditional FPGA-DRAM platforms, where memory bandwidth is the scarce resource due to the limited number of DRAM channels, HBM-equipped FPGAs offer much higher memory bandwidth through their many HBM channels but still have a limited amount of logic resources (LUTs, FFs, and BRAM/URAM). The key to designing a high-performance BFS accelerator on an HBM-enhanced FPGA is therefore to use the logic resources efficiently to build as many Processing Elements (PEs) as possible and to configure them flexibly so as to extract the highest possible effective memory bandwidth, i.e., the bandwidth actually useful to the algorithm, from the HBM, rather than focusing narrowly on absolute memory bandwidth. To exploit this effective bandwidth, ScalaBFS2 performs BFS in a vertex-centric manner and proposes several designs, including an independent memory-access module (HBM Reader), a multi-layer crossbar, and PEs that implement hybrid-mode processing (i.e., capable of working in both push and pull modes), to use the FPGA logic resources efficiently. Consequently, ScalaBFS2 builds up to 128 PEs on the XCU280 FPGA chip (produced in a 16nm process and configured with two HBM2 stacks) of a Xilinx Alveo U280 board and achieves 56.92 GTEPS (Giga Traversed Edges Per Second) by fully using its 32 HBM memory channels. Compared with the state-of-the-art graph processing system (ReGraph) built on the same board, ScalaBFS2 achieves 2.52x ∼ 4.40x speedups. Moreover, compared with Gunrock running on an NVIDIA A100 GPU (produced in a 7nm process and configured with five HBM2e stacks), ScalaBFS2 achieves 1.34x ∼ 2.40x speedups in absolute performance and 7.35x ∼ 13.18x speedups in power efficiency.
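For readers unfamiliar with the push/pull duality that ScalaBFS2's hybrid-mode PEs exploit, here is a plain-software sketch (not the paper's hardware design) of one level of level-synchronous BFS in both modes, assuming an undirected graph in CSR form. Push scans the current frontier and writes to neighbors, which is cheap when the frontier is small; pull scans the unvisited vertices and reads their neighbors, which wins when the frontier is large; a hybrid scheme picks per level.

```cpp
#include <vector>

// Illustrative CSR graph; assumed undirected (symmetric adjacency), so a
// vertex's out-neighbors double as its in-neighbors for the pull direction.
struct CSRGraph {
    std::vector<int> offsets;    // size |V|+1
    std::vector<int> neighbors;  // size |E|
};

// Push: expand each frontier vertex, marking undiscovered neighbors.
void bfs_level_push(const CSRGraph& g, const std::vector<int>& frontier,
                    std::vector<int>& depth, std::vector<int>& next, int level) {
    for (int u : frontier)
        for (int i = g.offsets[u]; i < g.offsets[u + 1]; ++i) {
            int v = g.neighbors[i];
            if (depth[v] < 0) { depth[v] = level; next.push_back(v); }
        }
}

// Pull: each undiscovered vertex checks whether any neighbor joined the
// frontier in the previous level, and stops scanning at the first hit.
void bfs_level_pull(const CSRGraph& g, std::vector<int>& depth,
                    std::vector<int>& next, int level) {
    for (int v = 0; v < static_cast<int>(depth.size()); ++v) {
        if (depth[v] >= 0) continue;
        for (int i = g.offsets[v]; i < g.offsets[v + 1]; ++i)
            if (depth[g.neighbors[i]] == level - 1) {
                depth[v] = level; next.push_back(v); break;
            }
    }
}
```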
{"title":"ScalaBFS2: A High Performance BFS Accelerator on an HBM-enhanced FPGA Chip","authors":"Kexin Li, Shaoxian Xu, Zhiyuan Shao, Ran Zheng, Xiaofei Liao, Hai Jin","doi":"10.1145/3650037","DOIUrl":"https://doi.org/10.1145/3650037","url":null,"abstract":"<p>The introduction of High Bandwidth Memory (HBM) to the FPGA chip makes it possible for an FPGA-based accelerator to leverage the huge memory bandwidth of HBM to improve its performance when implementing a specific algorithm, which is especially true for the Breadth-First Search (BFS) algorithm that demands a high bandwidth on accessing the graph data stored in memory. Different from traditional FPGA-DRAM platforms where memory bandwidth is the precious resource due to the limited DRAM channels, FPGA chips equipped with HBM have much higher memory bandwidths provided by the large quantities of HBM channels, but still limited amount of logic (LUT, FF, and BRAM/URAM) resources. Therefore, the key to design a high performance BFS accelerator on an HBM-enhanced FPGA chip is to efficiently use the logic resources to build as many as possible Processing Elements (PEs), and configure them flexibly to obtain as high as possible <i>effective memory bandwidth</i> that is useful to the algorithm from the HBM, rather than partially emphasizing the absolute memory bandwidth. To exploit as high as possible effective bandwidth from the HBM, ScalaBFS2 conducts BFS in graphs with the vertex-centric manner, and proposes designs, including the independent module (HBM Reader) for memory accessing, multi-layer crossbar, and PEs that implement hybrid mode (i.e., capable of working in both push and pull modes) algorithm processing, to utilize the FPGA logic resources efficiently. Consequently, ScalaBFS2 is able to build up to 128 PEs on the XCU280 FPGA chip (produced with the 16nm process and configured with two HBM2 stacks) of a Xilinx Alveo U280 board, and achieves the performance of 56.92 GTEPS (Giga Traversed Edges Per Second) by fully using its 32 HBM memory channels. Compared with the state-of-the-art graph processing system (i.e., ReGraph) built on top of the same board, ScalaBFS2 achieves 2.52x ∼ 4.40x performance speedups. Moreover, when compared with Gunrock running on an Nvidia A100 GPU that is produced with the 7nm process and configured with five HBM2e stacks, ScalaBFS2 achieves 1.34x ∼ 2.40x speedups on absolute performance, and 7.35x ∼ 13.18x speedups on power efficiency.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140002890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the increasing application of machine learning (ML) algorithms in embedded systems, there is a growing need for low-cost computer arithmetic in these resource-constrained systems. As a result, emerging models of computation, such as approximate and stochastic computing, which leverage the inherent error resilience of such algorithms, are being actively explored for implementing ML inference on resource-constrained systems. Approximate computing (AxC) aims to provide disproportionate gains in the power, performance, and area (PPA) of an application by allowing some reduction in its behavioral accuracy (BEHAV). Using approximate operators (AxOs) for computer arithmetic is one of the more prevalent methods of implementing AxC. AxOs offer a finer granularity of optimization than precision scaling of computer arithmetic alone. Consequently, the design of platform-specific and cost-efficient approximate operators forms an important research goal. Recently, multiple works have reported the use of AI/ML-based approaches for synthesizing novel FPGA-based AxOs. However, most such works limit the use of AI/ML to designing ML-based surrogate functions that are used during iterative optimization. To this end, we propose a novel data-analysis-driven, mathematical-programming-based approach to synthesizing approximate operators for FPGAs. Specifically, we formulate mixed integer quadratically constrained programs
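This abstract is cut off mid-sentence in the source. As background for what an AxO is, the sketch below shows a classic example from the approximate-arithmetic literature, not from this paper: a lower-part OR adder (LOA) in its basic form, which replaces the carry chain of the k low bits with a bitwise OR, trading accuracy (BEHAV) for lower PPA cost on an FPGA.

```cpp
#include <cstdint>

// Basic lower-part OR adder (LOA), an illustrative approximate operator:
// the upper (8-k) bits are added exactly, while the k low bits are
// approximated with bitwise OR, shortening the carry chain. Larger k means
// cheaper hardware but lower behavioral accuracy.
static inline uint16_t loa_add8(uint8_t a, uint8_t b, int k) {
    uint8_t  mask = static_cast<uint8_t>((1u << k) - 1u);
    uint16_t hi   = static_cast<uint16_t>(a >> k) +
                    static_cast<uint16_t>(b >> k);      // exact upper part
    uint8_t  lo   = static_cast<uint8_t>((a | b) & mask); // approximate lower part
    return static_cast<uint16_t>((hi << k) | lo);
}
```

With k = 0 the operator degenerates to an exact adder; sweeping k traces out the kind of PPA-versus-BEHAV trade-off curve that AxO design-space exploration navigates.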