An efficient architecture for image descriptor matching that uses a partitioned content-addressable memory (CAM)-based approach is proposed. CAM is frequently used in high-speed content-matching applications. However, due to its lack of functionality to support approximate matching, conventional CAM is not directly useful for image descriptor matching. Our modifications improve the CAM architecture to support approximate content matching for selecting image matches with local binary descriptors. Matches are based on Hamming distances computed for all possible pairs of binary descriptors extracted from two images. We demonstrate an FPGA-based implementation of our CAM-based descriptor matching unit to illustrate the high matching speed of our design. The time complexity of our modified CAM method for binary descriptor matching is O(n). Our method performs binary descriptor matching at a rate of one descriptor per clock cycle at a frequency of 102 MHz. The resource utilization and timing metrics of several experiments are reported to demonstrate the efficacy and scalability of our design.
{"title":"A Partitioned CAM Architecture with FPGA Acceleration for Binary Descriptor Matching","authors":"Parastoo Soleimani, David W. Capson, Kin Fun Li","doi":"10.1145/3624749","DOIUrl":"https://doi.org/10.1145/3624749","url":null,"abstract":"An efficient architecture for image descriptor matching that uses a partitioned content-addressable memory (CAM)-based approach is proposed. CAM is frequently used in high-speed content-matching applications. However, due to its lack of functionality to support approximate matching, conventional CAM is not directly useful for image descriptor matching. Our modifications improve the CAM architecture to support approximate content matching for selecting image matches with local binary descriptors. Matches are based on Hamming distances computed for all possible pairs of binary descriptors extracted from two images. We demonstrate an FPGA-based implementation of our CAM-based descriptor matching unit to illustrate the high matching speed of our design. The time complexity of our modified CAM method for binary descriptor matching is O(n). Our method performs binary descriptor matching at a rate of one descriptor per clock cycle at a frequency of 102 MHz. The resource utilization and timing metrics of several experiments are reported to demonstrate the efficacy and scalability of our design.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135481609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexandre Honorat, Mickaël Dardaillon, Hugo Miomandre, Jean-François Nezan
High-Level Synthesis (HLS) tools are mature enough to provide efficient code generation for computation kernels on FPGA hardware. For more complex applications, multiple kernels may be connected by a dataflow graph. Although some tools, such as Xilinx Vitis HLS, support dataflow directives, they lack efficient analysis methods to compute the buffer sizes between kernels in a dataflow graph. This paper proposes an original method to safely approximate such buffer sizes. The first contribution computes an initial overestimation of buffer sizes, wihout knowing the memory access patterns of kernels. The second contribution iteratively refines those buffer sizes thanks to cosimulation. Moreover, the paper introduces an open source framework using these methods to facilitate dataflow programming on FPGA using HLS. The proposed methods and framework have been tested on 7 dataflow applications, and outperform Vitis HLS cosimulation in 5 benchmarks, either in terms of BRAM and LUT usage, or in term of exploration time. In the 2 other benchmarks, our best method gets results similar to Vitis HLS. Last but not least, our method admits directed cycles in the application graphs.
{"title":"Automated Buffer Sizing of Dataflow Applications in a High-Level Synthesis Workflow","authors":"Alexandre Honorat, Mickaël Dardaillon, Hugo Miomandre, Jean-François Nezan","doi":"10.1145/3626103","DOIUrl":"https://doi.org/10.1145/3626103","url":null,"abstract":"High-Level Synthesis (HLS) tools are mature enough to provide efficient code generation for computation kernels on FPGA hardware. For more complex applications, multiple kernels may be connected by a dataflow graph. Although some tools, such as Xilinx Vitis HLS, support dataflow directives, they lack efficient analysis methods to compute the buffer sizes between kernels in a dataflow graph. This paper proposes an original method to safely approximate such buffer sizes. The first contribution computes an initial overestimation of buffer sizes, wihout knowing the memory access patterns of kernels. The second contribution iteratively refines those buffer sizes thanks to cosimulation. Moreover, the paper introduces an open source framework using these methods to facilitate dataflow programming on FPGA using HLS. The proposed methods and framework have been tested on 7 dataflow applications, and outperform Vitis HLS cosimulation in 5 benchmarks, either in terms of BRAM and LUT usage, or in term of exploration time. In the 2 other benchmarks, our best method gets results similar to Vitis HLS. Last but not least, our method admits directed cycles in the application graphs.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135246216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This effort develops the first rich suite of analog & mixed-signal benchmark of various sizes and domains, intended for use with contemporary analog and mixed-signal designs and synthesis tools. Benchmarking enables analog-digital co-design exploration as well as extensive evaluation of analog synthesis tools and the generated analog/mixed-signal circuit or device. The goals of this effort are defining analog computation system benchmarks, developing the required concepts for higher-level analog & mixed-signal tools to utilize these benchmarks, and enabling future automated architectural design space exploration (DSE) to determine the best configurable architecture (e.g., a new FPAA) for a certain family of applications. The benchmarks comprise multiple levels of an acoustic , a vision , a communications , and an analog filter system that must be simultaneously satisfied for a complete system.
{"title":"Programmable Analog System Benchmarks leading to Efficient Analog Computation Synthesis","authors":"Jennifer Hasler, Cong Hao","doi":"10.1145/3625298","DOIUrl":"https://doi.org/10.1145/3625298","url":null,"abstract":"This effort develops the first rich suite of analog & mixed-signal benchmark of various sizes and domains, intended for use with contemporary analog and mixed-signal designs and synthesis tools. Benchmarking enables analog-digital co-design exploration as well as extensive evaluation of analog synthesis tools and the generated analog/mixed-signal circuit or device. The goals of this effort are defining analog computation system benchmarks, developing the required concepts for higher-level analog & mixed-signal tools to utilize these benchmarks, and enabling future automated architectural design space exploration (DSE) to determine the best configurable architecture (e.g., a new FPAA) for a certain family of applications. The benchmarks comprise multiple levels of an acoustic , a vision , a communications , and an analog filter system that must be simultaneously satisfied for a complete system.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135246909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Olivia Weng, Gabriel Marcano, Vladimir Loncar, Alireza Khodamoradi, Abarajithan G, Nojan Sheybani, Andres Meza, Farinaz Koushanfar, Kristof Denolf, Javier Mauricio Duarte, Ryan Kastner
Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network’s skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware efficient implementation with minimal to no accuracy loss. We introduce Tailor , a codesign tool whose hardware-aware training algorithm gradually removes or shortens a fully trained network’s skip connections to lower their hardware cost. Tailor improves resource utilization by up to 34% for BRAMs, 13% for FFs, and 16% for LUTs for on-chip, dataflow-style architectures. Tailor increases performance by 30% and reduces memory bandwidth by 45% for a 2D processing element array architecture.
{"title":"<scp>Tailor</scp> : Altering Skip Connections for Resource-Efficient Inference","authors":"Olivia Weng, Gabriel Marcano, Vladimir Loncar, Alireza Khodamoradi, Abarajithan G, Nojan Sheybani, Andres Meza, Farinaz Koushanfar, Kristof Denolf, Javier Mauricio Duarte, Ryan Kastner","doi":"10.1145/3624990","DOIUrl":"https://doi.org/10.1145/3624990","url":null,"abstract":"Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network’s skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware efficient implementation with minimal to no accuracy loss. We introduce Tailor , a codesign tool whose hardware-aware training algorithm gradually removes or shortens a fully trained network’s skip connections to lower their hardware cost. Tailor improves resource utilization by up to 34% for BRAMs, 13% for FFs, and 16% for LUTs for on-chip, dataflow-style architectures. Tailor increases performance by 30% and reduces memory bandwidth by 45% for a 2D processing element array architecture.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136059961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yingchun Lu, Yun Yang, Rong Hu, Huaguo Liang, Maoxiang Yi, Huang Zhengfeng, Yuanming Ma, Tian Chen, Liang Yao
Unpredictable true random numbers are required in security technology fields such as information encryption, key generation, mask generation for anti-side-channel analysis, algorithm initialization, etc. At present, the true random number generator (TRNG) is not enough to provide fast random bits by low-speed bits generation. Therefore, it is necessary to design a faster TRNG. This work presents an ultra-compact TRNG with high throughput based on a novel extendable dual-ring oscillator (DRO). Owing to multiple bits output per cycle in DRO can be used to obtain the original random sequence, the proposed DRO achieves a maximum resource utilization to build a more efficient TRNG, compared with the conventional TRNG system based on ring oscillator (RO), which only has a single output and needs to build multiple groups of ring oscillators. TRNG based on the 2-bit DRO and its 8-bit derivative structure has been verified on Xilinx Artix-7 and Kintex-7 FPGA under the automatic layout and routing and has achieved a throughput of 550Mbps and 1100Mbps respectively. Moreover, in terms of throughput performance over operating frequency, hardware consumption, and entropy, the proposed scheme has obvious advantages. Finally, the generated sequences show good randomness in the test of NIST SP800-22 and Dieharder test suite and pass the entropy estimation test kit NIST SP800-90B and AIS-31.
{"title":"High-Efficiency TRNG Design based on Multi-bit Dual-ring Oscillator","authors":"Yingchun Lu, Yun Yang, Rong Hu, Huaguo Liang, Maoxiang Yi, Huang Zhengfeng, Yuanming Ma, Tian Chen, Liang Yao","doi":"10.1145/3624991","DOIUrl":"https://doi.org/10.1145/3624991","url":null,"abstract":"Unpredictable true random numbers are required in security technology fields such as information encryption, key generation, mask generation for anti-side-channel analysis, algorithm initialization, etc. At present, the true random number generator (TRNG) is not enough to provide fast random bits by low-speed bits generation. Therefore, it is necessary to design a faster TRNG. This work presents an ultra-compact TRNG with high throughput based on a novel extendable dual-ring oscillator (DRO). Owing to multiple bits output per cycle in DRO can be used to obtain the original random sequence, the proposed DRO achieves a maximum resource utilization to build a more efficient TRNG, compared with the conventional TRNG system based on ring oscillator (RO), which only has a single output and needs to build multiple groups of ring oscillators. TRNG based on the 2-bit DRO and its 8-bit derivative structure has been verified on Xilinx Artix-7 and Kintex-7 FPGA under the automatic layout and routing and has achieved a throughput of 550Mbps and 1100Mbps respectively. Moreover, in terms of throughput performance over operating frequency, hardware consumption, and entropy, the proposed scheme has obvious advantages. Finally, the generated sequences show good randomness in the test of NIST SP800-22 and Dieharder test suite and pass the entropy estimation test kit NIST SP800-90B and AIS-31.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136153407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, Zhiru Zhang, Jason Cong
In this paper, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allow users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths. In addition, TAPA implements several optimization techniques specifically tailored for modern HBM-based FPGAs. In our experiments with a total of 43 designs, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The framework is available at https://github.com/UCLA-VAST/tapa and the core floorplan module is available at https://github.com/UCLA-VAST/AutoBridge .
{"title":"TAPA: A Scalable Task-Parallel Dataflow Programming Framework for Modern FPGAs with Co-Optimization of HLS and Physical Design","authors":"Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, Zhiru Zhang, Jason Cong","doi":"10.1145/3609335","DOIUrl":"https://doi.org/10.1145/3609335","url":null,"abstract":"In this paper, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allow users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths. In addition, TAPA implements several optimization techniques specifically tailored for modern HBM-based FPGAs. In our experiments with a total of 43 designs, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The framework is available at https://github.com/UCLA-VAST/tapa and the core floorplan module is available at https://github.com/UCLA-VAST/AutoBridge .","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135153380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Louis Noyez, Nadia El Mrabet, Olivier Potin, Pascal Veron
This paper describes an extensive study of the use of DSP48E2 Slices in Ultrascale FPGAs to design hardware versions of the Montgomery Multiplication algorithm for the hardware acceleration of modular multiplications. Our fully scalable systolic architectures result in parallelized, DSP48E2-optimized scheduling of operations analogous to the FIOS block variant of the Montgomery Multiplication. We explore the impacts of different pipelining strategies within DSP blocks, scheduling of operations, processing element configurations, global design structures and their trade-offs in terms of performance and resource costs. We discuss the application of our methodology to multiple types of DSP primitives. We provide ready to use fast, efficient and fully parametrizable designs which can adapt to a wide range of requirements and applications. Implementations are scalable to any operand width. Our most efficient designs can perform 128, 256, 512, 1024, 2048 and 4096 bits Montgomery modular multiplications in 0.0992 μ s, 0.2032 μ s, 0.3952 μ s, 0.7792 μ s, 1.550 μ s and 3.099 μ s using 4, 6, 11, 21, 41 and 82 DSP blocks respectively.
{"title":"Montgomery Multiplication Scalable Systolic Designs Optimized for DSP48E2","authors":"Louis Noyez, Nadia El Mrabet, Olivier Potin, Pascal Veron","doi":"10.1145/3624571","DOIUrl":"https://doi.org/10.1145/3624571","url":null,"abstract":"This paper describes an extensive study of the use of DSP48E2 Slices in Ultrascale FPGAs to design hardware versions of the Montgomery Multiplication algorithm for the hardware acceleration of modular multiplications. Our fully scalable systolic architectures result in parallelized, DSP48E2-optimized scheduling of operations analogous to the FIOS block variant of the Montgomery Multiplication. We explore the impacts of different pipelining strategies within DSP blocks, scheduling of operations, processing element configurations, global design structures and their trade-offs in terms of performance and resource costs. We discuss the application of our methodology to multiple types of DSP primitives. We provide ready to use fast, efficient and fully parametrizable designs which can adapt to a wide range of requirements and applications. Implementations are scalable to any operand width. Our most efficient designs can perform 128, 256, 512, 1024, 2048 and 4096 bits Montgomery modular multiplications in 0.0992 μ s, 0.2032 μ s, 0.3952 μ s, 0.7792 μ s, 1.550 μ s and 3.099 μ s using 4, 6, 11, 21, 41 and 82 DSP blocks respectively.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135396531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuanlong Xiao, Dongjoon Park, Zeyu Jason Niu, Aditya Hota, André DeHon
Partial Reconfiguration (PR) is a key technique in the application design on modern FPGAs. However, current PR tools heavily rely on the developer to manually conduct PR module definition, floorplanning, and flow control at a low level. The existing PR tools do not consider High-Level-Synthesis languages either, which are of great interest to software developers. We propose HiPR, an open-source framework, to bridge the gap between HLS and PR. HiPR allows the developer to define partially reconfigurable C/C++ functions, instead of Verilog modules, to accelerate the FPGA incremental compilation and automate the flow from C/C++ to bitstreams. We use a lightweight Simulated Annealing floorplanner and show that it can produce high-quality PR floorplans an order of magnitude faster than analytic methods. By mapping Rosetta HLS benchmarks, we demonstrate that the incremental compilation can be accelerated by 3–10 × compared with state-of-the-art Xilinx Vitis flow without performance loss, at the cost of 15-67% one-time overlay set-up time.
{"title":"ExHiPR: Extended High-level Partial Reconfiguration for Fast Incremental FPGA Compilation","authors":"Yuanlong Xiao, Dongjoon Park, Zeyu Jason Niu, Aditya Hota, André DeHon","doi":"10.1145/3617837","DOIUrl":"https://doi.org/10.1145/3617837","url":null,"abstract":"Partial Reconfiguration (PR) is a key technique in the application design on modern FPGAs. However, current PR tools heavily rely on the developer to manually conduct PR module definition, floorplanning, and flow control at a low level. The existing PR tools do not consider High-Level-Synthesis languages either, which are of great interest to software developers. We propose HiPR, an open-source framework, to bridge the gap between HLS and PR. HiPR allows the developer to define partially reconfigurable C/C++ functions, instead of Verilog modules, to accelerate the FPGA incremental compilation and automate the flow from C/C++ to bitstreams. We use a lightweight Simulated Annealing floorplanner and show that it can produce high-quality PR floorplans an order of magnitude faster than analytic methods. By mapping Rosetta HLS benchmarks, we demonstrate that the incremental compilation can be accelerated by 3–10 × compared with state-of-the-art Xilinx Vitis flow without performance loss, at the cost of 15-67% one-time overlay set-up time.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134912217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the trend towards ever larger “big data” applications, many of the gains achievable by using specialized compute accelerators become diminished due to the growing I/O overheads. While there have been several research efforts into computational storage and FPGA implementations of the NVMe interface, to our knowledge there have been only very limited efforts to move larger parts of the Linux block I/O stack into FPGA-based hardware accelerators. Our hardware/software framework DeLiBA initially addressed this deficiency by allowing high-productivity development of software components of the I/O stack in user instead of kernel space and leverages a proven FPGA SoC framework to quickly compose and deploy the actual FPGA-based I/O accelerators. In its initial form, it achieves 10% higher throughput and up to 2.3× the I/Os per second (IOPS) for a proof-of-concept Ceph accelerator running in a real multi-node Ceph cluster. In DeLiBA2, we have extended the framework further to better support distributed storage systems, specifically by directly integrating the block I/O accelerators with a hardware-accelerated network stack, as well as by accelerating more storage functions. With these improvements, performance grows significantly: The cluster-level speed-ups now reach up to 2.8× for both throughput and IOPS relative to Ceph in software in synthetic benchmarks, and achieve end-to-end wall clock speed-ups of 20% for the real workload of building a large software package.
随着越来越大的“大数据”应用程序的趋势,由于I/O开销的增加,使用专门的计算加速器可以获得的许多收益会减少。虽然已经有一些关于NVMe接口的计算存储和FPGA实现的研究工作,但据我们所知,将Linux块I/O堆栈的大部分移动到基于FPGA的硬件加速器上的努力非常有限。我们的硬件/软件框架DeLiBA最初通过允许在用户空间而不是内核空间中高生产率地开发I/O堆栈的软件组件来解决这一缺陷,并利用经过验证的FPGA SoC框架来快速组成和部署实际的基于FPGA的I/O加速器。在其初始形式中,对于在真实的多节点Ceph集群中运行的概念验证Ceph加速器,它实现了10%的高吞吐量和高达2.3倍的每秒I/ o (IOPS)。在DeLiBA2中,我们进一步扩展了框架,以更好地支持分布式存储系统,特别是通过直接将块I/O加速器与硬件加速的网络堆栈集成,以及通过加速更多的存储功能。有了这些改进,性能显著提高:在综合基准测试中,相对于软件中的Ceph,集群级的吞吐量和IOPS加速现在达到2.8倍,对于构建大型软件包的实际工作负载,端到端时钟加速达到20%。
{"title":"The Open-Source DeLiBA2 Hardware/Software Framework for Distributed Storage Accelerators","authors":"Babar Khan, Carsten Heinz, Andreas Koch","doi":"10.1145/3624482","DOIUrl":"https://doi.org/10.1145/3624482","url":null,"abstract":"With the trend towards ever larger “big data” applications, many of the gains achievable by using specialized compute accelerators become diminished due to the growing I/O overheads. While there have been several research efforts into computational storage and FPGA implementations of the NVMe interface, to our knowledge there have been only very limited efforts to move larger parts of the Linux block I/O stack into FPGA-based hardware accelerators. Our hardware/software framework DeLiBA initially addressed this deficiency by allowing high-productivity development of software components of the I/O stack in user instead of kernel space and leverages a proven FPGA SoC framework to quickly compose and deploy the actual FPGA-based I/O accelerators. In its initial form, it achieves 10% higher throughput and up to 2.3× the I/Os per second (IOPS) for a proof-of-concept Ceph accelerator running in a real multi-node Ceph cluster. In DeLiBA2, we have extended the framework further to better support distributed storage systems, specifically by directly integrating the block I/O accelerators with a hardware-accelerated network stack, as well as by accelerating more storage functions. With these improvements, performance grows significantly: The cluster-level speed-ups now reach up to 2.8× for both throughput and IOPS relative to Ceph in software in synthetic benchmarks, and achieve end-to-end wall clock speed-ups of 20% for the real workload of building a large software package.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134912617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in graph processing on FPGAs promise to alleviate performance bottlenecks with irregular memory access patterns. Such bottlenecks challenge performance for a growing number of important application areas like machine learning and data analytics. While FPGAs denote a promising solution through flexible memory hierarchies and massive parallelism, we argue that current graph processing accelerators either use the off-chip memory bandwidth inefficiently or do not scale well across memory channels. In this work, we propose GraphScale, a scalable graph processing framework for FPGAs. GraphScale combines multi-channel memory with asynchronous graph processing (i. e., for fast convergence on results) and a compressed graph representation (i. e., for efficient usage of memory bandwidth and reduced memory footprint). GraphScale solves common graph problems like breadth-first search, PageRank, and weakly-connected components through modular user-defined functions, a novel two-dimensional partitioning scheme, and a high-performance two-level crossbar design. Additionally, we extend GraphScale to scale to modern high-bandwidth memory (HBM) and reduce partitioning overhead of large graphs with binary packing.
{"title":"GraphScale: Scalable Processing on FPGAs for HBM and Large Graphs","authors":"Jonas Dann, Daniel Ritter, Holger Fröning","doi":"10.1145/3616497","DOIUrl":"https://doi.org/10.1145/3616497","url":null,"abstract":"Recent advances in graph processing on FPGAs promise to alleviate performance bottlenecks with irregular memory access patterns. Such bottlenecks challenge performance for a growing number of important application areas like machine learning and data analytics. While FPGAs denote a promising solution through flexible memory hierarchies and massive parallelism, we argue that current graph processing accelerators either use the off-chip memory bandwidth inefficiently or do not scale well across memory channels. In this work, we propose GraphScale, a scalable graph processing framework for FPGAs. GraphScale combines multi-channel memory with asynchronous graph processing (i. e., for fast convergence on results) and a compressed graph representation (i. e., for efficient usage of memory bandwidth and reduced memory footprint). GraphScale solves common graph problems like breadth-first search, PageRank, and weakly-connected components through modular user-defined functions, a novel two-dimensional partitioning scheme, and a high-performance two-level crossbar design. Additionally, we extend GraphScale to scale to modern high-bandwidth memory (HBM) and reduce partitioning overhead of large graphs with binary packing.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135739706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}