Javier Campos, Jovan Mitrevski, Nhan Tran, Zhen Dong, Amir Gholaminejad, Michael W. Mahoney, Javier Duarte
We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpiling NNs into FPGA firmware. This makes efficient NN implementations in hardware accessible to nonexperts in a single open-source workflow that can be deployed for real-time machine-learning applications in a wide range of scientific and industrial settings. We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the CERN Large Hadron Collider (LHC). Given the high collision rate, all data processing must be implemented on FPGA hardware within strict area and latency constraints. Based on these constraints, we implement an optimized mixed-precision NN classifier for high-momentum particle jets in simulated LHC proton-proton collisions.
{"title":"End-to-end codesign of Hessian-aware quantized neural networks for FPGAs","authors":"Javier Campos, Jovan Mitrevski, Nhan Tran, Zhen Dong, Amir Gholaminejad, Michael W. Mahoney, Javier Duarte","doi":"10.1145/3662000","DOIUrl":"https://doi.org/10.1145/3662000","url":null,"abstract":"<p>We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpiling NNs into FPGA firmware. This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow that can be deployed for real-time machine-learning applications in a wide range of scientific and industrial settings. We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the CERN Large Hadron Collider (LHC). Given the high collision rate, all data processing must be implemented on FPGA hardware within the strict area and latency requirements. Based on these constraints, we implement an optimized mixed-precision NN classifier for high-momentum particle jets in simulated LHC proton-proton collisions.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"44 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140940542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J.M. Pierre Langlois
Multipliers are widely used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is approximate computing. This paper therefore introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation outcomes, and accuracy evaluation of the proposed multiplier in INT8 precision. The paper also generalizes the proposed approximate multiplier to other datatypes, providing analysis and estimates of hardware cost and accuracy as a function of the multiplier parameters. Implementation results on an AMD-Xilinx Kintex UltraScale+ FPGA demonstrate savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulate configurations, respectively, compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate an average accuracy decrease of less than 0.29% in post-training deployment, with a maximum reduction below 0.33%. The source code of this work is available on GitHub.
{"title":"DyRecMul: Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration","authors":"Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J.M. Pierre Langlois","doi":"10.1145/3663480","DOIUrl":"https://doi.org/10.1145/3663480","url":null,"abstract":"<p>Multipliers are widely-used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is to use approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation outcomes, and accuracy evaluations of the multiplier proposed in INT8 precision. The paper also facilitates the generalization of the proposed approximate multiplier idea to other datatypes, providing analysis and estimations for hardware cost and accuracy as a function of multiplier parameters. Implementation results on an AMD-Xilinx Kintex Ultrascale+ FPGA demonstrate remarkable savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulation configurations, respectively when compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate a minimal average accuracy decrease of less than 0.29% during post-training deployment, with the maximum reduction staying less than 0.33%. The source code of this work is available on GitHub.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"31 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oluwole Jaiyeoba, Kevin Skadron
Graph processing frameworks suffer performance degradation from under-utilization of available memory bandwidth, because graph traversal often exhibits poor locality. A prior work, ACTS [24], accelerates graph processing with FPGAs and High Bandwidth Memory (HBM). ACTS achieves locality by partitioning vertex-update messages (based on destination vertex IDs) generated online after active edges have been processed. This work introduces Dynamic-ACTS, which builds on the ideas in ACTS to support dynamic graphs. The key innovation is the use of a hash table to find the edges to be updated. Compared to Gunrock, a GPU graph engine, Dynamic-ACTS achieves a geometric mean speedup of 1.5X, with a maximum speedup of 4.6X. Compared to GraphLily, an FPGA-HBM graph engine, Dynamic-ACTS achieves a geometric mean speedup of 3.6X, with a maximum speedup of 16.5X. Our results also show a geometric mean power reduction of 50% and a mean energy-delay-product reduction of 88% over Gunrock. Compared to GraSU, an FPGA graph-updating engine, Dynamic-ACTS achieves an average speedup of 15X.
{"title":"Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs","authors":"Oluwole Jaiyeoba, Kevin Skadron","doi":"10.1145/3662002","DOIUrl":"https://doi.org/10.1145/3662002","url":null,"abstract":"<p>Graph processing frameworks suffer performance degradation from under-utilization of available memory bandwidth, because graph traversal often exhibits poor locality. A prior work, ACTS [24], accelerates graph processing with FPGAs and High Bandwidth Memory (HBM). ACTS achieves locality by partitioning vertex-update messages (based on destination vertex IDs) generated online after active edges have been processed. This work introduces Dynamic-ACTS which builds on ideas in ACTS to support dynamic graphs. The key innovation is to use a hash table to find the edges to be updated. Compared to Gunrock, a GPU graph engine, Dynamic-ACTS achieves a geometric mean speedup of 1.5X, with a maximum speedup of 4.6X. Compared to GraphLily, an FPGA-HBM graph engine, Dynamic-ACTS achieves a geometric speedup of 3.6X, with a maximum speedup of 16.5X. Our results also showed a geometric mean power reduction of 50% and a mean reduction of energy-delay product of 88% over Gunrock. Compared to GraSU, an FPGA graph updating engine, Dynamic-ACTS achieves an average speedup of 15X.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"12 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Julian Haase, Najdet Charaf, Alexander Groß, Diana Göhringer
As runtime reconfiguration is used in an increasing number of hardware architectures, new simulation and modeling tools are needed to support the developer during the design phases. In this article, a language extension for SystemC is presented, together with a design methodology for the description and simulation of dynamically reconfigurable hardware at different levels of abstraction. The presented library offers a high degree of flexibility in the description of reconfiguration features and their management, while allowing the simulation of runtime reconfiguration and the removal and replacement of custom modules as well as third-party components throughout the architecture development process. In addition, our approach supports the emerging concepts of nested reconfiguration and split regions with minimal simulation overhead: at most three delta cycles for signal and transaction forwarding and four delta cycles for the reconfiguration process.
{"title":"NC-Library: Expanding SystemC Capabilities for Nested reConfigurable Hardware Modelling","authors":"Julian Haase, Najdet Charaf, Alexander Groß, Diana Göhringer","doi":"10.1145/3662001","DOIUrl":"https://doi.org/10.1145/3662001","url":null,"abstract":"<p>As runtime reconfiguration is used in an increasing number of hardware architectures, new simulation and modeling tools are needed to support the developer during the design phases. In this article, a language extension for SystemC is presented, together with a design methodology for the description and simulation of dynamically reconfigurable hardware at different levels of abstraction. The library presented offers a high degree of flexibility in the description of reconfiguration features and their management, while allowing runtime reconfiguration simulation, removal, and replacement of custom modules as well as third-party components throughout the architecture development process. In addition, our approach supports the emerging concept of nested reconfiguration and split regions with a minimal simulation overhead of a maximum of three delta cycles for signal and transaction forwarding, and four delta cycles for the reconfiguration process.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"157 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140798806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah
Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs), especially convolutional neural networks (CNNs). Recently, product quantization (PQ) has been applied to these workloads, replacing MACs with memory lookups of pre-computed dot products. To better understand the efficiency tradeoffs of product-quantized DNNs (PQ-DNNs), we create a custom hardware accelerator to parallelize and accelerate nearest-neighbor search and dot-product lookups. Additionally, we perform an empirical study to investigate the efficiency–accuracy tradeoffs of different PQ parameterizations and training methods. We identify PQ configurations that improve performance-per-area for ResNet20 by up to 3.1×, even when compared to a highly optimized conventional DNN accelerator, with similar improvements on two additional compact DNNs. Compared to recent PQ solutions, we outperform prior work by 4× in performance-per-area with a 0.6% accuracy degradation. Finally, we reduce the bitwidth of PQ operations to investigate the impact on both hardware efficiency and accuracy. With only 2- to 6-bit precision on three compact DNNs, we maintain DNN accuracy while eliminating the need for DSPs.
{"title":"PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration","authors":"Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah","doi":"10.1145/3656643","DOIUrl":"https://doi.org/10.1145/3656643","url":null,"abstract":"<p>Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs), especially convolutional neural networks (CNNs). Recently, product quantization (PQ) has been applied to these workloads, replacing MACs with memory lookups to pre-computed dot products. To better understand the efficiency tradeoffs of product-quantized DNNs (PQ-DNNs), we create a custom hardware accelerator to parallelize and accelerate nearest-neighbor search and dot-product lookups. Additionally, we perform an empirical study to investigate the efficiency–accuracy tradeoffs of different PQ parameterizations and training methods. We identify PQ configurations that improve performance-per-area for ResNet20 by up to 3.1 ×, even when compared to a highly optimized conventional DNN accelerator, with similar improvements on two additional compact DNNs. When comparing to recent PQ solutions, we outperform prior work by 4 × in terms of performance-per-area with a 0.6% accuracy degradation. Finally, we reduce the bitwidth of PQ operations to investigate the impact on both hardware efficiency and accuracy. With only 2–6-bit precision on three compact DNNs, we were able to maintain DNN accuracy eliminating the need for DSPs.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"4 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140617833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniel Hutchings, Adam Taylor, Jeffrey Goeders
Current IP encryption methods offered by FPGA vendors use an approach in which the IP is decrypted during the CAD flow and remains unencrypted in the bitstream. Given the ready availability of modern bitstream-to-netlist tools, encrypted IP is vulnerable to inspection and theft by the IP user. While the entire bitstream can be encrypted, this is done by the user and is not a mechanism to protect the confidentiality of third-party IP.
In this work, we present a design methodology, along with a proof-of-concept tool, that demonstrates how IP can remain partially encrypted through the CAD flow and into the bitstream. We show how this approach can support multiple encryption keys from different vendors and can be deployed using existing CAD tools and FPGA families. Our results document the benefits and costs of using such an approach to provide much greater protection for third-party IP.
{"title":"Toward FPGA Intellectual Property (IP) Encryption from Netlist to Bitstream","authors":"Daniel Hutchings, Adam Taylor, Jeffrey Goeders","doi":"10.1145/3656644","DOIUrl":"https://doi.org/10.1145/3656644","url":null,"abstract":"<p>Current IP encryption methods offered by FPGA vendors use an approach where the IP is decrypted during the CAD flow, and remains unencrypted in the bitstream. Given the ease of accessing modern bitstream-to-netlist tools, encrypted IP is vulnerable to inspection and theft from the IP user. While the entire bitstream can be encrypted, this is done by the user, and is not a mechanism to protect confidentiality of 3rd party IP. </p><p>In this work we present a design methodology, along with a proof-of-concept tool, that demonstrates how IP can remain partially encrypted through the CAD flow and into the bitstream. We show how this approach can support multiple encryption keys from different vendors, and can be deployed using existing CAD tools and FPGA families. Our results document the benefits and costs of using such an approach to provide much greater protection for 3rd party IP.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"247 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sichao Chen, Chang Cai, Su Zheng, Jiangnan Li, Guowei Zhu, Jingyuan Li, Yazhou Yan, Yuan Dai, Wenbo Yin, Lingli Wang
Coarse-grained reconfigurable arrays (CGRAs) are promising design choices in computation-intensive domains since they can strike a balance between energy efficiency and flexibility. A typical CGRA comprises processing elements (PEs) that execute the operations in applications and the interconnections between them. Nevertheless, most CGRAs are limited in their support for flexible architecture design and in their ability to solve large-scale mapping problems. To address these challenges, we introduce HierCGRA, a novel framework that integrates hierarchical CGRA modeling, Chisel-based Verilog generation, LLVM-based data flow graph (DFG) generation, DFG mapping, and design space exploration (DSE). With its graph homomorphism (GH) mapping algorithm, HierCGRA achieves faster mapping and a higher PE utilization rate compared with existing state-of-the-art CGRA frameworks. The proposed hierarchical mapping strategy achieves a 41× speedup on average compared with the ILP mapping algorithm in CGRA-ME. Furthermore, the automated DSE based on Bayesian optimization achieves a significant performance improvement by exploiting the heterogeneity of PEs and interconnections. With these features, HierCGRA enables agile development of large-scale CGRAs and accelerates the search for better CGRA architectures.
{"title":"HierCGRA: A Novel Framework for Large-Scale CGRA with Hierarchical Modeling and Automated Design Space Exploration","authors":"Sichao Chen, Chang Cai, Su Zheng, Jiangnan Li, Guowei Zhu, Jingyuan Li, Yazhou Yan, Yuan Dai, Wenbo Yin, Lingli Wang","doi":"10.1145/3656176","DOIUrl":"https://doi.org/10.1145/3656176","url":null,"abstract":"<p>Coarse-grained reconfigurable arrays (CGRAs) are promising design choices in computation-intensive domains since they can strike a balance between energy efficiency and flexibility. A typical CGRA comprises processing elements (PEs) that can execute operations in applications and interconnections between them. Nevertheless, most CGRAs suffer from the ineffectiveness of supporting flexible architecture design and solving large-scale mapping problems. To address these challenges, we introduce HierCGRA, a novel framework that integrates hierarchical CGRA modeling, Chisel-based Verilog generation, LLVM-based data flow graph (DFG) generation, DFG mapping, and design space exploration (DSE). With the graph homomorphism (GH) mapping algorithm, HierCGRA achieves a faster mapping speed and higher PE utilization rate compared with the existing state-of-the-art CGRA frameworks. The proposed hierarchical mapping strategy achieves 41× speedup on average compared with the ILP mapping algorithm in CGRA-ME. Furthermore, the automated DSE based on Bayesian optimization achieves a significant performance improvement by the heterogeneity of PEs and interconnections. With these features, HierCGRA enables the agile development for large-scale CGRA and accelerates the process of finding a better CGRA architecture.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"34 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Barry de Bruin, Kanishkan Vadivel, Mark Wijtvliet, Pekka Jääskeläinen, Henk Corporaal
Emerging data-driven applications in the embedded, e-Health, and internet of things (IoT) domains require complex on-device signal analysis and data reduction to maximize energy efficiency on these energy-constrained devices. Coarse-grained reconfigurable architectures (CGRAs) have been proposed as a good compromise between flexibility and energy efficiency for ultra-low-power (ULP) signal processing. Existing CGRAs are often specialized and domain-specific or can only accelerate simple kernels, which makes accelerating complete applications on a CGRA while maintaining high energy efficiency an open issue. Moreover, the lack of instruction set architecture (ISA) standardization across CGRAs makes code generation using current compiler technology a major challenge. This work introduces R-Blocks, a ULP CGRA with a HW/SW co-design tool flow based on the OpenASIP toolset. This CGRA is extremely flexible due to its well-established VLIW-SIMD execution model and support for flexible SIMD processing, while maintaining extremely high energy efficiency using software bypassing, optimized instruction delivery, and local scratchpad memories. R-Blocks is synthesized in a commercial 22-nm FD-SOI technology and achieves a full-system energy efficiency of 115 MOPS/mW on a common FFT benchmark, 1.45× higher than a highly tuned embedded RISC-V processor. Comparable energy efficiency is obtained on multiple complex workloads, making R-Blocks a promising acceleration target for general-purpose computing.
{"title":"R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA","authors":"Barry de Bruin, Kanishkan Vadivel, Mark Wijtvliet, Pekka Jääskeläinen, Henk Corporaal","doi":"10.1145/3656642","DOIUrl":"https://doi.org/10.1145/3656642","url":null,"abstract":"<p>Emerging data-driven applications in the embedded, e-Health, and internet of things (IoT) domain require complex on-device signal analysis and data reduction to maximize energy efficiency on these energy-constrained devices. Coarse-grained reconfigurable architectures (CGRAs) have been proposed as a good compromise between flexibility and energy efficiency for ultra-low power (ULP) signal processing. Existing CGRAs are often specialized and domain-specific or can only accelerate simple kernels, which makes accelerating complete applications on a CGRA while maintaining high energy efficiency an open issue. Moreover, the lack of instruction set architecture (ISA) standardization across CGRAs makes code generation using current compiler technology a major challenge. This work introduces R-Blocks; a ULP CGRA with HW/SW co-design tool-flow based on the OpenASIP toolset. This CGRA is extremely flexible due to its well-established VLIW-SIMD execution model and support for flexible SIMD-processing, while maintaining an extremely high energy efficiency using software bypassing, optimized instruction delivery, and local scratchpad memories. R-Blocks is synthesized in a commercial 22-nm FD-SOI technology and achieves a full-system energy efficiency of 115 MOPS/mW on a common FFT benchmark, 1.45 × higher than a highly tuned embedded RISC-V processor. Comparable energy efficiency is obtained on multiple complex workloads, making R-Blocks a promising acceleration target for general-purpose computing.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"191 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang
Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart.
To enable more productive implementations of LLMs on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate that our approach can achieve up to a 13.4× speedup compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2× speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9× speedup and a 5.7× improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.
{"title":"Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference","authors":"Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang","doi":"10.1145/3656177","DOIUrl":"https://doi.org/10.1145/3656177","url":null,"abstract":"<p>Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. </p><p>This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. </p><p>To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4 × speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2 × speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9 × speedup and a 5.7 × improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"36 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sajjad Tamimi, Arthur Bernhardt, Florian Stock, Ilia Petrov, Andreas Koch
This paper introduces DANSEN, the hardware accelerator component of neoDBMS, a full-stack computational storage system designed to manage on-device execution of database queries/transactions as Near-Data Processing (NDP) operations. The proposed system enables Database Management Systems (DBMS) to offload NDP operations to the storage device while maintaining control over data through a native storage interface. DANSEN provides an NDP engine that enables the DBMS to perform both low-level database tasks, such as database administration, and high-level tasks, such as executing SQL, on the smart storage device while observing the DBMS concurrency control. Furthermore, DANSEN enables the incorporation of custom accelerators as NDP operations, e.g., to perform hardware-accelerated ML inference directly on the stored data. We built the DANSEN storage prototype and interface on an UltraScale+ HBM FPGA and fully integrated it with PostgreSQL 12. Experimental results demonstrate that the proposed NDP approach outperforms software-only PostgreSQL using a fast off-the-shelf NVMe drive and significantly improves the end-to-end execution time of an aggregation operation (similar to Q6 from CH-benCHmark, 150 million records) by ≈10.6×. The versatility of the proposed approach is also validated by integrating a compute-intensive data analytics application with multi-row results, outperforming PostgreSQL by ≈1.5×.
{"title":"DANSEN: Database Acceleration on Native Computational Storage by Exploiting NDP","authors":"Sajjad Tamimi, Arthur Bernhardt, Florian Stock, Ilia Petrov, Andreas Koch","doi":"10.1145/3655625","DOIUrl":"https://doi.org/10.1145/3655625","url":null,"abstract":"<p>This paper introduces <sans-serif>DANSEN</sans-serif>, the hardware accelerator component for neoDBMS, a full-stack computational storage system designed to manage on-device execution of database queries/transactions as a Near-Data Processing (NDP)-operation. The proposed system enables Database Management Systems (DBMS) to offload NDP-operations to the storage while maintaining control over data through a <i>native storage interface</i>. <sans-serif>DANSEN</sans-serif> provides an NDP-engine that enables DBMS to perform both low-level database tasks, such as performing database administration, as well as high-level tasks like executing SQL, <i>on</i> the smart storage device while observing the DBMS concurrency control. Furthermore, <sans-serif>DANSEN</sans-serif> enables the incorporation of custom accelerators as an NDP-operation, e.g., to perform hardware-accelerated ML inference directly on the stored data. We built the <sans-serif>DANSEN</sans-serif> storage prototype and interface on an Ultrascale+HBM FPGA and fully integrated it with PostgreSQL 12. Experimental results demonstrate that the proposed NDP approach outperforms software-only PostgreSQL using a fast off-the-shelf NVMe drive, and significantly improves the end-to-end execution time of an aggregation operation (similar to Q6 from CH-benCHmark, 150 million records) by ≈ 10.6 ×. The versatility of the proposed approach is also validated by integrating a compute-intensive data analytics application with multi-row results, outperforming PostgreSQL by ≈ 1.5 ×.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"101 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}