Javier Campos, Jovan Mitrevski, Nhan Tran, Zhen Dong, Amir Gholaminejad, Michael W. Mahoney, Javier Duarte
We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpiling NNs into FPGA firmware. This makes efficient NN implementations in hardware accessible to nonexperts in a single open-source workflow that can be deployed for real-time machine-learning applications in a wide range of scientific and industrial settings. We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the CERN Large Hadron Collider (LHC). Given the high collision rate, all data processing must be implemented on FPGA hardware within strict area and latency constraints. Based on these constraints, we implement an optimized mixed-precision NN classifier for high-momentum particle jets in simulated LHC proton-proton collisions.
{"title":"End-to-end codesign of Hessian-aware quantized neural networks for FPGAs","authors":"Javier Campos, Jovan Mitrevski, Nhan Tran, Zhen Dong, Amir Gholaminejad, Michael W. Mahoney, Javier Duarte","doi":"10.1145/3662000","DOIUrl":"https://doi.org/10.1145/3662000","url":null,"abstract":"<p>We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpiling NNs into FPGA firmware. This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow that can be deployed for real-time machine-learning applications in a wide range of scientific and industrial settings. We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the CERN Large Hadron Collider (LHC). Given the high collision rate, all data processing must be implemented on FPGA hardware within the strict area and latency requirements. Based on these constraints, we implement an optimized mixed-precision NN classifier for high-momentum particle jets in simulated LHC proton-proton collisions.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"44 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140940542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J.M. Pierre Langlois
Multipliers are widely used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is approximate computing. This paper therefore introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation outcomes, and accuracy evaluation of the proposed multiplier in INT8 precision. The paper also generalizes the proposed approximate multiplier to other datatypes, providing analysis and estimates of hardware cost and accuracy as a function of the multiplier parameters. Implementation results on an AMD-Xilinx Kintex UltraScale+ FPGA demonstrate savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulate configurations, respectively, compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate an average accuracy decrease of less than 0.29% in post-training deployment, with a maximum reduction below 0.33%. The source code of this work is available on GitHub.
{"title":"DyRecMul: Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration","authors":"Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J.M. Pierre Langlois","doi":"10.1145/3663480","DOIUrl":"https://doi.org/10.1145/3663480","url":null,"abstract":"<p>Multipliers are widely-used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is to use approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation outcomes, and accuracy evaluations of the multiplier proposed in INT8 precision. The paper also facilitates the generalization of the proposed approximate multiplier idea to other datatypes, providing analysis and estimations for hardware cost and accuracy as a function of multiplier parameters. Implementation results on an AMD-Xilinx Kintex Ultrascale+ FPGA demonstrate remarkable savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulation configurations, respectively when compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate a minimal average accuracy decrease of less than 0.29% during post-training deployment, with the maximum reduction staying less than 0.33%. The source code of this work is available on GitHub.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"31 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oluwole Jaiyeoba, Kevin Skadron
Graph processing frameworks suffer performance degradation from under-utilization of available memory bandwidth, because graph traversal often exhibits poor locality. A prior work, ACTS [24], accelerates graph processing with FPGAs and High Bandwidth Memory (HBM). ACTS achieves locality by partitioning vertex-update messages (based on destination vertex IDs) generated online after active edges have been processed. This work introduces Dynamic-ACTS, which builds on the ideas in ACTS to support dynamic graphs. The key innovation is the use of a hash table to find the edges to be updated. Compared to Gunrock, a GPU graph engine, Dynamic-ACTS achieves a geometric mean speedup of 1.5X, with a maximum speedup of 4.6X. Compared to GraphLily, an FPGA-HBM graph engine, Dynamic-ACTS achieves a geometric mean speedup of 3.6X, with a maximum speedup of 16.5X. Our results also show a geometric mean power reduction of 50% and a mean energy-delay-product reduction of 88% over Gunrock. Compared to GraSU, an FPGA graph-updating engine, Dynamic-ACTS achieves an average speedup of 15X.
{"title":"Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs","authors":"Oluwole Jaiyeoba, Kevin Skadron","doi":"10.1145/3662002","DOIUrl":"https://doi.org/10.1145/3662002","url":null,"abstract":"<p>Graph processing frameworks suffer performance degradation from under-utilization of available memory bandwidth, because graph traversal often exhibits poor locality. A prior work, ACTS [24], accelerates graph processing with FPGAs and High Bandwidth Memory (HBM). ACTS achieves locality by partitioning vertex-update messages (based on destination vertex IDs) generated online after active edges have been processed. This work introduces Dynamic-ACTS which builds on ideas in ACTS to support dynamic graphs. The key innovation is to use a hash table to find the edges to be updated. Compared to Gunrock, a GPU graph engine, Dynamic-ACTS achieves a geometric mean speedup of 1.5X, with a maximum speedup of 4.6X. Compared to GraphLily, an FPGA-HBM graph engine, Dynamic-ACTS achieves a geometric speedup of 3.6X, with a maximum speedup of 16.5X. Our results also showed a geometric mean power reduction of 50% and a mean reduction of energy-delay product of 88% over Gunrock. Compared to GraSU, an FPGA graph updating engine, Dynamic-ACTS achieves an average speedup of 15X.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"12 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Julian Haase, Najdet Charaf, Alexander Groß, Diana Göhringer
As runtime reconfiguration is used in an increasing number of hardware architectures, new simulation and modeling tools are needed to support the developer during the design phases. In this article, a language extension for SystemC is presented, together with a design methodology for the description and simulation of dynamically reconfigurable hardware at different levels of abstraction. The presented library offers a high degree of flexibility in the description of reconfiguration features and their management, while allowing the simulation of runtime reconfiguration and the removal and replacement of custom modules as well as third-party components throughout the architecture development process. In addition, our approach supports the emerging concepts of nested reconfiguration and split regions with minimal simulation overhead: at most three delta cycles for signal and transaction forwarding and four delta cycles for the reconfiguration process.
{"title":"NC-Library: Expanding SystemC Capabilities for Nested reConfigurable Hardware Modelling","authors":"Julian Haase, Najdet Charaf, Alexander Groß, Diana Göhringer","doi":"10.1145/3662001","DOIUrl":"https://doi.org/10.1145/3662001","url":null,"abstract":"<p>As runtime reconfiguration is used in an increasing number of hardware architectures, new simulation and modeling tools are needed to support the developer during the design phases. In this article, a language extension for SystemC is presented, together with a design methodology for the description and simulation of dynamically reconfigurable hardware at different levels of abstraction. The library presented offers a high degree of flexibility in the description of reconfiguration features and their management, while allowing runtime reconfiguration simulation, removal, and replacement of custom modules as well as third-party components throughout the architecture development process. In addition, our approach supports the emerging concept of nested reconfiguration and split regions with a minimal simulation overhead of a maximum of three delta cycles for signal and transaction forwarding, and four delta cycles for the reconfiguration process.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"157 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140798806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah
Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs), especially convolutional neural networks (CNNs). Recently, product quantization (PQ) has been applied to these workloads, replacing MACs with memory lookups of pre-computed dot products. To better understand the efficiency tradeoffs of product-quantized DNNs (PQ-DNNs), we create a custom hardware accelerator to parallelize and accelerate nearest-neighbor search and dot-product lookups. Additionally, we perform an empirical study to investigate the efficiency–accuracy tradeoffs of different PQ parameterizations and training methods. We identify PQ configurations that improve performance-per-area for ResNet20 by up to 3.1×, even when compared to a highly optimized conventional DNN accelerator, with similar improvements on two additional compact DNNs. Compared to recent PQ solutions, we outperform prior work by 4× in performance-per-area with a 0.6% accuracy degradation. Finally, we reduce the bitwidth of PQ operations to investigate the impact on both hardware efficiency and accuracy. With only 2- to 6-bit precision on three compact DNNs, we maintain DNN accuracy while eliminating the need for DSPs.
{"title":"PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration","authors":"Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah","doi":"10.1145/3656643","DOIUrl":"https://doi.org/10.1145/3656643","url":null,"abstract":"<p>Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs), especially convolutional neural networks (CNNs). Recently, product quantization (PQ) has been applied to these workloads, replacing MACs with memory lookups to pre-computed dot products. To better understand the efficiency tradeoffs of product-quantized DNNs (PQ-DNNs), we create a custom hardware accelerator to parallelize and accelerate nearest-neighbor search and dot-product lookups. Additionally, we perform an empirical study to investigate the efficiency–accuracy tradeoffs of different PQ parameterizations and training methods. We identify PQ configurations that improve performance-per-area for ResNet20 by up to 3.1 ×, even when compared to a highly optimized conventional DNN accelerator, with similar improvements on two additional compact DNNs. When comparing to recent PQ solutions, we outperform prior work by 4 × in terms of performance-per-area with a 0.6% accuracy degradation. Finally, we reduce the bitwidth of PQ operations to investigate the impact on both hardware efficiency and accuracy. With only 2–6-bit precision on three compact DNNs, we were able to maintain DNN accuracy eliminating the need for DSPs.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"4 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140617833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniel Hutchings, Adam Taylor, Jeffrey Goeders
Current IP encryption methods offered by FPGA vendors use an approach in which the IP is decrypted during the CAD flow and remains unencrypted in the bitstream. Given the ready availability of modern bitstream-to-netlist tools, encrypted IP is vulnerable to inspection and theft by the IP user. While the entire bitstream can be encrypted, this is done by the user and is not a mechanism to protect the confidentiality of third-party IP.
In this work, we present a design methodology, along with a proof-of-concept tool, that demonstrates how IP can remain partially encrypted through the CAD flow and into the bitstream. We show how this approach can support multiple encryption keys from different vendors and can be deployed using existing CAD tools and FPGA families. Our results document the benefits and costs of using such an approach to provide much greater protection for third-party IP.
{"title":"Toward FPGA Intellectual Property (IP) Encryption from Netlist to Bitstream","authors":"Daniel Hutchings, Adam Taylor, Jeffrey Goeders","doi":"10.1145/3656644","DOIUrl":"https://doi.org/10.1145/3656644","url":null,"abstract":"<p>Current IP encryption methods offered by FPGA vendors use an approach where the IP is decrypted during the CAD flow, and remains unencrypted in the bitstream. Given the ease of accessing modern bitstream-to-netlist tools, encrypted IP is vulnerable to inspection and theft from the IP user. While the entire bitstream can be encrypted, this is done by the user, and is not a mechanism to protect confidentiality of 3rd party IP. </p><p>In this work we present a design methodology, along with a proof-of-concept tool, that demonstrates how IP can remain partially encrypted through the CAD flow and into the bitstream. We show how this approach can support multiple encryption keys from different vendors, and can be deployed using existing CAD tools and FPGA families. Our results document the benefits and costs of using such an approach to provide much greater protection for 3rd party IP.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"247 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sichao Chen, Chang Cai, Su Zheng, Jiangnan Li, Guowei Zhu, Jingyuan Li, Yazhou Yan, Yuan Dai, Wenbo Yin, Lingli Wang
Coarse-grained reconfigurable arrays (CGRAs) are promising design choices in computation-intensive domains since they can strike a balance between energy efficiency and flexibility. A typical CGRA comprises processing elements (PEs) that execute the operations in applications and the interconnections between them. Nevertheless, most CGRAs are limited in their support for flexible architecture design and in their ability to solve large-scale mapping problems. To address these challenges, we introduce HierCGRA, a novel framework that integrates hierarchical CGRA modeling, Chisel-based Verilog generation, LLVM-based data flow graph (DFG) generation, DFG mapping, and design space exploration (DSE). With its graph homomorphism (GH) mapping algorithm, HierCGRA achieves faster mapping and a higher PE utilization rate compared with existing state-of-the-art CGRA frameworks. The proposed hierarchical mapping strategy achieves a 41× speedup on average compared with the ILP mapping algorithm in CGRA-ME. Furthermore, the automated DSE based on Bayesian optimization achieves a significant performance improvement by exploiting the heterogeneity of PEs and interconnections. With these features, HierCGRA enables agile development of large-scale CGRAs and accelerates the search for better CGRA architectures.
{"title":"HierCGRA: A Novel Framework for Large-Scale CGRA with Hierarchical Modeling and Automated Design Space Exploration","authors":"Sichao Chen, Chang Cai, Su Zheng, Jiangnan Li, Guowei Zhu, Jingyuan Li, Yazhou Yan, Yuan Dai, Wenbo Yin, Lingli Wang","doi":"10.1145/3656176","DOIUrl":"https://doi.org/10.1145/3656176","url":null,"abstract":"<p>Coarse-grained reconfigurable arrays (CGRAs) are promising design choices in computation-intensive domains since they can strike a balance between energy efficiency and flexibility. A typical CGRA comprises processing elements (PEs) that can execute operations in applications and interconnections between them. Nevertheless, most CGRAs suffer from the ineffectiveness of supporting flexible architecture design and solving large-scale mapping problems. To address these challenges, we introduce HierCGRA, a novel framework that integrates hierarchical CGRA modeling, Chisel-based Verilog generation, LLVM-based data flow graph (DFG) generation, DFG mapping, and design space exploration (DSE). With the graph homomorphism (GH) mapping algorithm, HierCGRA achieves a faster mapping speed and higher PE utilization rate compared with the existing state-of-the-art CGRA frameworks. The proposed hierarchical mapping strategy achieves 41× speedup on average compared with the ILP mapping algorithm in CGRA-ME. Furthermore, the automated DSE based on Bayesian optimization achieves a significant performance improvement by the heterogeneity of PEs and interconnections. With these features, HierCGRA enables the agile development for large-scale CGRA and accelerates the process of finding a better CGRA architecture.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"34 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Barry de Bruin, Kanishkan Vadivel, Mark Wijtvliet, Pekka Jääskeläinen, Henk Corporaal
Emerging data-driven applications in the embedded, e-Health, and internet of things (IoT) domains require complex on-device signal analysis and data reduction to maximize energy efficiency on these energy-constrained devices. Coarse-grained reconfigurable architectures (CGRAs) have been proposed as a good compromise between flexibility and energy efficiency for ultra-low-power (ULP) signal processing. Existing CGRAs are often specialized and domain-specific or can only accelerate simple kernels, which makes accelerating complete applications on a CGRA while maintaining high energy efficiency an open issue. Moreover, the lack of instruction set architecture (ISA) standardization across CGRAs makes code generation using current compiler technology a major challenge. This work introduces R-Blocks, a ULP CGRA with a HW/SW co-design tool flow based on the OpenASIP toolset. This CGRA is extremely flexible due to its well-established VLIW-SIMD execution model and support for flexible SIMD processing, while maintaining extremely high energy efficiency using software bypassing, optimized instruction delivery, and local scratchpad memories. R-Blocks is synthesized in a commercial 22-nm FD-SOI technology and achieves a full-system energy efficiency of 115 MOPS/mW on a common FFT benchmark, 1.45× higher than a highly tuned embedded RISC-V processor. Comparable energy efficiency is obtained on multiple complex workloads, making R-Blocks a promising acceleration target for general-purpose computing.
{"title":"R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA","authors":"Barry de Bruin, Kanishkan Vadivel, Mark Wijtvliet, Pekka Jääskeläinen, Henk Corporaal","doi":"10.1145/3656642","DOIUrl":"https://doi.org/10.1145/3656642","url":null,"abstract":"<p>Emerging data-driven applications in the embedded, e-Health, and internet of things (IoT) domain require complex on-device signal analysis and data reduction to maximize energy efficiency on these energy-constrained devices. Coarse-grained reconfigurable architectures (CGRAs) have been proposed as a good compromise between flexibility and energy efficiency for ultra-low power (ULP) signal processing. Existing CGRAs are often specialized and domain-specific or can only accelerate simple kernels, which makes accelerating complete applications on a CGRA while maintaining high energy efficiency an open issue. Moreover, the lack of instruction set architecture (ISA) standardization across CGRAs makes code generation using current compiler technology a major challenge. This work introduces R-Blocks; a ULP CGRA with HW/SW co-design tool-flow based on the OpenASIP toolset. This CGRA is extremely flexible due to its well-established VLIW-SIMD execution model and support for flexible SIMD-processing, while maintaining an extremely high energy efficiency using software bypassing, optimized instruction delivery, and local scratchpad memories. R-Blocks is synthesized in a commercial 22-nm FD-SOI technology and achieves a full-system energy efficiency of 115 MOPS/mW on a common FFT benchmark, 1.45 × higher than a highly tuned embedded RISC-V processor. Comparable energy efficiency is obtained on multiple complex workloads, making R-Blocks a promising acceleration target for general-purpose computing.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"191 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang
Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart.
To enable more productive implementations of LLMs on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate that our approach can achieve up to a 13.4× speedup compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2× speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9× speedup and a 5.7× improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.
{"title":"Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference","authors":"Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang","doi":"10.1145/3656177","DOIUrl":"https://doi.org/10.1145/3656177","url":null,"abstract":"<p>Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. </p><p>This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. </p><p>To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4 × speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2 × speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9 × speedup and a 5.7 × improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"36 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sajjad Tamimi, Arthur Bernhardt, Florian Stock, Ilia Petrov, Andreas Koch
This paper introduces DANSEN, the hardware accelerator component of neoDBMS, a full-stack computational storage system designed to manage on-device execution of database queries/transactions as Near-Data Processing (NDP) operations. The proposed system enables Database Management Systems (DBMS) to offload NDP operations to the storage device while maintaining control over data through a native storage interface. DANSEN provides an NDP engine that enables the DBMS to perform both low-level database tasks, such as database administration, and high-level tasks, such as executing SQL, on the smart storage device while observing the DBMS concurrency control. Furthermore, DANSEN enables the incorporation of custom accelerators as NDP operations, e.g., to perform hardware-accelerated ML inference directly on the stored data. We built the DANSEN storage prototype and interface on an UltraScale+ HBM FPGA and fully integrated it with PostgreSQL 12. Experimental results demonstrate that the proposed NDP approach outperforms software-only PostgreSQL using a fast off-the-shelf NVMe drive and significantly improves the end-to-end execution time of an aggregation operation (similar to Q6 from CH-benCHmark, 150 million records) by ≈10.6×. The versatility of the proposed approach is also validated by integrating a compute-intensive data analytics application with multi-row results, outperforming PostgreSQL by ≈1.5×.
{"title":"DANSEN: Database Acceleration on Native Computational Storage by Exploiting NDP","authors":"Sajjad Tamimi, Arthur Bernhardt, Florian Stock, Ilia Petrov, Andreas Koch","doi":"10.1145/3655625","DOIUrl":"https://doi.org/10.1145/3655625","url":null,"abstract":"<p>This paper introduces <sans-serif>DANSEN</sans-serif>, the hardware accelerator component for neoDBMS, a full-stack computational storage system designed to manage on-device execution of database queries/transactions as a Near-Data Processing (NDP)-operation. The proposed system enables Database Management Systems (DBMS) to offload NDP-operations to the storage while maintaining control over data through a <i>native storage interface</i>. <sans-serif>DANSEN</sans-serif> provides an NDP-engine that enables DBMS to perform both low-level database tasks, such as performing database administration, as well as high-level tasks like executing SQL, <i>on</i> the smart storage device while observing the DBMS concurrency control. Furthermore, <sans-serif>DANSEN</sans-serif> enables the incorporation of custom accelerators as an NDP-operation, e.g., to perform hardware-accelerated ML inference directly on the stored data. We built the <sans-serif>DANSEN</sans-serif> storage prototype and interface on an Ultrascale+HBM FPGA and fully integrated it with PostgreSQL 12. Experimental results demonstrate that the proposed NDP approach outperforms software-only PostgreSQL using a fast off-the-shelf NVMe drive, and significantly improves the end-to-end execution time of an aggregation operation (similar to Q6 from CH-benCHmark, 150 million records) by ≈ 10.6 ×. The versatility of the proposed approach is also validated by integrating a compute-intensive data analytics application with multi-row results, outperforming PostgreSQL by ≈ 1.5 ×.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"101 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140564207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}