
ACM Transactions on Reconfigurable Technology and Systems: Latest Publications

Codesign of reactor-oriented hardware and software for cyber-physical systems
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-06-12, DOI: 10.1145/3672083
E. Jellum, Martin Schoeberl, Edward A. Lee, Milica Orlandić
Modern cyber-physical systems often make use of heterogeneous systems-on-chip with reconfigurable logic to provide adequate computing power and flexible I/O. However, modeling, verifying, and implementing the computations spanning CPUs and reconfigurable logic is still challenging. The hardware and software components are often designed by different teams and at different levels of abstraction, making it hard to reason about the resulting computation. We propose to lift both hardware and software design to the same level of abstraction by using the Lingua Franca coordination language. Lingua Franca is based on a sparse synchronous model that allows modeling concurrency and timing while keeping a sequential model for the actual computation. We define hardware reactors as a subset of the reactor model of computation underlying Lingua Franca. We also present and evaluate reactor-chisel, a hardware runtime implementing the semantics of hardware reactors, and an extension to the Lingua Franca compiler enabling reactor-oriented hardware-software codesign.
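To make the reactor abstraction more concrete, here is a minimal Python sketch (a hypothetical illustration, not Lingua Franca or reactor-chisel): reactions are bound to triggers, fire in timestamp order from a sparse event queue, and may schedule future events, which is the sparse synchronous style the abstract refers to.

```python
import heapq

class Reactor:
    """Toy reactor: reactions fire on named triggers at discrete logical times."""
    def __init__(self, name):
        self.name = name
        self.reactions = {}          # trigger -> handler(scheduler, time, value)

    def on(self, trigger, handler):
        self.reactions[trigger] = handler

class Scheduler:
    """Sparse synchronous executive: events are (time, reactor, trigger, value)."""
    def __init__(self):
        self.queue = []
        self.counter = 0             # tie-breaker keeps event ordering deterministic

    def schedule(self, time, reactor, trigger, value=None):
        heapq.heappush(self.queue, (time, self.counter, reactor, trigger, value))
        self.counter += 1

    def run(self, until):
        while self.queue and self.queue[0][0] <= until:
            time, _, reactor, trigger, value = heapq.heappop(self.queue)
            handler = reactor.reactions.get(trigger)
            if handler:
                handler(self, time, value)   # reactions run sequentially per logical tag

# Example: a sensor reactor emits a sample every 10 time units; a filter reacts to it.
sensor = Reactor("sensor")
filt = Reactor("filter")

def sample(sched, t, _):
    sched.schedule(t, filt, "in", t * 2)          # logical connection sensor -> filter
    sched.schedule(t + 10, sensor, "tick")        # periodic re-trigger

filt.on("in", lambda sched, t, v: print(f"t={t}: filtered {v}"))
sensor.on("tick", sample)

sched = Scheduler()
sched.schedule(0, sensor, "tick")
sched.run(until=30)
```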
Citations: 0
Efficient SpMM Accelerator for Deep Learning: Sparkle and Its Automated Generator
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-06-07, DOI: 10.1145/3665896
Shiyao Xu, Jingfei Jiang, Jinwei Xu, Xifu Qian
Deep learning (DL) technology has made breakthroughs in a wide range of intelligent tasks such as vision, language, and recommendation systems. Sparse matrix multiplication (SpMM) is the key computation kernel of most sparse models. Conventional computing platforms such as CPUs, GPUs, and AI chips with regular processing units are unable to effectively support sparse computation due to their fixed structure and instruction sets. This work extends Sparkle, an accelerator architecture developed specifically for processing SpMM in DL. Some modifications are made to the balanced data-loading process to enhance the flexibility of the Sparkle architecture. Additionally, a Sparkle generator is proposed to accommodate diverse resource constraints and facilitate adaptable deployment. Leveraging Sparkle’s structural parameters and template-based design methods, the generator enables automatic Sparkle circuit generation under varying parameters. An instantiated Sparkle accelerator is implemented on the Xilinx xqvu11p FPGA platform with a specific configuration. Compared to the state-of-the-art SpMM accelerator SIGMA, the Sparkle accelerator instance improves sparse computing efficiency by about 10 to 20%. Furthermore, the Sparkle instance achieves 7.76× higher performance than the Nvidia Orin NX GPU. More accelerator instances with different parameters were evaluated, demonstrating that the Sparkle architecture can effectively accelerate SpMM.
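As a reference point for the kernel being accelerated, the sketch below shows a plain CSR-based SpMM in Python. It only illustrates the arithmetic that accelerators such as Sparkle parallelize, not the Sparkle architecture or its generator.

```python
def spmm_csr(values, col_idx, row_ptr, dense, n_cols_out):
    """Multiply a sparse matrix in CSR form by a dense matrix.

    values/col_idx/row_ptr encode the sparse matrix A (n_rows x k);
    dense is a k x n_cols_out list of lists. Returns A @ dense.
    """
    n_rows = len(row_ptr) - 1
    out = [[0.0] * n_cols_out for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):   # nonzeros of row i
            a = values[p]
            k = col_idx[p]
            for j in range(n_cols_out):
                out[i][j] += a * dense[k][j]
    return out

# A = [[1, 0, 2],
#      [0, 0, 3]]
values  = [1.0, 2.0, 3.0]
col_idx = [0, 2, 2]
row_ptr = [0, 2, 3]
dense   = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
print(spmm_csr(values, col_idx, row_ptr, dense, 2))   # [[3.0, 2.0], [3.0, 3.0]]
```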
Citations: 0
Turn on, Tune in, Listen up: Maximizing Side-Channel Recovery in Cross-Platform Time-to-Digital Converters
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-06-07, DOI: 10.1145/3666092
Colin Drewes, Tyler Sheaves, Olivia Weng, Keegan Ryan, Bill Hunter, Christopher McCarty, R. Kastner, Dustin Richmond
Voltage fluctuation sensors measure minute changes in an FPGA power distribution network, allowing attackers to extract information from concurrently executing computations. Previous voltage fluctuation sensors make assumptions about the co-tenant computation and require the attacker to have a priori access or system knowledge to tune the sensor parameters statically. Additionally, prior voltage fluctuation sensors make use of proprietary vendor intellectual property and do not provide guidance on sensor migration to other vendors. We present the open-source design of the Tunable Dual-Polarity Time-to-Digital Converter, which introduces three dynamically tunable parameters that optimize signal measurement, including the transition polarity, sample window, frequency, and phase. We show that a properly tuned sensor improves co-tenant classification accuracy by 2.5× over prior work and increases the ability to identify the co-tenant computation and its microarchitectural implementation. Across 13 varying applications, our techniques yield an 80% classification accuracy that generalizes beyond a single board. Our sensor improves the ability of a correlation power analysis attack to rank correct subkey values by 2×. As an extension to our prior work, we show that the voltage fluctuation sensor is portable to multiple FPGA vendors, and we demonstrate implementations on both Xilinx and Intel FPGA systems.
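The sensing mechanism can be illustrated with a small hypothetical Python sketch: a delay-line TDC latches a thermometer code whose length tracks how far a clock edge propagated, which in turn shifts with supply-voltage fluctuations. The decoding below is generic and is not the Tunable Dual-Polarity TDC design itself; the tunable parameters described above would control when and how such taps are latched in hardware.

```python
def tdc_decode(taps):
    """Decode a thermometer-coded TDC snapshot into a delay measurement.

    taps is the list of latched delay-line outputs (1s followed by 0s in the
    ideal case). The count of 1s tracks how far a clock edge propagated, which
    shifts with supply-voltage fluctuations on the shared power network.
    """
    # Counting all 1s is robust to isolated bubbles in the thermometer code.
    return sum(taps)

def collect_trace(snapshots):
    """Turn a sequence of TDC snapshots into a voltage-fluctuation trace."""
    return [tdc_decode(s) for s in snapshots]

# Example: the propagation depth dips while a co-tenant computation draws current.
snapshots = [
    [1, 1, 1, 1, 1, 0, 0, 0],   # idle
    [1, 1, 1, 0, 0, 0, 0, 0],   # voltage droop during co-tenant activity
    [1, 1, 1, 1, 0, 0, 0, 0],
]
print(collect_trace(snapshots))   # [5, 3, 4]
```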
Citations: 1
End-to-end codesign of Hessian-aware quantized neural networks for FPGAs
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-05-11, DOI: 10.1145/3662000
Javier Campos, Jovan Mitrevski, Nhan Tran, Zhen Dong, Amir Gholaminejad, Michael W. Mahoney, Javier Duarte

We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpiling NNs into FPGA firmware. This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow that can be deployed for real-time machine-learning applications in a wide range of scientific and industrial settings. We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the CERN Large Hadron Collider (LHC). Given the high collision rate, all data processing must be implemented on FPGA hardware within the strict area and latency requirements. Based on these constraints, we implement an optimized mixed-precision NN classifier for high-momentum particle jets in simulated LHC proton-proton collisions.
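As a rough illustration of how Hessian information guides mixed precision, the hypothetical sketch below assigns lower bit widths to layers with small Hessian-trace sensitivity under an average-bit budget. It only sketches the idea behind Hessian-aware quantization; the actual workflow uses the HAWQ, QONNX, and hls4ml tooling described above, and the greedy rule here is an assumption made for illustration.

```python
def allocate_bits(hessian_trace, params, avg_bits, choices=(2, 4, 6, 8)):
    """Greedy mixed-precision assignment guided by Hessian sensitivity.

    hessian_trace[i]: estimated Hessian trace of layer i (sensitivity proxy).
    params[i]:        parameter count of layer i.
    avg_bits:         target average bit width across all parameters.
    Returns a per-layer bit width drawn from `choices`.
    """
    n = len(params)
    bits = [max(choices)] * n                      # start at full precision
    budget = avg_bits * sum(params)

    # Repeatedly drop precision of the layer with the smallest sensitivity
    # per parameter until the total bit budget is met.
    while sum(b * p for b, p in zip(bits, params)) > budget:
        candidates = [
            (hessian_trace[i] / params[i], i)
            for i in range(n)
            if bits[i] > min(choices)
        ]
        if not candidates:
            break
        _, i = min(candidates)
        bits[i] = choices[choices.index(bits[i]) - 1]   # step down one level
    return bits

# Example: a sensitive first layer keeps 8 bits, the bulky middle layer drops to 2.
print(allocate_bits(hessian_trace=[50.0, 5.0, 20.0],
                    params=[1_000, 100_000, 10_000],
                    avg_bits=4))
```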

Citations: 0
DyRecMul: Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-05-01, DOI: 10.1145/3663480
Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J.M. Pierre Langlois

Multipliers are widely used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is to use approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation results, and accuracy evaluation of the proposed multiplier in INT8 precision. The paper also generalizes the proposed approximate multiplier idea to other datatypes, providing analysis and estimates of hardware cost and accuracy as a function of multiplier parameters. Implementation results on an AMD-Xilinx Kintex Ultrascale+ FPGA demonstrate remarkable savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulate configurations, respectively, when compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate a minimal average accuracy decrease of less than 0.29% during post-training deployment, with the maximum reduction staying below 0.33%. The source code of this work is available on GitHub.
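The general flavor of such approximation can be sketched behaviorally in Python: keep only a few bits starting at the leading one of each operand and multiply the truncated values. This is a generic illustration written for this summary, not the DyRecMul LUT-based scheme, and the kept_bits parameter is an assumption.

```python
def truncate_after_leading_one(x, kept_bits):
    """Keep `kept_bits` bits starting at the leading one of |x|; zero the rest."""
    if x == 0:
        return 0
    sign = -1 if x < 0 else 1
    mag = abs(x)
    msb = mag.bit_length() - 1
    drop = max(0, msb - (kept_bits - 1))     # low-order bits to discard
    return sign * ((mag >> drop) << drop)

def approx_mul_int8(a, b, kept_bits=4):
    """Approximate INT8 product using dynamically truncated operands."""
    return truncate_after_leading_one(a, kept_bits) * truncate_after_leading_one(b, kept_bits)

# Only bits well below the leading one of each operand are dropped,
# so the relative error is bounded by the truncation depth.
for a, b in [(117, -93), (37, 41), (-128, 127)]:
    exact, approx = a * b, approx_mul_int8(a, b)
    print(a, b, exact, approx, f"{abs(exact - approx) / abs(exact):.3%}")
```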

Citations: 0
Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-04-30, DOI: 10.1145/3662002
Oluwole Jaiyeoba, Kevin Skadron

Graph processing frameworks suffer performance degradation from under-utilization of available memory bandwidth, because graph traversal often exhibits poor locality. A prior work, ACTS [24], accelerates graph processing with FPGAs and High Bandwidth Memory (HBM). ACTS achieves locality by partitioning vertex-update messages (based on destination vertex IDs) generated online after active edges have been processed. This work introduces Dynamic-ACTS, which builds on ideas in ACTS to support dynamic graphs. The key innovation is to use a hash table to find the edges to be updated. Compared to Gunrock, a GPU graph engine, Dynamic-ACTS achieves a geometric mean speedup of 1.5X, with a maximum speedup of 4.6X. Compared to GraphLily, an FPGA-HBM graph engine, Dynamic-ACTS achieves a geometric mean speedup of 3.6X, with a maximum speedup of 16.5X. Our results also show a geometric mean power reduction of 50% and a mean reduction in energy-delay product of 88% over Gunrock. Compared to GraSU, an FPGA graph-updating engine, Dynamic-ACTS achieves an average speedup of 15X.
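The locality idea, binning online vertex-update messages by destination vertex ID while keeping edges in a hash table so the graph can change at runtime, can be sketched in Python as below. This is a hypothetical software analogue, not the ACTS/Dynamic-ACTS hardware pipeline.

```python
from collections import defaultdict

# Adjacency kept in a hash table so edges can be inserted or removed at runtime.
edges = defaultdict(set)           # src -> set of dst
def add_edge(src, dst):    edges[src].add(dst)
def remove_edge(src, dst): edges[src].discard(dst)

def push_updates(active, values, num_vertices, num_bins):
    """Generate vertex-update messages and partition them by destination ID."""
    bin_size = (num_vertices + num_bins - 1) // num_bins
    bins = [[] for _ in range(num_bins)]
    for src in active:
        for dst in edges[src]:                      # hash-table edge lookup
            bins[dst // bin_size].append((dst, values[src]))
    return bins

def apply_bins(bins, values):
    """Each bin only touches a contiguous vertex range, improving locality."""
    nxt = set()
    for b in bins:
        for dst, contribution in b:
            values[dst] += contribution
            nxt.add(dst)
    return nxt

add_edge(0, 2); add_edge(0, 5); add_edge(1, 5); add_edge(5, 7)
values = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
frontier = {0, 1}
frontier = apply_bins(push_updates(frontier, values, 8, num_bins=2), values)
print(values, frontier)
```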

Citations: 0
NC-Library: Expanding SystemC Capabilities for Nested reConfigurable Hardware Modelling
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-04-27, DOI: 10.1145/3662001
Julian Haase, Najdet Charaf, Alexander Groß, Diana Göhringer

As runtime reconfiguration is used in an increasing number of hardware architectures, new simulation and modeling tools are needed to support the developer during the design phases. In this article, a language extension for SystemC is presented, together with a design methodology for the description and simulation of dynamically reconfigurable hardware at different levels of abstraction. The presented library offers a high degree of flexibility in the description of reconfiguration features and their management, while allowing runtime reconfiguration simulation, removal, and replacement of custom modules as well as third-party components throughout the architecture development process. In addition, our approach supports the emerging concept of nested reconfiguration and split regions with minimal simulation overhead: at most three delta cycles for signal and transaction forwarding, and four delta cycles for the reconfiguration process.
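Conceptually, a reconfigurable region can be modeled as a placeholder whose contents are exchanged while the simulation runs. The Python sketch below illustrates that idea in a language-neutral way; it is not the NC-Library API, which extends SystemC in C++, and all names here are invented for illustration.

```python
class Module:
    """A static behavioural model: consumes an input sample, returns an output."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def process(self, x):
        return self.fn(x)

class ReconfigurableRegion:
    """Placeholder that forwards traffic to whichever module is currently loaded.

    Swapping the module mid-simulation mimics runtime partial reconfiguration;
    a nested region is simply a region whose loaded module contains further regions.
    """
    def __init__(self, name):
        self.name = name
        self.module = None
    def load(self, module):
        print(f"[reconfig] {self.name} <- {module.name}")
        self.module = module
    def process(self, x):
        if self.module is None:
            raise RuntimeError(f"{self.name}: no module loaded")
        return self.module.process(x)

region = ReconfigurableRegion("rr0")
region.load(Module("scale_by_2", lambda x: 2 * x))
print(region.process(21))                      # 42

region.load(Module("negate", lambda x: -x))    # runtime swap of the region contents
print(region.process(21))                      # -21
```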

Citations: 0
PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-04-18, DOI: 10.1145/3656643
Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah

Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs), especially convolutional neural networks (CNNs). Recently, product quantization (PQ) has been applied to these workloads, replacing MACs with memory lookups of pre-computed dot products. To better understand the efficiency tradeoffs of product-quantized DNNs (PQ-DNNs), we create a custom hardware accelerator to parallelize and accelerate nearest-neighbor search and dot-product lookups. Additionally, we perform an empirical study to investigate the efficiency-accuracy tradeoffs of different PQ parameterizations and training methods. We identify PQ configurations that improve performance-per-area for ResNet20 by up to 3.1×, even when compared to a highly optimized conventional DNN accelerator, with similar improvements on two additional compact DNNs. When comparing to recent PQ solutions, we outperform prior work by 4× in terms of performance-per-area with a 0.6% accuracy degradation. Finally, we reduce the bitwidth of PQ operations to investigate the impact on both hardware efficiency and accuracy. With only 2–6-bit precision on three compact DNNs, we were able to maintain DNN accuracy while eliminating the need for DSPs.
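To make the MAC-to-lookup substitution concrete, the sketch below shows generic product quantization of a dot product in Python: the input is split into subspaces, each subvector is encoded to its nearest codeword, and the result is assembled from precomputed per-codeword partial products. It illustrates the PQ technique in general, not the PQA accelerator, and the codebooks here are made up for the example.

```python
def nearest(codebook, sub):
    """Index of the codeword closest (squared L2) to the subvector."""
    return min(range(len(codebook)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(codebook[c], sub)))

def pq_dot(x, weights, codebooks):
    """Approximate dot(x, weights) with per-subspace table lookups.

    codebooks[s] is a list of codewords for subspace s; the weight slice for
    each subspace is folded into a lookup table offline, so the online work is
    one nearest-codeword search plus one table read per subspace -- no MACs.
    """
    d_sub = len(codebooks[0][0])
    total = 0.0
    for s, codebook in enumerate(codebooks):
        w_slice = weights[s * d_sub:(s + 1) * d_sub]
        # Offline: table[c] = dot(codeword_c, w_slice), stored in accelerator memory.
        table = [sum(cw * w for cw, w in zip(code, w_slice)) for code in codebook]
        # Online: encode the input subvector, then look up its partial product.
        x_slice = x[s * d_sub:(s + 1) * d_sub]
        total += table[nearest(codebook, x_slice)]
    return total

codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]],   # subspace 0
    [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],   # subspace 1
]
x = [0.9, 1.1, 0.4, 0.6]
w = [2.0, -1.0, 3.0, 1.0]
exact = sum(a * b for a, b in zip(x, w))
print(exact, pq_dot(x, w, codebooks))       # exact vs. PQ approximation
```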

Citations: 0
Toward FPGA Intellectual Property (IP) Encryption from Netlist to Bitstream
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-04-12, DOI: 10.1145/3656644
Daniel Hutchings, Adam Taylor, Jeffrey Goeders

Current IP encryption methods offered by FPGA vendors use an approach where the IP is decrypted during the CAD flow and remains unencrypted in the bitstream. Given the ease of accessing modern bitstream-to-netlist tools, encrypted IP is vulnerable to inspection and theft by the IP user. While the entire bitstream can be encrypted, this is done by the user and is not a mechanism to protect the confidentiality of third-party IP.

In this work, we present a design methodology, along with a proof-of-concept tool, that demonstrates how IP can remain partially encrypted through the CAD flow and into the bitstream. We show how this approach can support multiple encryption keys from different vendors and can be deployed using existing CAD tools and FPGA families. Our results document the benefits and costs of using such an approach to provide much greater protection for third-party IP.
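A toy Python sketch of the partial-encryption idea follows. The netlist format and cell fields are invented for illustration, the Fernet cipher (from the third-party cryptography package) merely stands in for whatever scheme a real flow would use, and nothing here reflects the proposed tool or any vendor CAD flow; the point is only that vendor-owned cells can carry per-vendor keys while user logic stays in the clear.

```python
from cryptography.fernet import Fernet   # pip install cryptography

# Invented toy "netlist": a list of cells, some owned by third-party IP vendors.
netlist = [
    {"name": "user_adder",  "owner": "user",    "init": "LUT:0110"},
    {"name": "ip_core_fft", "owner": "vendorA", "init": "LUT:1001"},
    {"name": "ip_crypto",   "owner": "vendorB", "init": "LUT:1110"},
]

# Each IP vendor holds its own key; the user never sees vendor plaintext.
keys = {"vendorA": Fernet.generate_key(), "vendorB": Fernet.generate_key()}

def protect(netlist, keys):
    """Encrypt only vendor-owned cells; user logic stays in the clear."""
    out = []
    for cell in netlist:
        cell = dict(cell)
        key = keys.get(cell["owner"])
        if key is not None:
            cell["init"] = Fernet(key).encrypt(cell["init"].encode()).decode()
            cell["encrypted"] = True
        out.append(cell)
    return out

def reveal(cell, key):
    """In a real flow, decryption would happen inside trusted configuration logic."""
    return Fernet(key).decrypt(cell["init"].encode()).decode()

protected = protect(netlist, keys)
print([c["name"] for c in protected if c.get("encrypted")])   # ['ip_core_fft', 'ip_crypto']
print(reveal(protected[1], keys["vendorA"]))                  # 'LUT:1001'
```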

Citations: 0
HierCGRA: A Novel Framework for Large-Scale CGRA with Hierarchical Modeling and Automated Design Space Exploration
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-04-08, DOI: 10.1145/3656176
Sichao Chen, Chang Cai, Su Zheng, Jiangnan Li, Guowei Zhu, Jingyuan Li, Yazhou Yan, Yuan Dai, Wenbo Yin, Lingli Wang

Coarse-grained reconfigurable arrays (CGRAs) are promising design choices in computation-intensive domains since they can strike a balance between energy efficiency and flexibility. A typical CGRA comprises processing elements (PEs) that can execute operations in applications and the interconnections between them. Nevertheless, most CGRA frameworks are ineffective at supporting flexible architecture design and at solving large-scale mapping problems. To address these challenges, we introduce HierCGRA, a novel framework that integrates hierarchical CGRA modeling, Chisel-based Verilog generation, LLVM-based data flow graph (DFG) generation, DFG mapping, and design space exploration (DSE). With the graph homomorphism (GH) mapping algorithm, HierCGRA achieves a faster mapping speed and a higher PE utilization rate compared with existing state-of-the-art CGRA frameworks. The proposed hierarchical mapping strategy achieves a 41× speedup on average compared with the ILP mapping algorithm in CGRA-ME. Furthermore, the automated DSE based on Bayesian optimization achieves a significant performance improvement by exploiting the heterogeneity of PEs and interconnections. With these features, HierCGRA enables agile development of large-scale CGRAs and accelerates the process of finding better CGRA architectures.
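The graph-homomorphism mapping constraint can be illustrated with a tiny backtracking mapper in Python: every DFG node must land on a PE that supports its operation, and every DFG edge must coincide with a physical interconnect. This hypothetical sketch ignores multi-hop routing and is far simpler than HierCGRA's hierarchical algorithm; the CGRA topology and operation sets are invented for the example.

```python
def map_dfg(dfg_edges, dfg_ops, pe_edges, pe_ops):
    """Backtracking search for a DFG-to-CGRA placement.

    Every DFG node gets a PE that supports its operation, and every DFG edge
    must land on an existing PE-to-PE interconnect (graph-homomorphism style
    constraint; no routing through intermediate hops in this toy version).
    """
    nodes = sorted(dfg_ops)
    pes = sorted(pe_ops)

    def ok(node, pe, assign):
        if dfg_ops[node] not in pe_ops[pe]:
            return False
        for a, b in dfg_edges:
            if a == node and b in assign and (pe, assign[b]) not in pe_edges:
                return False
            if b == node and a in assign and (assign[a], pe) not in pe_edges:
                return False
        return True

    def search(i, assign):
        if i == len(nodes):
            return dict(assign)
        node = nodes[i]
        for pe in pes:
            if pe not in assign.values() and ok(node, pe, assign):
                assign[node] = pe
                result = search(i + 1, assign)
                if result:
                    return result
                del assign[node]
        return None

    return search(0, {})

# 2x2 CGRA mesh with unidirectional east/south links; a 3-node multiply-add DFG.
pe_edges = {(0, 1), (0, 2), (1, 3), (2, 3)}
pe_ops = {0: {"add"}, 1: {"mul"}, 2: {"mul", "add"}, 3: {"add", "mul"}}
dfg_edges = [("a", "c"), ("b", "c")]
dfg_ops = {"a": "mul", "b": "mul", "c": "add"}
print(map_dfg(dfg_edges, dfg_ops, pe_edges, pe_ops))   # {'a': 1, 'b': 2, 'c': 3}
```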

Citations: 0