Cloud computing providers today offer access to a variety of devices, which users can rent and access remotely in a shared setting. Among these devices are SmartSSDs: solid-state drives (SSDs) augmented with an FPGA, which lets users instantiate custom circuits within the FPGA, including potentially malicious circuits for power and temperature measurement. Normally, cloud users have no remote access to power and temperature data, but with SmartSSDs they could abuse the FPGA component to instantiate circuits that learn this information. Additionally, custom power-waster circuits can be instantiated within the FPGA. This paper shows for the first time that, by leveraging ring oscillator sensors and power wasters, numerous covert channels in FPGA-enabled SmartSSDs can be used to transmit information. This work presents two channels in a single-tenant setting (the SmartSSD is used by one user at a time) and two channels in a multi-tenant setting (the FPGA and SSD inside the SmartSSD are shared by different users). The presented covert channels can reach close to 100% accuracy. Meanwhile, the bandwidth of the channels can easily be scaled by renting more SmartSSDs, as it is proportional to the number of SmartSSDs used.
{"title":"Covert-channels in FPGA-enabled SmartSSDs","authors":"Theodoros Trochatos, Anthony Etim, Jakub Szefer","doi":"10.1145/3635312","DOIUrl":"https://doi.org/10.1145/3635312","url":null,"abstract":"<p>Cloud computing providers today offer access to a variety of devices, which users can rent and access remotely in a shared setting. Among these devices are SmartSSDs, which a solid-state disks (SSD) augmented with an FPGA, enabling users to instantiate custom circuits within the FPGA, including potentially malicious circuits for power and temperature measurement. Normally, cloud users have no remote access to power and temperature data, but with SmartSSDs they could abuse the FPGA component to instantiate circuits to learn this information. Additionally, custom power waster circuits can be instantiated within the FPGA. This paper shows for the first time that by leveraging ring oscillator sensors and power wasters, numerous covert-channels in FPGA-enabled SmartSSDs could be used to transmit information. This work presents two channels in single-tenant setting (SmartSSD is used by one user at a time) and two channels in multi-tenant setting (FPGA and SSD inside SmartSSD is shared by different users). The presented covert channels can reach close to 100% accuracy. Meanwhile, bandwidth of the channels can be easily scaled by cloud users renting more SmartSSDs as the bandwidth of the covert channels is proportional to number of SmartSSD used.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"64 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stencil-based applications play an essential role in high-performance systems, as they occur in numerous computational areas such as partial differential equation solving. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional point grid, either a fixed number of times or until convergence. However, due to their iterative and intensive nature, ISLs are highly performance-hungry, demanding specialized solutions. Here, Field Programmable Gate Arrays (FPGAs) are a valid architectural choice, as they enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this paper introduces Senju, an automation framework for the design of highly parallel ISL accelerators targeting single-/multi-FPGA systems. Given an input description, Senju automates the entire design process and provides accurate performance estimations. The experimental evaluation shows remarkable and scalable results, outperforming single- and multi-FPGA approaches from the literature under different metrics. Finally, we present a new analysis of temporal and spatial parallelism trade-offs in a real-case scenario and discuss our performance through a single-FPGA and a novel, specialized multi-FPGA formulation of the Roofline Model.
{"title":"Across Time and Space: Senju’s Approach for Scaling Iterative Stencil Loop Accelerators on Single and Multiple FPGAs","authors":"Emanuele Del Sozzo, Davide Conficconi, Kentaro Sano","doi":"10.1145/3634920","DOIUrl":"https://doi.org/10.1145/3634920","url":null,"abstract":"<p>Stencil-based applications play an essential role in high-performance systems as they occur in numerous computational areas, such as partial differential equation solving. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional point grid multiple times or until convergence. However, due to their iterative and intensive nature, ISLs are highly performance-hungry, demanding specialized solutions. Here, Field Programmable Gate Arrays (FPGAs) represent a valid architectural choice as they enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this paper introduces <span>Senju</span>, an automation framework for the design of highly parallel ISL accelerators targeting single-/multi-FPGA systems. Given an input description, <span>Senju</span> automates the entire design process and provides accurate performance estimations. The experimental evaluation shows remarkable and scalable results, outperforming single- and multi-FPGA literature approaches under different metrics. Finally, we present a new analysis of temporal and spatial parallelism trade-offs in a real-case scenario and discuss our performance through a single- and novel specialized multi-FPGA formulation of the Roofline Model.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"42 12","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FPGAs have become increasingly popular in computing platforms. With recent advances in bitstream format reverse engineering, the scientific community has widely explored static FPGA security threats. For example, it is now possible to convert a bitstream to a netlist, revealing design information, and to apply modifications to the static bitstream based on this knowledge. However, a systematic study of how an understanding of the bitstream format affects the security of the dynamic configuration process, particularly for Xilinx’s Internal Configuration Access Port (ICAP), is lacking. This paper fills this gap by comprehensively analyzing the security implications of ICAP interfaces, which primarily support dynamic partial reconfiguration. We delve into the Xilinx bitstream file format, identify misconceptions in the official documentation, and propose novel configuration (attack) primitives based on dynamic reconfiguration, i.e., primitives to create/read/update/delete circuits in the FPGA without requiring them to be pre-defined during the design phase. Our primitives are consolidated in a novel Stealthy Reconfigurable Adaptive Trojan (STRAT) framework to conceal Trojans and evade state-of-the-art netlist reverse engineering methods. As FPGAs become integral to modern cloud computing, this research presents crucial insights into potential security risks, including the possibility of a malicious tenant or provider altering or spying on another tenant’s configuration undetected.
{"title":"On the Malicious Potential of Xilinx’ Internal Configuration Access Port (ICAP)","authors":"Nils Albartus, Maik Ender, Jan-Niklas Möller, Marc Fyrbiak, Christof Paar, Russell Tessier","doi":"10.1145/3633204","DOIUrl":"https://doi.org/10.1145/3633204","url":null,"abstract":"<p>FPGAs have become increasingly popular in computing platforms. With recent advances in bitstream format reverse engineering, the scientific community has widely explored static FPGA security threats. For example, it is now possible to convert a bitstream to a netlist, revealing design information, and apply modifications to the static bitstream based on this knowledge. However, a systematic study of the influence of the bitstream format understanding in regards to the security aspects of the dynamic configuration process, particularly for Xilinx’s Internal Configuration Access Port (ICAP), is lacking. This paper fills this gap by comprehensively analyzing the security implications of ICAP interfaces, which primarily support dynamic partial reconfiguration. We delve into the Xilinx bitstream file format, identify misconceptions in official documentation, and propose novel configuration (attack) primitives based on dynamic reconfiguration, i.e., create/read/update/delete circuits in the FPGA, without requiring pre-definition during the design phase. Our primitives are consolidated in a novel <i>Stealthy Reconfigurable Adaptive Trojan</i> (<monospace>STRAT</monospace>) framework to conceal Trojans and evade state-of-the-art netlist reverse engineering methods. As FPGAs become integral to modern cloud computing, this research presents crucial insights on potential security risks, including the possibility of a malicious tenant or provider altering or spying on another tenant’s configuration undetected.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"83 4","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138504977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binary neural networks (BNNs), where both the weights and the activation values are represented with one bit, provide an attractive alternative for deploying highly efficient deep learning inference on resource-constrained edge devices. However, our investigation reveals that, to achieve satisfactory accuracy gains, state-of-the-art (SOTA) BNNs, such as FracBNN and ReActNet, usually have to incorporate various auxiliary floating-point components and increase the model size, which in turn degrades hardware performance efficiency. In this paper, we aim to quantify such hardware inefficiency in SOTA BNNs and further mitigate it with negligible accuracy loss. First, we observe that the auxiliary floating-point (AFP) components consume an average of 93% of DSPs, 46% of LUTs, and 62% of FFs in the entire BNN accelerator's resource utilization. To mitigate this overhead, we propose a novel algorithm-hardware co-design, called FuseBNN, to fuse those AFP operators without hurting accuracy. On average, FuseBNN reduces AFP resource utilization to 59% of DSPs, 13% of LUTs, and 16% of FFs. Second, SOTA BNNs often use the compact MobileNetV1 as the backbone network but have to replace the lightweight 3×3 depth-wise convolution (DWC) with the 3×3 standard convolution (SC, e.g., in ReActNet and our ReActNet-adapted BaseBNN) or even more complex fractional 3×3 SC (e.g., in FracBNN) to bridge the accuracy gap. As a result, the model parameter size is significantly increased, becoming 2.25× larger than that of 4-bit direct quantization with the original DWC (4-Bit-Net); the number of multiply-accumulate operations also increases significantly, so the overall LUT usage of BaseBNN is almost the same as that of 4-Bit-Net. To address this issue, we propose HyBNN, where we binarize depth-wise separable convolution (DSC) blocks for the first time to decrease the model size and incorporate 4-bit DSC blocks to compensate for the accuracy loss. For the ship detection task in synthetic aperture radar imagery on the AMD-Xilinx ZCU102 FPGA, HyBNN achieves a detection accuracy of 94.8% and a detection speed of 615 frames per second (FPS), which is 6.8× faster than FuseBNN+ (94.9% accuracy) and 2.7× faster than 4-Bit-Net (95.9% accuracy). For image classification on the CIFAR-10 dataset on the AMD-Xilinx Ultra96-V2 FPGA, HyBNN achieves a 1.5× speedup and 0.7% better accuracy over SOTA FracBNN.
{"title":"HyBNN: Quantifying and Optimizing Hardware Efficiency of Binary Neural Networks","authors":"Geng Yang, Jie Lei, Zhenman Fang, Yunsong li, Jiaqing Zhang, Weiying Xie","doi":"10.1145/3631610","DOIUrl":"https://doi.org/10.1145/3631610","url":null,"abstract":"Binary neural network (BNN), where both the weight and the activation values are represented with one bit, provides an attractive alternative to deploy highly efficient deep learning inference on resource-constrained edge devices. However, our investigation reveals that, to achieve satisfactory accuracy gains, state-of-the-art (SOTA) BNNs, such as FracBNN and ReActNet, usually have to incorporate various auxiliary floating-point components and increase the model size, which in turn degrades the hardware performance efficiency. In this paper, we aim to quantify such hardware inefficiency in SOTA BNNs and further mitigate it with negligible accuracy loss. First, we observe that the auxiliary floating-point (AFP) components consume an average of 93% DSPs, 46% LUTs, and 62% FFs, among the entire BNN accelerator resource utilization. To mitigate such overhead, we propose a novel algorithm-hardware co-design, called FuseBNN , to fuse those AFP operators without hurting the accuracy. On average, FuseBNN reduces AFP resource utilization to 59% DSPs, 13% LUTs, and 16% FFs. Second, SOTA BNNs often use the compact MobileNetV1 as the backbone network but have to replace the lightweight 3 × 3 depth-wise convolution (DWC) with the 3 × 3 standard convolution (SC, e.g., in ReActNet and our ReActNet-adapted BaseBNN) or even more complex fractional 3 × 3 SC (e.g., in FracBNN) to bridge the accuracy gap. As a result, the model parameter size is significantly increased and becomes 2.25 × larger than that of the 4-bit direct quantization with the original DWC (4-Bit-Net); the number of multiply-accumulate operations is also significantly increased so that the overall LUT resource usage of BaseBNN is almost the same as that of 4-Bit-Net. To address this issue, we propose HyBNN , where we binarize depth-wise separation convolution (DSC) blocks for the first time to decrease the model size and incorporate 4-bit DSC blocks to compensate for the accuracy loss. For the ship detection task in synthetic aperture radar imagery on the AMD-Xilinx ZCU102 FPGA, HyBNN achieves a detection accuracy of 94.8% and a detection speed of 615 frames per second (FPS), which is 6.8 × faster than FuseBNN+ (94.9% accuracy) and 2.7 × faster than 4-Bit-Net (95.9% accuracy). For image classification on the CIFAR-10 dataset on the AMD-Xilinx Ultra96-V2 FPGA, HyBNN achieves 1.5 × speedup and 0.7% better accuracy over SOTA FracBNN.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"279 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135475095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents Eciton, a very low-power recurrent neural network accelerator for time-series data within low-power edge sensor nodes, achieving real-time inference with a power consumption of 17 mW under load. Eciton reduces memory and chip resource requirements via 8-bit quantization and hard sigmoid activation, allowing the accelerator, as well as the recurrent neural network model parameters, to fit in a low-cost, low-power Lattice iCE40 UP5K FPGA. We evaluate Eciton on multiple established time-series classification applications, including predictive maintenance of mechanical systems, sound classification, and intrusion detection for IoT nodes. Binary and multi-class classification edge models are explored, demonstrating that Eciton can adapt to a variety of deployable environments and remote use cases. Eciton demonstrates real-time processing at very low power consumption with minimal loss of accuracy across multiple inference scenarios with differing characteristics, while achieving competitive power efficiency against the state of the art of similar scale. We show that the addition of this accelerator actually reduces the power budget of the sensor node by cutting power-hungry wireless transmission. The resulting power budget of the sensor node is small enough for it to be powered by a power harvester, potentially allowing it to run indefinitely without a battery or periodic maintenance.
{"title":"Eciton: Very Low-Power Recurrent Neural Network Accelerator for Real-Time Inference at the Edge","authors":"Jeffrey Chen, Sang-Woo Jun, Sehwan Hong, Warrick He, Jinyeong Moon","doi":"10.1145/3629979","DOIUrl":"https://doi.org/10.1145/3629979","url":null,"abstract":"This paper presents Eciton, a very low-power recurrent neural network accelerator for time series data within low-power edge sensor nodes, achieving real-time inference with a power consumption of 17 mW under load. Eciton reduces memory and chip resource requirements via 8-bit quantization and hard sigmoid activation, allowing the accelerator as well as the recurrent neural network model parameters to fit in a low-cost, low-power Lattice iCE40 UP5K FPGA. We evaluate Eciton on multiple, established time-series classification applications including predictive maintenance of mechanical systems, sound classification, and intrusion detection for IoT nodes. Binary and multi-class classification edge models are explored, demonstrating that Eciton can adapt to a variety of deployable environments and remote use cases. Eciton demonstrates real-time processing at a very low power consumption with minimal loss of accuracy on multiple inference scenarios with differing characteristics, while achieving competitive power efficiency against the state-of-the-art of similar scale. We show that the addition of this accelerator actually reduces the power budget of the sensor node by reducing power-hungry wireless transmission. The resulting power budget of the sensor node is small enough to be powered by a power harvester, potentially allowing it to run indefinitely without a battery or periodic maintenance.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"174 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135371182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning models are becoming more complex and heterogeneous, with new layer types introduced to improve their accuracy. This brings a considerable challenge to the designers of deep neural network accelerators. Several architectures and design flows exist to map deep learning models onto hardware, but they are limited to particular models and/or layer types. Also, the architectures generated by these tools generally target high-performance devices and are not appropriate for embedded computing. This paper proposes a multi-engine architecture and a design flow to implement deep learning models on FPGA. The hardware design uses high-level synthesis to allow design space exploration. The architecture is scalable and therefore applicable to FPGAs of any density. The architecture and design flow were applied to the development of a hardware/software system for image classification with ResNet50, object detection with YOLOv3-Tiny, and image segmentation with DeepLabv3+. The system was tested on a low-density Zynq UltraScale+ ZU3EG FPGA to show its scalability. The results show that the proposed multi-engine architecture generates efficient accelerators: the ResNet50 accelerator with 4-bit quantization achieves 67 FPS, the YOLOv3-Tiny object detector achieves a throughput of 36 FPS, and the image segmentation application achieves 1.4 FPS.
{"title":"Designing Deep Learning Models on FPGA with Multiple Heterogeneous Engines","authors":"Miguel Reis, Mário Véstias, Horácio Neto","doi":"10.1145/3615870","DOIUrl":"https://doi.org/10.1145/3615870","url":null,"abstract":"Deep learning models are becoming more complex and heterogeneous with new layer types to improve their accuracy. This brings a considerable challenge to the designers of accelerators of deep neural networks. There have been several architectures and design flows to map deep learning models on hardware, but they are limited to a particular model and/or layer types. Also, the architectures generated by these tools target, in general, high-performance devices, not appropriate for embedded computing. This paper proposes a multi-engine architecture and a design flow to implement deep learning models on FPGA. The hardware design uses high-level synthesis to allow design space exploration. The architecture is scalable and therefore applicable to any density FPGAs. The architecture and design flow were applied to the development of a hardware/software system for image classification with ResNet50, object detection with YOLOv3-Tiny and image segmentation with Deeplabv3+. The system was tested in a low-density Zynq UltraScale+ ZU3EG FPGA to show its scalability. The results show that the proposed multi-engine architecture generates efficient accelerators. An accelerator of ResNet50 with a 4-bit quantization achieves 67 FPS, and the object detector with YOLOv3-Tiny with a throughput of 36 FPS and the image segmentation application achieves 1.4 FPS.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136353080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators. While it is well understood how to do this for, e.g., network function virtualisation and protocols such as TCP/IP, support for higher networking layers is still largely missing, limiting the potential of accelerators. In this paper, we present Strega, an open-source, lightweight HTTP server that enables crucial functionality such as FPGA-accelerated functions being called through a RESTful protocol (FPGA-as-a-Function). Our experimental analysis shows that a single Strega node sustains a throughput of 1.7M HTTP requests per second with an end-to-end latency as low as 16 μs, outperforming nginx running on 32 vCPUs in both metrics, and can even be an alternative to the traditional OpenCL flow over the PCIe bus. Through this work, we pave the way for running microservices directly on FPGAs, bypassing CPU overhead and realising the full potential of FPGA acceleration in distributed cloud applications.
{"title":"<scp>Strega</scp> : An HTTP Server for FPGAs","authors":"Fabio Maschi, Gustavo Alonso","doi":"10.1145/3611312","DOIUrl":"https://doi.org/10.1145/3611312","url":null,"abstract":"The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators. While it is well understood how to do this for, e.g., network function virtualisation and protocols such as TCP/IP, support for higher networking layers is still largely missing, limiting the potential of accelerators. In this paper, we present S trega , an open-source 1 light-weight HTTP server that enables crucial functionality such as FPGA-accelerated functions being called through a RESTful protocol (FPGA-as-a-Function). Our experimental analysis shows that a single S trega node sustains a throughput of 1.7 M HTTP requests per second with an end-to-end latency as low as 16 μ s, outperforming nginx running on 32 vCPUs in both metrics, and can even be an alternative to the traditional OpenCL flow over the PCIe bus. Through this work, we pave the way for running microservices directly on FPGAs, bypassing CPU overhead and realising the full potential of FPGA acceleration in distributed cloud applications.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136295278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the use of Artificial Intelligence (AI) techniques spreads across the economy, researchers are exploring new ways to reduce the energy consumption of Neural Network (NN) applications, especially as the complexity of NNs continues to increase. Using analog Resistive RAM (ReRAM) devices to compute Matrix-Vector Multiplication (MVM) in O(1) time complexity is a promising approach, but such implementations often fail to cover the diversity of nonlinearities required for modern NN applications. In this work, we propose a novel approach in which the ReRAMs themselves can be reprogrammed to compute not only the required matrix multiplications but also the activation functions, softmax, and pooling layers, reducing energy in complex NNs. This approach offers more versatility for researching novel NN layouts compared to custom logic. Results show that our device outperforms analog and digital field-programmable approaches by up to 8.5× in experiments on real-world human activity recognition and language modeling datasets with Convolutional Neural Networks (CNNs), Generative Pre-trained Transformer (GPT), and Long Short-Term Memory (LSTM) models.
{"title":"Reprogrammable non-linear circuits using ReRAM for NN accelerators","authors":"Rafael Fão de Moura, Luigi Carro","doi":"10.1145/3617894","DOIUrl":"https://doi.org/10.1145/3617894","url":null,"abstract":"As the massive usage of Artificial Intelligence (AI) techniques spreads in the economy, researchers are exploring new techniques to reduce the energy consumption of Neural Network (NN) applications, especially as the complexity of NNs continues to increase. Using analog Resistive RAM (ReRAM) devices to compute Matrix-Vector Multiplication (MVM) in O (1) time complexity is a promising approach, but it’s true that these implementations often fail to cover the diversity of nonlinearities required for modern NN applications. In this work, we propose a novel approach where ReRAMs themselves can be reprogrammed to compute not only the required matrix multiplications, but also the activation functions, softmax, and pooling layers, reducing energy in complex NNs. This approach offers more versatility for researching novel NN layouts compared to custom logic. Results show that our device outperforms analog and digital Field Programmable approaches by up to 8.5x in experiments on real-world human activity recognition and language modeling datasets with Convolutional Neural Networks (CNNs), Generative Pre-trained Transformer (GPT), and Long Short-Term Memory (LSTM) models.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136295688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coarse-grained reconfigurable architectures (CGRAs) have emerged as promising accelerators due to their high flexibility and energy efficiency. However, existing open-source works often lack integration of CGRAs with CPU systems and the corresponding toolchains. Moreover, support for accelerator instruction pipelining, which overlaps data communication, computation, and configuration across multiple tasks, is rare. In this paper, we propose FDRA, an open-source exploration framework for a heterogeneous system-on-chip (SoC) with a RISC-V processor and a dynamically reconfigurable accelerator (DRA) supporting loop-, instruction-, and task-level parallelism. FDRA encompasses parameterized SoC modeling, Verilog generation, source-to-source application code transformation using frontend and DRA compilers, SoC simulation, and FPGA prototyping. FDRA incorporates the extraction of periodic accumulative operators and multidimensional linear load/store operators from nested loops. The DRA can access the shared L2 cache with virtual addresses and supports direct memory access (DMA) with arbitrary start addresses and data lengths. Integrated into the RISC-V Rocket SoC, our DRA achieves a remarkable 55× acceleration for loop kernels and improves energy efficiency by 29×. Compared to state-of-the-art RISC-V vector units, our DRA demonstrates a 2.9× speed improvement and 3.5× greater energy efficiency. In contrast to previous CGRA+RISC-V SoCs, our SoC achieves a minimum speedup of 5.2×.
{"title":"FDRA: A Framework for Dynamically Reconfigurable Accelerator Supporting Multi-Level Parallelism","authors":"Yunhui Qiu, Yiqing Mao, Xuchen Gao, Sichao Chen, Jiangnan Li, Wenbo Yin, Lingli Wang","doi":"10.1145/3614224","DOIUrl":"https://doi.org/10.1145/3614224","url":null,"abstract":"Coarse-grained reconfigurable architectures (CGRAs) have emerged as promising accelerators due to their high flexibility and energy efficiency. However, existing open-source works often lack integration of CGRAs with CPU systems and corresponding toolchains. Moreover, there is rare support for the accelerator instruction pipelining to overlap data communication, computation, and configuration across multiple tasks. In this paper, we propose FDRA, an open-source exploration framework for a heterogeneous system-on-chip (SoC) with a RISC-V processor and a dynamically reconfigurable accelerator (DRA) supporting loop, instruction, and task levels of parallelism. FDRA encompasses parameterized SoC modeling, Verilog generation, source-to-source application code transformation using frontend and DRA compilers, SoC simulation, and FPGA prototyping. FDRA incorporates the extraction of periodic accumulative operators and multidimensional linear load/store operators from nested loops. The DRA enables accessing the shared L2 cache with virtual addresses and supports direct memory access (DMA) with arbitrary start addresses and data lengths. Integrated into the RISC-V Rocket SoC, our DRA achieves a remarkable 55 × acceleration for loop kernels and improves energy efficiency by 29 ×. Compared to state-of-the-art RISC-V vector units, our DRA demonstrates a 2.9 × speed improvement and 3.5 × greater energy efficiency. In contrast to previous CGRA+RISC-V SoCs, our SoC achieves a minimum speedup of 5.2 ×.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136295683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Numerous approximate computing (AC) techniques have been developed to reduce design costs in error-resilient application domains, such as signal and multimedia processing, data mining, machine learning, and computer vision, by trading off computation accuracy for area and power savings or performance improvements. Selecting adequate techniques for each application and optimization target is complex but crucial for high-quality results. In this context, Approximate High-Level Synthesis (AHLS) tools have been proposed to alleviate the burden of hand-crafting approximate circuits by automating the exploitation of AC techniques. However, such tools are typically tied to a specific approximation technique or a difficult-to-extend set of techniques whose exploitation is not fully automated or steered by optimization targets. Therefore, available AHLS tools overlook the benefits of expanding the design space by mixing diverse approximation techniques to meet specific design objectives with minimum error. In this work, we propose an AHLS design methodology for FPGAs that automatically identifies efficient combinations of multiple approximation techniques for different applications and design constraints. Compared to single-technique approaches, decreases of up to 30% in mean squared error and absolute increases of up to 6.5% in percentage accuracy were obtained for a set of image, video, signal processing, and machine learning benchmarks.
{"title":"Constraint-Aware Multi-Technique Approximate High-Level Synthesis for FPGAs","authors":"Marcos T. Leipnitz, Gabriel L. Nazar","doi":"10.1145/3624481","DOIUrl":"https://doi.org/10.1145/3624481","url":null,"abstract":"Numerous approximate computing (AC) techniques have been developed to reduce the design costs in error-resilient application domains, such as signal and multimedia processing, data mining, machine learning, and computer vision, to trade-off computation accuracy with area and power savings or performance improvements. Selecting adequate techniques for each application and optimization target is complex but crucial for high-quality results. In this context, Approximate High-Level Synthesis (AHLS) tools have been proposed to alleviate the burden of hand-crafting approximate circuits by automating the exploitation of AC techniques. However, such tools are typically tied to a specific approximation technique or a difficult-to-extend set of techniques whose exploitation is not fully automated or steered by optimization targets. Therefore, available AHLS tools overlook the benefits of expanding the design space by mixing diverse approximation techniques toward meeting specific design objectives with minimum error. In this work, we propose an AHLS design methodology for FPGAs that automatically identifies efficient combinations of multiple approximation techniques for different applications and design constraints. Compared to single-technique approaches, decreases of up to 30% in mean squared error and absolute increases of up to 6.5% in percentage accuracy were obtained for a set of image, video, signal processing and machine learning benchmarks.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135043608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}