Cloud computing providers today offer access to a variety of devices, which users can rent and access remotely in a shared setting. Among these devices are SmartSSDs: solid-state drives (SSDs) augmented with an FPGA, which lets users instantiate custom circuits within the FPGA, including potentially malicious circuits for power and temperature measurement. Normally, cloud users have no remote access to power and temperature data, but with SmartSSDs they could abuse the FPGA component to instantiate circuits that learn this information. Additionally, custom power-waster circuits can be instantiated within the FPGA. This paper shows for the first time that, by leveraging ring oscillator sensors and power wasters, numerous covert channels in FPGA-enabled SmartSSDs can be used to transmit information. This work presents two channels in a single-tenant setting (the SmartSSD is used by one user at a time) and two channels in a multi-tenant setting (the FPGA and SSD inside the SmartSSD are shared by different users). The presented covert channels can reach close to 100% accuracy. Meanwhile, the bandwidth of the channels can easily be scaled by renting more SmartSSDs, as it is proportional to the number of SmartSSDs used.
{"title":"Covert-channels in FPGA-enabled SmartSSDs","authors":"Theodoros Trochatos, Anthony Etim, Jakub Szefer","doi":"10.1145/3635312","DOIUrl":"https://doi.org/10.1145/3635312","url":null,"abstract":"<p>Cloud computing providers today offer access to a variety of devices, which users can rent and access remotely in a shared setting. Among these devices are SmartSSDs, which a solid-state disks (SSD) augmented with an FPGA, enabling users to instantiate custom circuits within the FPGA, including potentially malicious circuits for power and temperature measurement. Normally, cloud users have no remote access to power and temperature data, but with SmartSSDs they could abuse the FPGA component to instantiate circuits to learn this information. Additionally, custom power waster circuits can be instantiated within the FPGA. This paper shows for the first time that by leveraging ring oscillator sensors and power wasters, numerous covert-channels in FPGA-enabled SmartSSDs could be used to transmit information. This work presents two channels in single-tenant setting (SmartSSD is used by one user at a time) and two channels in multi-tenant setting (FPGA and SSD inside SmartSSD is shared by different users). The presented covert channels can reach close to 100% accuracy. Meanwhile, bandwidth of the channels can be easily scaled by cloud users renting more SmartSSDs as the bandwidth of the covert channels is proportional to number of SmartSSD used.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"64 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stencil-based applications play an essential role in high-performance systems, as they occur in numerous computational areas such as partial differential equation solving. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional point grid, either a fixed number of times or until convergence. However, due to their iterative and intensive nature, ISLs are highly performance-hungry, demanding specialized solutions. Here, Field Programmable Gate Arrays (FPGAs) are a valid architectural choice, as they enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this paper introduces Senju, an automation framework for the design of highly parallel ISL accelerators targeting single-/multi-FPGA systems. Given an input description, Senju automates the entire design process and provides accurate performance estimations. The experimental evaluation shows remarkable and scalable results, outperforming single- and multi-FPGA approaches from the literature under different metrics. Finally, we present a new analysis of temporal and spatial parallelism trade-offs in a real-case scenario and discuss our performance through a single-FPGA and a novel, specialized multi-FPGA formulation of the Roofline Model.
{"title":"Across Time and Space: Senju’s Approach for Scaling Iterative Stencil Loop Accelerators on Single and Multiple FPGAs","authors":"Emanuele Del Sozzo, Davide Conficconi, Kentaro Sano","doi":"10.1145/3634920","DOIUrl":"https://doi.org/10.1145/3634920","url":null,"abstract":"<p>Stencil-based applications play an essential role in high-performance systems as they occur in numerous computational areas, such as partial differential equation solving. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional point grid multiple times or until convergence. However, due to their iterative and intensive nature, ISLs are highly performance-hungry, demanding specialized solutions. Here, Field Programmable Gate Arrays (FPGAs) represent a valid architectural choice as they enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this paper introduces <span>Senju</span>, an automation framework for the design of highly parallel ISL accelerators targeting single-/multi-FPGA systems. Given an input description, <span>Senju</span> automates the entire design process and provides accurate performance estimations. The experimental evaluation shows remarkable and scalable results, outperforming single- and multi-FPGA literature approaches under different metrics. Finally, we present a new analysis of temporal and spatial parallelism trade-offs in a real-case scenario and discuss our performance through a single- and novel specialized multi-FPGA formulation of the Roofline Model.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"42 12","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FPGAs have become increasingly popular in computing platforms. With recent advances in bitstream format reverse engineering, the scientific community has widely explored static FPGA security threats. For example, it is now possible to convert a bitstream to a netlist, revealing design information, and to apply modifications to the static bitstream based on this knowledge. However, a systematic study of how an understanding of the bitstream format affects the security of the dynamic configuration process, particularly for Xilinx’s Internal Configuration Access Port (ICAP), is lacking. This paper fills this gap by comprehensively analyzing the security implications of ICAP interfaces, which primarily support dynamic partial reconfiguration. We delve into the Xilinx bitstream file format, identify misconceptions in the official documentation, and propose novel configuration (attack) primitives based on dynamic reconfiguration, i.e., primitives to create/read/update/delete circuits in the FPGA without requiring them to be pre-defined during the design phase. Our primitives are consolidated in a novel Stealthy Reconfigurable Adaptive Trojan (STRAT) framework to conceal Trojans and evade state-of-the-art netlist reverse engineering methods. As FPGAs become integral to modern cloud computing, this research presents crucial insights into potential security risks, including the possibility of a malicious tenant or provider altering or spying on another tenant’s configuration undetected.
{"title":"On the Malicious Potential of Xilinx’ Internal Configuration Access Port (ICAP)","authors":"Nils Albartus, Maik Ender, Jan-Niklas Möller, Marc Fyrbiak, Christof Paar, Russell Tessier","doi":"10.1145/3633204","DOIUrl":"https://doi.org/10.1145/3633204","url":null,"abstract":"<p>FPGAs have become increasingly popular in computing platforms. With recent advances in bitstream format reverse engineering, the scientific community has widely explored static FPGA security threats. For example, it is now possible to convert a bitstream to a netlist, revealing design information, and apply modifications to the static bitstream based on this knowledge. However, a systematic study of the influence of the bitstream format understanding in regards to the security aspects of the dynamic configuration process, particularly for Xilinx’s Internal Configuration Access Port (ICAP), is lacking. This paper fills this gap by comprehensively analyzing the security implications of ICAP interfaces, which primarily support dynamic partial reconfiguration. We delve into the Xilinx bitstream file format, identify misconceptions in official documentation, and propose novel configuration (attack) primitives based on dynamic reconfiguration, i.e., create/read/update/delete circuits in the FPGA, without requiring pre-definition during the design phase. Our primitives are consolidated in a novel <i>Stealthy Reconfigurable Adaptive Trojan</i> (<monospace>STRAT</monospace>) framework to conceal Trojans and evade state-of-the-art netlist reverse engineering methods. As FPGAs become integral to modern cloud computing, this research presents crucial insights on potential security risks, including the possibility of a malicious tenant or provider altering or spying on another tenant’s configuration undetected.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"83 4","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138504977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binary neural networks (BNNs), where both the weights and the activation values are represented with one bit, provide an attractive alternative for deploying highly efficient deep learning inference on resource-constrained edge devices. However, our investigation reveals that, to achieve satisfactory accuracy gains, state-of-the-art (SOTA) BNNs, such as FracBNN and ReActNet, usually have to incorporate various auxiliary floating-point components and increase the model size, which in turn degrades hardware performance efficiency. In this paper, we aim to quantify such hardware inefficiency in SOTA BNNs and further mitigate it with negligible accuracy loss. First, we observe that the auxiliary floating-point (AFP) components consume an average of 93% of DSPs, 46% of LUTs, and 62% of FFs in the entire BNN accelerator's resource utilization. To mitigate this overhead, we propose a novel algorithm-hardware co-design, called FuseBNN, to fuse those AFP operators without hurting accuracy. On average, FuseBNN reduces AFP resource utilization to 59% of DSPs, 13% of LUTs, and 16% of FFs. Second, SOTA BNNs often use the compact MobileNetV1 as the backbone network but have to replace the lightweight 3×3 depth-wise convolution (DWC) with the 3×3 standard convolution (SC, e.g., in ReActNet and our ReActNet-adapted BaseBNN) or even more complex fractional 3×3 SC (e.g., in FracBNN) to bridge the accuracy gap. As a result, the model parameter size is significantly increased, becoming 2.25× larger than that of 4-bit direct quantization with the original DWC (4-Bit-Net); the number of multiply-accumulate operations also increases significantly, so the overall LUT usage of BaseBNN is almost the same as that of 4-Bit-Net. To address this issue, we propose HyBNN, where we binarize depth-wise separable convolution (DSC) blocks for the first time to decrease the model size and incorporate 4-bit DSC blocks to compensate for the accuracy loss. For the ship detection task in synthetic aperture radar imagery on the AMD-Xilinx ZCU102 FPGA, HyBNN achieves a detection accuracy of 94.8% and a detection speed of 615 frames per second (FPS), which is 6.8× faster than FuseBNN+ (94.9% accuracy) and 2.7× faster than 4-Bit-Net (95.9% accuracy). For image classification on the CIFAR-10 dataset on the AMD-Xilinx Ultra96-V2 FPGA, HyBNN achieves a 1.5× speedup and 0.7% better accuracy over SOTA FracBNN.
{"title":"HyBNN: Quantifying and Optimizing Hardware Efficiency of Binary Neural Networks","authors":"Geng Yang, Jie Lei, Zhenman Fang, Yunsong li, Jiaqing Zhang, Weiying Xie","doi":"10.1145/3631610","DOIUrl":"https://doi.org/10.1145/3631610","url":null,"abstract":"Binary neural network (BNN), where both the weight and the activation values are represented with one bit, provides an attractive alternative to deploy highly efficient deep learning inference on resource-constrained edge devices. However, our investigation reveals that, to achieve satisfactory accuracy gains, state-of-the-art (SOTA) BNNs, such as FracBNN and ReActNet, usually have to incorporate various auxiliary floating-point components and increase the model size, which in turn degrades the hardware performance efficiency. In this paper, we aim to quantify such hardware inefficiency in SOTA BNNs and further mitigate it with negligible accuracy loss. First, we observe that the auxiliary floating-point (AFP) components consume an average of 93% DSPs, 46% LUTs, and 62% FFs, among the entire BNN accelerator resource utilization. To mitigate such overhead, we propose a novel algorithm-hardware co-design, called FuseBNN , to fuse those AFP operators without hurting the accuracy. On average, FuseBNN reduces AFP resource utilization to 59% DSPs, 13% LUTs, and 16% FFs. Second, SOTA BNNs often use the compact MobileNetV1 as the backbone network but have to replace the lightweight 3 × 3 depth-wise convolution (DWC) with the 3 × 3 standard convolution (SC, e.g., in ReActNet and our ReActNet-adapted BaseBNN) or even more complex fractional 3 × 3 SC (e.g., in FracBNN) to bridge the accuracy gap. As a result, the model parameter size is significantly increased and becomes 2.25 × larger than that of the 4-bit direct quantization with the original DWC (4-Bit-Net); the number of multiply-accumulate operations is also significantly increased so that the overall LUT resource usage of BaseBNN is almost the same as that of 4-Bit-Net. To address this issue, we propose HyBNN , where we binarize depth-wise separation convolution (DSC) blocks for the first time to decrease the model size and incorporate 4-bit DSC blocks to compensate for the accuracy loss. For the ship detection task in synthetic aperture radar imagery on the AMD-Xilinx ZCU102 FPGA, HyBNN achieves a detection accuracy of 94.8% and a detection speed of 615 frames per second (FPS), which is 6.8 × faster than FuseBNN+ (94.9% accuracy) and 2.7 × faster than 4-Bit-Net (95.9% accuracy). For image classification on the CIFAR-10 dataset on the AMD-Xilinx Ultra96-V2 FPGA, HyBNN achieves 1.5 × speedup and 0.7% better accuracy over SOTA FracBNN.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"279 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135475095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents Eciton, a very low-power recurrent neural network accelerator for time-series data within low-power edge sensor nodes, achieving real-time inference with a power consumption of 17 mW under load. Eciton reduces memory and chip resource requirements via 8-bit quantization and hard sigmoid activation, allowing the accelerator, as well as the recurrent neural network model parameters, to fit in a low-cost, low-power Lattice iCE40 UP5K FPGA. We evaluate Eciton on multiple established time-series classification applications, including predictive maintenance of mechanical systems, sound classification, and intrusion detection for IoT nodes. Binary and multi-class classification edge models are explored, demonstrating that Eciton can adapt to a variety of deployable environments and remote use cases. Eciton demonstrates real-time processing at very low power consumption with minimal loss of accuracy across multiple inference scenarios with differing characteristics, while achieving competitive power efficiency against the state of the art of similar scale. We show that the addition of this accelerator actually reduces the power budget of the sensor node by cutting power-hungry wireless transmission. The resulting power budget of the sensor node is small enough for it to be powered by a power harvester, potentially allowing it to run indefinitely without a battery or periodic maintenance.
{"title":"Eciton: Very Low-Power Recurrent Neural Network Accelerator for Real-Time Inference at the Edge","authors":"Jeffrey Chen, Sang-Woo Jun, Sehwan Hong, Warrick He, Jinyeong Moon","doi":"10.1145/3629979","DOIUrl":"https://doi.org/10.1145/3629979","url":null,"abstract":"This paper presents Eciton, a very low-power recurrent neural network accelerator for time series data within low-power edge sensor nodes, achieving real-time inference with a power consumption of 17 mW under load. Eciton reduces memory and chip resource requirements via 8-bit quantization and hard sigmoid activation, allowing the accelerator as well as the recurrent neural network model parameters to fit in a low-cost, low-power Lattice iCE40 UP5K FPGA. We evaluate Eciton on multiple, established time-series classification applications including predictive maintenance of mechanical systems, sound classification, and intrusion detection for IoT nodes. Binary and multi-class classification edge models are explored, demonstrating that Eciton can adapt to a variety of deployable environments and remote use cases. Eciton demonstrates real-time processing at a very low power consumption with minimal loss of accuracy on multiple inference scenarios with differing characteristics, while achieving competitive power efficiency against the state-of-the-art of similar scale. We show that the addition of this accelerator actually reduces the power budget of the sensor node by reducing power-hungry wireless transmission. The resulting power budget of the sensor node is small enough to be powered by a power harvester, potentially allowing it to run indefinitely without a battery or periodic maintenance.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"174 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135371182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning models are becoming more complex and heterogeneous, with new layer types introduced to improve their accuracy. This brings a considerable challenge to the designers of deep neural network accelerators. Several architectures and design flows exist to map deep learning models onto hardware, but they are limited to particular models and/or layer types. Also, the architectures generated by these tools generally target high-performance devices and are not appropriate for embedded computing. This paper proposes a multi-engine architecture and a design flow to implement deep learning models on FPGA. The hardware design uses high-level synthesis to allow design space exploration. The architecture is scalable and therefore applicable to FPGAs of any density. The architecture and design flow were applied to the development of a hardware/software system for image classification with ResNet50, object detection with YOLOv3-Tiny, and image segmentation with DeepLabv3+. The system was tested on a low-density Zynq UltraScale+ ZU3EG FPGA to show its scalability. The results show that the proposed multi-engine architecture generates efficient accelerators: the ResNet50 accelerator with 4-bit quantization achieves 67 FPS, the YOLOv3-Tiny object detector achieves a throughput of 36 FPS, and the image segmentation application achieves 1.4 FPS.
{"title":"Designing Deep Learning Models on FPGA with Multiple Heterogeneous Engines","authors":"Miguel Reis, Mário Véstias, Horácio Neto","doi":"10.1145/3615870","DOIUrl":"https://doi.org/10.1145/3615870","url":null,"abstract":"Deep learning models are becoming more complex and heterogeneous with new layer types to improve their accuracy. This brings a considerable challenge to the designers of accelerators of deep neural networks. There have been several architectures and design flows to map deep learning models on hardware, but they are limited to a particular model and/or layer types. Also, the architectures generated by these tools target, in general, high-performance devices, not appropriate for embedded computing. This paper proposes a multi-engine architecture and a design flow to implement deep learning models on FPGA. The hardware design uses high-level synthesis to allow design space exploration. The architecture is scalable and therefore applicable to any density FPGAs. The architecture and design flow were applied to the development of a hardware/software system for image classification with ResNet50, object detection with YOLOv3-Tiny and image segmentation with Deeplabv3+. The system was tested in a low-density Zynq UltraScale+ ZU3EG FPGA to show its scalability. The results show that the proposed multi-engine architecture generates efficient accelerators. An accelerator of ResNet50 with a 4-bit quantization achieves 67 FPS, and the object detector with YOLOv3-Tiny with a throughput of 36 FPS and the image segmentation application achieves 1.4 FPS.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136353080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators. While it is well understood how to do this for, e.g., network function virtualisation and protocols such as TCP/IP, support for higher networking layers is still largely missing, limiting the potential of accelerators. In this paper, we present Strega, an open-source, lightweight HTTP server that enables crucial functionality such as FPGA-accelerated functions being called through a RESTful protocol (FPGA-as-a-Function). Our experimental analysis shows that a single Strega node sustains a throughput of 1.7M HTTP requests per second with an end-to-end latency as low as 16 μs, outperforming nginx running on 32 vCPUs in both metrics, and can even be an alternative to the traditional OpenCL flow over the PCIe bus. Through this work, we pave the way for running microservices directly on FPGAs, bypassing CPU overhead and realising the full potential of FPGA acceleration in distributed cloud applications.
{"title":"<scp>Strega</scp> : An HTTP Server for FPGAs","authors":"Fabio Maschi, Gustavo Alonso","doi":"10.1145/3611312","DOIUrl":"https://doi.org/10.1145/3611312","url":null,"abstract":"The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators. While it is well understood how to do this for, e.g., network function virtualisation and protocols such as TCP/IP, support for higher networking layers is still largely missing, limiting the potential of accelerators. In this paper, we present S trega , an open-source 1 light-weight HTTP server that enables crucial functionality such as FPGA-accelerated functions being called through a RESTful protocol (FPGA-as-a-Function). Our experimental analysis shows that a single S trega node sustains a throughput of 1.7 M HTTP requests per second with an end-to-end latency as low as 16 μ s, outperforming nginx running on 32 vCPUs in both metrics, and can even be an alternative to the traditional OpenCL flow over the PCIe bus. Through this work, we pave the way for running microservices directly on FPGAs, bypassing CPU overhead and realising the full potential of FPGA acceleration in distributed cloud applications.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136295278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the use of Artificial Intelligence (AI) techniques spreads across the economy, researchers are exploring new ways to reduce the energy consumption of Neural Network (NN) applications, especially as the complexity of NNs continues to increase. Using analog Resistive RAM (ReRAM) devices to compute Matrix-Vector Multiplication (MVM) in O(1) time complexity is a promising approach, but such implementations often fail to cover the diversity of nonlinearities required for modern NN applications. In this work, we propose a novel approach in which the ReRAMs themselves can be reprogrammed to compute not only the required matrix multiplications but also the activation functions, softmax, and pooling layers, reducing energy in complex NNs. This approach offers more versatility for researching novel NN layouts compared to custom logic. Results show that our device outperforms analog and digital field-programmable approaches by up to 8.5× in experiments on real-world human activity recognition and language modeling datasets with Convolutional Neural Networks (CNNs), Generative Pre-trained Transformer (GPT), and Long Short-Term Memory (LSTM) models.
{"title":"Reprogrammable non-linear circuits using ReRAM for NN accelerators","authors":"Rafael Fão de Moura, Luigi Carro","doi":"10.1145/3617894","DOIUrl":"https://doi.org/10.1145/3617894","url":null,"abstract":"As the massive usage of Artificial Intelligence (AI) techniques spreads in the economy, researchers are exploring new techniques to reduce the energy consumption of Neural Network (NN) applications, especially as the complexity of NNs continues to increase. Using analog Resistive RAM (ReRAM) devices to compute Matrix-Vector Multiplication (MVM) in O (1) time complexity is a promising approach, but it’s true that these implementations often fail to cover the diversity of nonlinearities required for modern NN applications. In this work, we propose a novel approach where ReRAMs themselves can be reprogrammed to compute not only the required matrix multiplications, but also the activation functions, softmax, and pooling layers, reducing energy in complex NNs. This approach offers more versatility for researching novel NN layouts compared to custom logic. Results show that our device outperforms analog and digital Field Programmable approaches by up to 8.5x in experiments on real-world human activity recognition and language modeling datasets with Convolutional Neural Networks (CNNs), Generative Pre-trained Transformer (GPT), and Long Short-Term Memory (LSTM) models.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136295688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coarse-grained reconfigurable architectures (CGRAs) have emerged as promising accelerators due to their high flexibility and energy efficiency. However, existing open-source works often lack integration of CGRAs with CPU systems and the corresponding toolchains. Moreover, support for accelerator instruction pipelining, which overlaps data communication, computation, and configuration across multiple tasks, is rare. In this paper, we propose FDRA, an open-source exploration framework for a heterogeneous system-on-chip (SoC) with a RISC-V processor and a dynamically reconfigurable accelerator (DRA) supporting loop-, instruction-, and task-level parallelism. FDRA encompasses parameterized SoC modeling, Verilog generation, source-to-source application code transformation using frontend and DRA compilers, SoC simulation, and FPGA prototyping. FDRA incorporates the extraction of periodic accumulative operators and multidimensional linear load/store operators from nested loops. The DRA can access the shared L2 cache with virtual addresses and supports direct memory access (DMA) with arbitrary start addresses and data lengths. Integrated into the RISC-V Rocket SoC, our DRA achieves a remarkable 55× acceleration for loop kernels and improves energy efficiency by 29×. Compared to state-of-the-art RISC-V vector units, our DRA demonstrates a 2.9× speed improvement and 3.5× greater energy efficiency. In contrast to previous CGRA+RISC-V SoCs, our SoC achieves a minimum speedup of 5.2×.
{"title":"FDRA: A Framework for Dynamically Reconfigurable Accelerator Supporting Multi-Level Parallelism","authors":"Yunhui Qiu, Yiqing Mao, Xuchen Gao, Sichao Chen, Jiangnan Li, Wenbo Yin, Lingli Wang","doi":"10.1145/3614224","DOIUrl":"https://doi.org/10.1145/3614224","url":null,"abstract":"Coarse-grained reconfigurable architectures (CGRAs) have emerged as promising accelerators due to their high flexibility and energy efficiency. However, existing open-source works often lack integration of CGRAs with CPU systems and corresponding toolchains. Moreover, there is rare support for the accelerator instruction pipelining to overlap data communication, computation, and configuration across multiple tasks. In this paper, we propose FDRA, an open-source exploration framework for a heterogeneous system-on-chip (SoC) with a RISC-V processor and a dynamically reconfigurable accelerator (DRA) supporting loop, instruction, and task levels of parallelism. FDRA encompasses parameterized SoC modeling, Verilog generation, source-to-source application code transformation using frontend and DRA compilers, SoC simulation, and FPGA prototyping. FDRA incorporates the extraction of periodic accumulative operators and multidimensional linear load/store operators from nested loops. The DRA enables accessing the shared L2 cache with virtual addresses and supports direct memory access (DMA) with arbitrary start addresses and data lengths. Integrated into the RISC-V Rocket SoC, our DRA achieves a remarkable 55 × acceleration for loop kernels and improves energy efficiency by 29 ×. Compared to state-of-the-art RISC-V vector units, our DRA demonstrates a 2.9 × speed improvement and 3.5 × greater energy efficiency. In contrast to previous CGRA+RISC-V SoCs, our SoC achieves a minimum speedup of 5.2 ×.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136295683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Numerous approximate computing (AC) techniques have been developed to reduce design costs in error-resilient application domains, such as signal and multimedia processing, data mining, machine learning, and computer vision, by trading off computation accuracy for area and power savings or performance improvements. Selecting adequate techniques for each application and optimization target is complex but crucial for high-quality results. In this context, Approximate High-Level Synthesis (AHLS) tools have been proposed to alleviate the burden of hand-crafting approximate circuits by automating the exploitation of AC techniques. However, such tools are typically tied to a specific approximation technique or a difficult-to-extend set of techniques whose exploitation is not fully automated or steered by optimization targets. Therefore, available AHLS tools overlook the benefits of expanding the design space by mixing diverse approximation techniques to meet specific design objectives with minimum error. In this work, we propose an AHLS design methodology for FPGAs that automatically identifies efficient combinations of multiple approximation techniques for different applications and design constraints. Compared to single-technique approaches, decreases of up to 30% in mean squared error and absolute increases of up to 6.5% in percentage accuracy were obtained for a set of image, video, signal processing, and machine learning benchmarks.
{"title":"Constraint-Aware Multi-Technique Approximate High-Level Synthesis for FPGAs","authors":"Marcos T. Leipnitz, Gabriel L. Nazar","doi":"10.1145/3624481","DOIUrl":"https://doi.org/10.1145/3624481","url":null,"abstract":"Numerous approximate computing (AC) techniques have been developed to reduce the design costs in error-resilient application domains, such as signal and multimedia processing, data mining, machine learning, and computer vision, to trade-off computation accuracy with area and power savings or performance improvements. Selecting adequate techniques for each application and optimization target is complex but crucial for high-quality results. In this context, Approximate High-Level Synthesis (AHLS) tools have been proposed to alleviate the burden of hand-crafting approximate circuits by automating the exploitation of AC techniques. However, such tools are typically tied to a specific approximation technique or a difficult-to-extend set of techniques whose exploitation is not fully automated or steered by optimization targets. Therefore, available AHLS tools overlook the benefits of expanding the design space by mixing diverse approximation techniques toward meeting specific design objectives with minimum error. In this work, we propose an AHLS design methodology for FPGAs that automatically identifies efficient combinations of multiple approximation techniques for different applications and design constraints. Compared to single-technique approaches, decreases of up to 30% in mean squared error and absolute increases of up to 6.5% in percentage accuracy were obtained for a set of image, video, signal processing and machine learning benchmarks.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135043608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}