Jens Trautmann, Paul Krüger, Andreas Becher, Stefan Wildermann, Jürgen Teich
Digitizing side-channel signals at high sampling rates produces huge amounts of data, while side-channel analysis techniques only need those specific trace segments containing Cryptographic Operations (COs). For detecting these segments, waveform-matching techniques have been established comparing the signal with a template of the CO’s characteristic pattern. Real-time waveform matching requires highly parallel implementations as achieved by hardware design but also reconfigurability as provided by FPGAs to adapt the matching hardware to a specific CO pattern. However, currently proposed designs process the samples from analog-to-digital converters sequentially and can only process low sampling rates due to the limited clock speed of FPGAs.
In this paper, we present a parallel waveform-matching architecture capable of performing high-speed waveform matching on a high-end FPGA-based digitizer. We also present a workflow for calibrating the waveform-matching system to the specific pattern of the CO in the presence of hardware restrictions provided by the FPGA hardware. Our implementation enables waveform matching at 10 GS/s, offering a speedup of 50x compared to the fastest state-of-the-art implementation known to us. We demonstrate how to apply the technique for attacking the widespread XTS-AES algorithm using waveform matching to recover the encrypted tweak even in the presence of so-called systemic noise.
{"title":"Design, Calibration, and Evaluation of Real-Time Waveform Matching on an FPGA-based Digitizer at 10 GS/s","authors":"Jens Trautmann, Paul Krüger, Andreas Becher, Stefan Wildermann, Jürgen Teich","doi":"10.1145/3635719","DOIUrl":"https://doi.org/10.1145/3635719","url":null,"abstract":"<p>Digitizing side-channel signals at high sampling rates produces huge amounts of data, while side-channel analysis techniques only need those specific trace segments containing Cryptographic Operations (COs). For detecting these segments, waveform-matching techniques have been established comparing the signal with a template of the CO’s characteristic pattern. Real-time waveform matching requires highly parallel implementations as achieved by hardware design but also reconfigurability as provided by FPGAs to adapt the matching hardware to a specific CO pattern. However, currently proposed designs process the samples from analog-to-digital converters sequentially and can only process low sampling rates due to the limited clock speed of FPGAs. </p><p>In this paper, we present a parallel waveform-matching architecture capable of performing high-speed waveform matching on a high-end FPGA-based digitizer. We also present a workflow for calibrating the waveform-matching system to the specific pattern of the CO in the presence of hardware restrictions provided by the FPGA hardware. Our implementation enables waveform matching at 10 GS/s, offering a speedup of 50x compared to the fastest state-of-the-art implementation known to us. We demonstrate how to apply the technique for attacking the widespread XTS-AES algorithm using waveform matching to recover the encrypted tweak even in the presence of so-called systemic noise.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Section on FCCM 2022","authors":"Jing Li, Martin Herbordt","doi":"10.1145/3632092","DOIUrl":"https://doi.org/10.1145/3632092","url":null,"abstract":"<p>No abstract available.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zimeng Fan, Wei Hu, Fang Liu, Dian Xu, Hong Guo, Yanxiang He, Min Peng
In computer vision, the joint development of the algorithm and computing dimensions cannot be separated. Models and algorithms are constantly evolving, while hardware designs must adapt to new or updated algorithms. Reconfigurable devices are recognized as important platforms for computer vision applications because of their reconfigurability. There are two typical design approaches: customized and overlay design. However, existing work is unable to achieve both efficient performance and scalability to adapt to a wide range of models. To address both considerations, we propose a design framework based on reconfigurable devices to provide unified support for computer vision models. It provides software-programmable modules while leaving unit design space for problem-specific algorithms. Based on the proposed framework, we design a model mapping method and a hardware architecture with two processor arrays to enable dynamic and static reconfiguration, thereby relieving redesign pressure. In addition, resource consumption and efficiency can be balanced by adjusting the hyperparameter. In experiments on CNN, vision Transformer, and vision MLP models, our work’s throughput is improved by 18.8x–33.6x and 1.4x–2.0x compared to CPU and GPU. Compared to others on the same platform, accelerators based on our framework can better balance resource consumption and efficiency.
{"title":"A hardware design framework for computer vision models based on reconfigurable devices","authors":"Zimeng Fan, Wei Hu, Fang Liu, Dian Xu, Hong Guo, Yanxiang He, Min Peng","doi":"10.1145/3635157","DOIUrl":"https://doi.org/10.1145/3635157","url":null,"abstract":"<p>In computer vision, the joint development of the algorithm and computing dimensions cannot be separated. Models and algorithms are constantly evolving, while hardware designs must adapt to new or updated algorithms. Reconfigurable devices are recognized as important platforms for computer vision applications because of their reconfigurability. There are two typical design approaches: customized and overlay design. However, existing work is unable to achieve both efficient performance and scalability to adapt to a wide range of models. To address both considerations, we propose a design framework based on reconfigurable devices to provide unified support for computer vision models. It provides software-programmable modules while leaving unit design space for problem-specific algorithms. Based on the proposed framework, we design a model mapping method and a hardware architecture with two processor arrays to enable dynamic and static reconfiguration, thereby relieving redesign pressure. In addition, resource consumption and efficiency can be balanced by adjusting the hyperparameter. In experiments on CNN, vision Transformer, and vision MLP models, our work’s throughput is improved by 18.8x–33.6x and 1.4x–2.0x compared to CPU and GPU. Compared to others on the same platform, accelerators based on our framework can better balance resource consumption and efficiency.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anupreetham Anupreetham, Mohamed Ibrahim, Mathew Hall, Andrew Boutros, Ajay Kuzhively, Abinash Mohanty, Eriko Nurvitadhi, Vaughn Betz, Yu Cao, Jae-sun Seo
Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3 × higher throughput and 5 × lower latency compared to the best prior FPGA-based solution with comparable accuracy.
{"title":"High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design","authors":"Anupreetham Anupreetham, Mohamed Ibrahim, Mathew Hall, Andrew Boutros, Ajay Kuzhively, Abinash Mohanty, Eriko Nurvitadhi, Vaughn Betz, Yu Cao, Jae-sun Seo","doi":"10.1145/3634919","DOIUrl":"https://doi.org/10.1145/3634919","url":null,"abstract":"<p>Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3 × higher throughput and 5 × lower latency compared to the best prior FPGA-based solution with comparable accuracy.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud computing providers today offer access to a variety of devices, which users can rent and access remotely in a shared setting. Among these devices are SmartSSDs, which a solid-state disks (SSD) augmented with an FPGA, enabling users to instantiate custom circuits within the FPGA, including potentially malicious circuits for power and temperature measurement. Normally, cloud users have no remote access to power and temperature data, but with SmartSSDs they could abuse the FPGA component to instantiate circuits to learn this information. Additionally, custom power waster circuits can be instantiated within the FPGA. This paper shows for the first time that by leveraging ring oscillator sensors and power wasters, numerous covert-channels in FPGA-enabled SmartSSDs could be used to transmit information. This work presents two channels in single-tenant setting (SmartSSD is used by one user at a time) and two channels in multi-tenant setting (FPGA and SSD inside SmartSSD is shared by different users). The presented covert channels can reach close to 100% accuracy. Meanwhile, bandwidth of the channels can be easily scaled by cloud users renting more SmartSSDs as the bandwidth of the covert channels is proportional to number of SmartSSD used.
{"title":"Covert-channels in FPGA-enabled SmartSSDs","authors":"Theodoros Trochatos, Anthony Etim, Jakub Szefer","doi":"10.1145/3635312","DOIUrl":"https://doi.org/10.1145/3635312","url":null,"abstract":"<p>Cloud computing providers today offer access to a variety of devices, which users can rent and access remotely in a shared setting. Among these devices are SmartSSDs, which a solid-state disks (SSD) augmented with an FPGA, enabling users to instantiate custom circuits within the FPGA, including potentially malicious circuits for power and temperature measurement. Normally, cloud users have no remote access to power and temperature data, but with SmartSSDs they could abuse the FPGA component to instantiate circuits to learn this information. Additionally, custom power waster circuits can be instantiated within the FPGA. This paper shows for the first time that by leveraging ring oscillator sensors and power wasters, numerous covert-channels in FPGA-enabled SmartSSDs could be used to transmit information. This work presents two channels in single-tenant setting (SmartSSD is used by one user at a time) and two channels in multi-tenant setting (FPGA and SSD inside SmartSSD is shared by different users). The presented covert channels can reach close to 100% accuracy. Meanwhile, bandwidth of the channels can be easily scaled by cloud users renting more SmartSSDs as the bandwidth of the covert channels is proportional to number of SmartSSD used.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emanuele Del Sozzo, Davide Conficconi, Kentaro Sano
Stencil-based applications play an essential role in high-performance systems as they occur in numerous computational areas, such as partial differential equation solving. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional point grid multiple times or until convergence. However, due to their iterative and intensive nature, ISLs are highly performance-hungry, demanding specialized solutions. Here, Field Programmable Gate Arrays (FPGAs) represent a valid architectural choice as they enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this paper introduces Senju, an automation framework for the design of highly parallel ISL accelerators targeting single-/multi-FPGA systems. Given an input description, Senju automates the entire design process and provides accurate performance estimations. The experimental evaluation shows remarkable and scalable results, outperforming single- and multi-FPGA literature approaches under different metrics. Finally, we present a new analysis of temporal and spatial parallelism trade-offs in a real-case scenario and discuss our performance through a single- and novel specialized multi-FPGA formulation of the Roofline Model.
{"title":"Across Time and Space: Senju’s Approach for Scaling Iterative Stencil Loop Accelerators on Single and Multiple FPGAs","authors":"Emanuele Del Sozzo, Davide Conficconi, Kentaro Sano","doi":"10.1145/3634920","DOIUrl":"https://doi.org/10.1145/3634920","url":null,"abstract":"<p>Stencil-based applications play an essential role in high-performance systems as they occur in numerous computational areas, such as partial differential equation solving. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional point grid multiple times or until convergence. However, due to their iterative and intensive nature, ISLs are highly performance-hungry, demanding specialized solutions. Here, Field Programmable Gate Arrays (FPGAs) represent a valid architectural choice as they enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this paper introduces <span>Senju</span>, an automation framework for the design of highly parallel ISL accelerators targeting single-/multi-FPGA systems. Given an input description, <span>Senju</span> automates the entire design process and provides accurate performance estimations. The experimental evaluation shows remarkable and scalable results, outperforming single- and multi-FPGA literature approaches under different metrics. Finally, we present a new analysis of temporal and spatial parallelism trade-offs in a real-case scenario and discuss our performance through a single- and novel specialized multi-FPGA formulation of the Roofline Model.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nils Albartus, Maik Ender, Jan-Niklas Möller, Marc Fyrbiak, Christof Paar, Russell Tessier
FPGAs have become increasingly popular in computing platforms. With recent advances in bitstream format reverse engineering, the scientific community has widely explored static FPGA security threats. For example, it is now possible to convert a bitstream to a netlist, revealing design information, and apply modifications to the static bitstream based on this knowledge. However, a systematic study of the influence of the bitstream format understanding in regards to the security aspects of the dynamic configuration process, particularly for Xilinx’s Internal Configuration Access Port (ICAP), is lacking. This paper fills this gap by comprehensively analyzing the security implications of ICAP interfaces, which primarily support dynamic partial reconfiguration. We delve into the Xilinx bitstream file format, identify misconceptions in official documentation, and propose novel configuration (attack) primitives based on dynamic reconfiguration, i.e., create/read/update/delete circuits in the FPGA, without requiring pre-definition during the design phase. Our primitives are consolidated in a novel Stealthy Reconfigurable Adaptive Trojan (STRAT) framework to conceal Trojans and evade state-of-the-art netlist reverse engineering methods. As FPGAs become integral to modern cloud computing, this research presents crucial insights on potential security risks, including the possibility of a malicious tenant or provider altering or spying on another tenant’s configuration undetected.
{"title":"On the Malicious Potential of Xilinx’ Internal Configuration Access Port (ICAP)","authors":"Nils Albartus, Maik Ender, Jan-Niklas Möller, Marc Fyrbiak, Christof Paar, Russell Tessier","doi":"10.1145/3633204","DOIUrl":"https://doi.org/10.1145/3633204","url":null,"abstract":"<p>FPGAs have become increasingly popular in computing platforms. With recent advances in bitstream format reverse engineering, the scientific community has widely explored static FPGA security threats. For example, it is now possible to convert a bitstream to a netlist, revealing design information, and apply modifications to the static bitstream based on this knowledge. However, a systematic study of the influence of the bitstream format understanding in regards to the security aspects of the dynamic configuration process, particularly for Xilinx’s Internal Configuration Access Port (ICAP), is lacking. This paper fills this gap by comprehensively analyzing the security implications of ICAP interfaces, which primarily support dynamic partial reconfiguration. We delve into the Xilinx bitstream file format, identify misconceptions in official documentation, and propose novel configuration (attack) primitives based on dynamic reconfiguration, i.e., create/read/update/delete circuits in the FPGA, without requiring pre-definition during the design phase. Our primitives are consolidated in a novel <i>Stealthy Reconfigurable Adaptive Trojan</i> (<monospace>STRAT</monospace>) framework to conceal Trojans and evade state-of-the-art netlist reverse engineering methods. As FPGAs become integral to modern cloud computing, this research presents crucial insights on potential security risks, including the possibility of a malicious tenant or provider altering or spying on another tenant’s configuration undetected.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138504977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binary neural network (BNN), where both the weight and the activation values are represented with one bit, provides an attractive alternative to deploy highly efficient deep learning inference on resource-constrained edge devices. However, our investigation reveals that, to achieve satisfactory accuracy gains, state-of-the-art (SOTA) BNNs, such as FracBNN and ReActNet, usually have to incorporate various auxiliary floating-point components and increase the model size, which in turn degrades the hardware performance efficiency. In this paper, we aim to quantify such hardware inefficiency in SOTA BNNs and further mitigate it with negligible accuracy loss. First, we observe that the auxiliary floating-point (AFP) components consume an average of 93% DSPs, 46% LUTs, and 62% FFs, among the entire BNN accelerator resource utilization. To mitigate such overhead, we propose a novel algorithm-hardware co-design, called FuseBNN , to fuse those AFP operators without hurting the accuracy. On average, FuseBNN reduces AFP resource utilization to 59% DSPs, 13% LUTs, and 16% FFs. Second, SOTA BNNs often use the compact MobileNetV1 as the backbone network but have to replace the lightweight 3 × 3 depth-wise convolution (DWC) with the 3 × 3 standard convolution (SC, e.g., in ReActNet and our ReActNet-adapted BaseBNN) or even more complex fractional 3 × 3 SC (e.g., in FracBNN) to bridge the accuracy gap. As a result, the model parameter size is significantly increased and becomes 2.25 × larger than that of the 4-bit direct quantization with the original DWC (4-Bit-Net); the number of multiply-accumulate operations is also significantly increased so that the overall LUT resource usage of BaseBNN is almost the same as that of 4-Bit-Net. To address this issue, we propose HyBNN , where we binarize depth-wise separation convolution (DSC) blocks for the first time to decrease the model size and incorporate 4-bit DSC blocks to compensate for the accuracy loss. For the ship detection task in synthetic aperture radar imagery on the AMD-Xilinx ZCU102 FPGA, HyBNN achieves a detection accuracy of 94.8% and a detection speed of 615 frames per second (FPS), which is 6.8 × faster than FuseBNN+ (94.9% accuracy) and 2.7 × faster than 4-Bit-Net (95.9% accuracy). For image classification on the CIFAR-10 dataset on the AMD-Xilinx Ultra96-V2 FPGA, HyBNN achieves 1.5 × speedup and 0.7% better accuracy over SOTA FracBNN.
{"title":"HyBNN: Quantifying and Optimizing Hardware Efficiency of Binary Neural Networks","authors":"Geng Yang, Jie Lei, Zhenman Fang, Yunsong li, Jiaqing Zhang, Weiying Xie","doi":"10.1145/3631610","DOIUrl":"https://doi.org/10.1145/3631610","url":null,"abstract":"Binary neural network (BNN), where both the weight and the activation values are represented with one bit, provides an attractive alternative to deploy highly efficient deep learning inference on resource-constrained edge devices. However, our investigation reveals that, to achieve satisfactory accuracy gains, state-of-the-art (SOTA) BNNs, such as FracBNN and ReActNet, usually have to incorporate various auxiliary floating-point components and increase the model size, which in turn degrades the hardware performance efficiency. In this paper, we aim to quantify such hardware inefficiency in SOTA BNNs and further mitigate it with negligible accuracy loss. First, we observe that the auxiliary floating-point (AFP) components consume an average of 93% DSPs, 46% LUTs, and 62% FFs, among the entire BNN accelerator resource utilization. To mitigate such overhead, we propose a novel algorithm-hardware co-design, called FuseBNN , to fuse those AFP operators without hurting the accuracy. On average, FuseBNN reduces AFP resource utilization to 59% DSPs, 13% LUTs, and 16% FFs. Second, SOTA BNNs often use the compact MobileNetV1 as the backbone network but have to replace the lightweight 3 × 3 depth-wise convolution (DWC) with the 3 × 3 standard convolution (SC, e.g., in ReActNet and our ReActNet-adapted BaseBNN) or even more complex fractional 3 × 3 SC (e.g., in FracBNN) to bridge the accuracy gap. As a result, the model parameter size is significantly increased and becomes 2.25 × larger than that of the 4-bit direct quantization with the original DWC (4-Bit-Net); the number of multiply-accumulate operations is also significantly increased so that the overall LUT resource usage of BaseBNN is almost the same as that of 4-Bit-Net. To address this issue, we propose HyBNN , where we binarize depth-wise separation convolution (DSC) blocks for the first time to decrease the model size and incorporate 4-bit DSC blocks to compensate for the accuracy loss. For the ship detection task in synthetic aperture radar imagery on the AMD-Xilinx ZCU102 FPGA, HyBNN achieves a detection accuracy of 94.8% and a detection speed of 615 frames per second (FPS), which is 6.8 × faster than FuseBNN+ (94.9% accuracy) and 2.7 × faster than 4-Bit-Net (95.9% accuracy). For image classification on the CIFAR-10 dataset on the AMD-Xilinx Ultra96-V2 FPGA, HyBNN achieves 1.5 × speedup and 0.7% better accuracy over SOTA FracBNN.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135475095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents Eciton, a very low-power recurrent neural network accelerator for time series data within low-power edge sensor nodes, achieving real-time inference with a power consumption of 17 mW under load. Eciton reduces memory and chip resource requirements via 8-bit quantization and hard sigmoid activation, allowing the accelerator as well as the recurrent neural network model parameters to fit in a low-cost, low-power Lattice iCE40 UP5K FPGA. We evaluate Eciton on multiple, established time-series classification applications including predictive maintenance of mechanical systems, sound classification, and intrusion detection for IoT nodes. Binary and multi-class classification edge models are explored, demonstrating that Eciton can adapt to a variety of deployable environments and remote use cases. Eciton demonstrates real-time processing at a very low power consumption with minimal loss of accuracy on multiple inference scenarios with differing characteristics, while achieving competitive power efficiency against the state-of-the-art of similar scale. We show that the addition of this accelerator actually reduces the power budget of the sensor node by reducing power-hungry wireless transmission. The resulting power budget of the sensor node is small enough to be powered by a power harvester, potentially allowing it to run indefinitely without a battery or periodic maintenance.
{"title":"Eciton: Very Low-Power Recurrent Neural Network Accelerator for Real-Time Inference at the Edge","authors":"Jeffrey Chen, Sang-Woo Jun, Sehwan Hong, Warrick He, Jinyeong Moon","doi":"10.1145/3629979","DOIUrl":"https://doi.org/10.1145/3629979","url":null,"abstract":"This paper presents Eciton, a very low-power recurrent neural network accelerator for time series data within low-power edge sensor nodes, achieving real-time inference with a power consumption of 17 mW under load. Eciton reduces memory and chip resource requirements via 8-bit quantization and hard sigmoid activation, allowing the accelerator as well as the recurrent neural network model parameters to fit in a low-cost, low-power Lattice iCE40 UP5K FPGA. We evaluate Eciton on multiple, established time-series classification applications including predictive maintenance of mechanical systems, sound classification, and intrusion detection for IoT nodes. Binary and multi-class classification edge models are explored, demonstrating that Eciton can adapt to a variety of deployable environments and remote use cases. Eciton demonstrates real-time processing at a very low power consumption with minimal loss of accuracy on multiple inference scenarios with differing characteristics, while achieving competitive power efficiency against the state-of-the-art of similar scale. We show that the addition of this accelerator actually reduces the power budget of the sensor node by reducing power-hungry wireless transmission. The resulting power budget of the sensor node is small enough to be powered by a power harvester, potentially allowing it to run indefinitely without a battery or periodic maintenance.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135371182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning models are becoming more complex and heterogeneous with new layer types to improve their accuracy. This brings a considerable challenge to the designers of accelerators of deep neural networks. There have been several architectures and design flows to map deep learning models on hardware, but they are limited to a particular model and/or layer types. Also, the architectures generated by these tools target, in general, high-performance devices, not appropriate for embedded computing. This paper proposes a multi-engine architecture and a design flow to implement deep learning models on FPGA. The hardware design uses high-level synthesis to allow design space exploration. The architecture is scalable and therefore applicable to any density FPGAs. The architecture and design flow were applied to the development of a hardware/software system for image classification with ResNet50, object detection with YOLOv3-Tiny and image segmentation with Deeplabv3+. The system was tested in a low-density Zynq UltraScale+ ZU3EG FPGA to show its scalability. The results show that the proposed multi-engine architecture generates efficient accelerators. An accelerator of ResNet50 with a 4-bit quantization achieves 67 FPS, and the object detector with YOLOv3-Tiny with a throughput of 36 FPS and the image segmentation application achieves 1.4 FPS.
{"title":"Designing Deep Learning Models on FPGA with Multiple Heterogeneous Engines","authors":"Miguel Reis, Mário Véstias, Horácio Neto","doi":"10.1145/3615870","DOIUrl":"https://doi.org/10.1145/3615870","url":null,"abstract":"Deep learning models are becoming more complex and heterogeneous with new layer types to improve their accuracy. This brings a considerable challenge to the designers of accelerators of deep neural networks. There have been several architectures and design flows to map deep learning models on hardware, but they are limited to a particular model and/or layer types. Also, the architectures generated by these tools target, in general, high-performance devices, not appropriate for embedded computing. This paper proposes a multi-engine architecture and a design flow to implement deep learning models on FPGA. The hardware design uses high-level synthesis to allow design space exploration. The architecture is scalable and therefore applicable to any density FPGAs. The architecture and design flow were applied to the development of a hardware/software system for image classification with ResNet50, object detection with YOLOv3-Tiny and image segmentation with Deeplabv3+. The system was tested in a low-density Zynq UltraScale+ ZU3EG FPGA to show its scalability. The results show that the proposed multi-engine architecture generates efficient accelerators. An accelerator of ResNet50 with a 4-bit quantization achieves 67 FPS, and the object detector with YOLOv3-Tiny with a throughput of 36 FPS and the image segmentation application achieves 1.4 FPS.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136353080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}