High-Efficiency Compressor Trees for Latest AMD FPGAs
Konstantin J. Hoßfeld, Hans Jakob Damsgaard, Jari Nurmi, Michaela Blott, Thomas B. Preußer
High-fan-in dot product computations are ubiquitous in highly relevant application domains such as signal processing and machine learning. In particular, the diverse set of data formats used in machine learning poses a challenge for flexible, efficient design solutions. Ideally, a dot product summation is composed of a carry-free compressor tree followed by a terminal carry-propagate addition. On FPGAs, these compressor trees are constructed from generalized parallel counters (GPCs) whose architecture is closely tied to the underlying reconfigurable fabric. This work reviews known counter designs and proposes new ones in the context of the new AMD Versal™ fabric. On this basis, we develop a compressor generator featuring variable-sized counters, novel counter composition heuristics, explicit clustering strategies, and case-specific optimizations such as logic gate absorption. In comparison to the Vivado™ default implementation, combining such a compressor with a novel, highly efficient quaternary adder reduces the LUT footprint across different bit matrix input shapes by 45% for a plain summation and by 46% for a terminal accumulation, at a slight cost in critical path delay that still allows operation well above 500 MHz. We demonstrate the aptness of our solution on examples of low-precision integer dot product accumulation units.
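As a rough illustration of the carry-save principle behind such compressor trees, the following Python sketch reduces a bit matrix with plain 3:2 counters (full adders) and finishes with a single carry-propagate addition. It is only a behavioural model under simplifying assumptions; the paper's generator uses fabric-specific GPCs, clustering heuristics, and gate absorption that are not captured here.

```python
# Behavioural sketch of a compressor tree: a bit matrix (one list of bits per
# column weight) is reduced with 3:2 counters until no column holds more than
# two bits; a terminal carry-propagate addition then produces the result.
# Illustration only -- not the paper's GPC-based generator.

def compress(columns):
    """One compression stage: apply 3:2 counters (full adders) column by column."""
    out = [[] for _ in range(len(columns) + 1)]
    for w, col in enumerate(columns):
        while len(col) >= 3:
            a, b, c = col.pop(), col.pop(), col.pop()
            s = a ^ b ^ c                      # sum bit, weight w
            cy = (a & b) | (a & c) | (b & c)   # carry bit, weight w + 1
            out[w].append(s)
            out[w + 1].append(cy)
        out[w].extend(col)                     # at most two leftover bits
    while out and not out[-1]:
        out.pop()                              # drop an unused trailing column
    return out

def compressor_tree_sum(columns):
    while any(len(col) > 2 for col in columns):
        columns = compress(columns)            # carry-free reduction stages
    # Terminal carry-propagate addition of the (at most) two remaining rows.
    return sum(sum(col) << w for w, col in enumerate(columns))

# Example: sum eight 4-bit operands via their bit matrix.
operands = [3, 7, 11, 2, 14, 9, 5, 12]
columns = [[(x >> w) & 1 for x in operands] for w in range(4)]
assert compressor_tree_sum(columns) == sum(operands)
```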
{"title":"High-Efficiency Compressor Trees for Latest AMD FPGAs","authors":"Konstantin J. Hoßfeld, Hans Jakob Damsgaard, Jari Nurmi, Michaela Blott, Thomas B. Preußer","doi":"10.1145/3645097","DOIUrl":"https://doi.org/10.1145/3645097","url":null,"abstract":"<p>High-fan-in dot product computations are ubiquitous in highly relevant application domains, such as signal processing and machine learning. Particularly, the diverse set of data formats used in machine learning poses a challenge for flexible efficient design solutions. Ideally, a dot product summation is composed from a carry-free compressor tree followed by a terminal carry-propagate addition. On FPGA, these compressor trees are constructed from generalized parallel counters (GPCs) whose architecture is closely tied to the underlying reconfigurable fabric. This work reviews known counter designs and proposes new ones in the context of the new AMD Versal™ fabric. On this basis, we develop a compressor generator featuring variable-sized counters, novel counter composition heuristics, explicit clustering strategies, and case-specific optimizations like logic gate absorption. In comparison to the Vivado™ default implementation, the combination of such a compressor with a novel, highly efficient quaternary adder reduces the LUT footprint across different bit matrix input shapes by 45% for a plain summation and by 46% for a terminal accumulation at a slight cost in critical path delay still allowing an operation well above 500 MHz. We demonstrate the aptness of our solution at examples of low-precision integer dot product accumulation units.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"34 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration
Yonggen Li, Xin Li, Haibin Shen, Jicong Fan, Yanfeng Xu, Kejie Huang
The Field-Programmable Gate Array (FPGA) is a versatile and programmable hardware platform, which makes it a promising candidate for accelerating Deep Neural Networks (DNNs). However, the computing energy efficiency of FPGAs is low because energy consumption is dominated by interconnect data movement. In this paper, we propose an all-digital Compute-In-Memory (CIM) FPGA architecture for deep learning acceleration. Furthermore, we present a bit-serial computing circuit for the digital CIM core that accelerates vector-matrix multiplication (VMM) operations. A Network-CIM-Deployer (NCIMD) is also developed to support automatic deployment and mapping of DNN networks; NCIMD provides a user-friendly API for DNN models in Caffe format. We further introduce a Weight-Stationary (WS) dataflow and describe how a single network layer is mapped to the CIM array in the architecture. We conduct experiments on the proposed FPGA architecture in Deep Learning (DL) as well as non-DL fields, using different architectural layouts and mapping strategies, and compare the results with a conventional FPGA architecture. The experimental results show that, compared to the conventional FPGA architecture, our CIM FPGA architecture improves energy efficiency by up to 16.1× and reduces latency by up to 40%.
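The sketch below models the bit-serial VMM arithmetic that such a digital CIM core performs: input bit-planes are streamed one per cycle against a stationary weight matrix, and the partial results are shift-accumulated. It is a behavioural illustration only; the array organization, dataflow, and precision choices of the proposed architecture are not modelled here.

```python
# Software model of bit-serial vector-matrix multiplication (VMM).
# Inputs are streamed one bit per "cycle" (LSB first); each cycle performs a
# binary VMM against the stationary weight matrix, and the results are
# shift-accumulated. Illustration only, not the paper's circuit.

import numpy as np

def bit_serial_vmm(x, W, n_bits=8):
    """Compute x @ W by streaming the bits of the unsigned input vector x."""
    acc = np.zeros(W.shape[1], dtype=np.int64)
    for b in range(n_bits):                    # one cycle per input bit
        x_bit = (x >> b) & 1                   # bit-plane of the input vector
        acc += (x_bit @ W) << b                # binary VMM, then shift-accumulate
    return acc

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=16)              # 8-bit unsigned activations
W = rng.integers(-8, 8, size=(16, 4))          # small signed weights (stationary)
assert np.array_equal(bit_serial_vmm(x, W), x @ W)
```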
{"title":"An All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration","authors":"Yonggen Li, Xin Li, Haibin Shen, Jicong Fan, Yanfeng Xu, Kejie Huang","doi":"10.1145/3640469","DOIUrl":"https://doi.org/10.1145/3640469","url":null,"abstract":"<p>Field Programmable Gate Array (FPGA) is a versatile and programmable hardware platform, which makes it a promising candidate for accelerating Deep Neural Networks (DNNs). However, FPGA’s computing energy efficiency is low due to the domination of energy consumption by interconnect data movement. In this paper, we propose an all-digital Compute-In-Memory FPGA architecture for deep learning acceleration. Furthermore, we present a bit-serial computing circuit of the Digital CIM core for accelerating vector-matrix multiplication (VMM) operations. A Network-CIM-Deployer (<i>NCIMD</i>) is also developed to support automatic deployment and mapping of DNN networks. <i>NCIMD</i> provides a user-friendly API of DNN models in Caffe format. Meanwhile, we introduce a Weight-Stationary (WS) dataflow and describe the method of mapping a single layer of the network to the CIM array in the architecture. We conduct experimental tests on the proposed FPGA architecture in the field of Deep Learning (DL), as well as in non-DL fields, using different architectural layouts and mapping strategies. We also compare the results with the conventional FPGA architecture. The experimental results show that compared to the conventional FPGA architecture, the energy efficiency can achieve a maximum speedup of 16.1 ×, while the latency can decrease up to (40% ) in our proposed CIM FPGA architecture.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"14 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139469745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the Impact of Using Multiple-Metal Layers on the Layout Area of Switch Blocks for Tile-Based FPGAs in FinFET 7nm
Sajjad Rostami Sani, Andy Ye
This work introduces a new area model for estimating the layout area of switch blocks. The model is based on a realistic layout strategy. As a result, it takes into consideration not only the active area needed to construct a switch block but also the number of available metal layers and the actual dimensions of these metals. The model assigns metal layers to the routing tracks in a way that reduces the number of vias needed to connect different routing tracks while maintaining the tile-based structure of FPGAs. It also accounts for the wiring area required for buffer insertion on long wire segments. The model is evaluated against layouts constructed in the ASAP7 FinFET 7nm Predictive Design Kit. We find that the new model, while specific to the layout strategy it employs, improves upon traditional active-area-based estimation models by considering the growth of the metal area independently from the growth of the active area. As a result, the new model more accurately estimates layout area by predicting when metal area overtakes active area as the number of routing tracks increases. This ability allows a more accurate estimation of the true layout cost of FPGA fabrics at the early floor-planning and architectural-exploration stage, and this increase in accuracy can encourage wider use of custom FPGA fabrics that target specific sets of benchmarks in future SoC designs. Furthermore, our data indicate that the conclusions drawn from several significant prior architectural studies remain correct under FinFET geometries and wiring-area considerations despite their exclusive use of active-only area models. This is due to the small channel widths, around 30-60 tracks per channel, of the architectures these studies investigate. For architectures that approach the channel width of modern commercial FPGAs, with over one to two hundred tracks per channel, our data show that wiring-area models justified by detailed layout considerations are an essential addition to active-area models for correctly predicting the implementation area of FPGAs.
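A toy calculation can illustrate the crossover such a model captures: if the active area of a switch block grows roughly linearly with channel width W (more switches), while the wiring area over the tile grows roughly quadratically (the tile edge must accommodate W tracks at a fixed metal pitch spread over the available layers), then wiring eventually dominates. All constants and scaling assumptions in the sketch below are placeholders chosen for illustration in arbitrary units; they are not taken from the paper's model.

```python
# Toy crossover illustration (placeholder constants, arbitrary units).
# Not the paper's area model.

def active_area(w_tracks, area_per_track=6.0):
    """Active (transistor) area, assumed ~linear in channel width."""
    return area_per_track * w_tracks

def wiring_area(w_tracks, metal_pitch=1.0, layers=4):
    """Wiring area over the tile: tracks spread over the layers set the tile edge."""
    tracks_per_layer = -(-w_tracks // layers)   # ceiling division
    side = tracks_per_layer * metal_pitch
    return side * side

for w in (30, 60, 120, 200):
    a, m = active_area(w), wiring_area(w)
    dominant = "wiring" if m > a else "active"
    print(f"W={w:3d}  active={a:7.1f}  wiring={m:7.1f}  dominant: {dominant}")
```

With these placeholder values, active area dominates at the 30-60 track channel widths of the prior studies, while wiring area dominates well before the 100-200 track widths of modern commercial fabrics.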
{"title":"Evaluating the Impact of Using Multiple-Metal Layers on the Layout Area of Switch Blocks for Tile-Based FPGAs in FinFET 7nm","authors":"Sajjad Rostami Sani, Andy Ye","doi":"10.1145/3639055","DOIUrl":"https://doi.org/10.1145/3639055","url":null,"abstract":"<p>A new area model for estimating the layout area of switch blocks is introduced in this work. The model is based on a realistic layout strategy. As a result, it not only takes into consideration the active area that is needed to construct a switch block but also the number of metal layers available and the actual dimensions of these metals. The model assigns metal layers to the routing tracks in a way that reduces the number of vias that are needed to connect different routing tracks together while maintaining the tile-based structure of FPGAs. It also takes into account the wiring area required for buffer insertion for long wire segments. The model is evaluated based on the layouts constructed in ASAP7 FinFET 7nm Predictive Design Kit. We found that the new model, while specific to the layout strategy that it employs, improves upon the traditional active-based area estimation models by considering the growth of the metal area independently from the growth of the active area. As a result, the new model is able to more accurately estimate layout area by predicting when metal area will overtake active area as the number of routing tracks is increased. This ability allows the more accurate estimation of the true layout cost of FPGA fabrics at the early floor planning and architectural exploration stage; and this increase in accuracy can encourage a wider use of custom FPGA fabrics that target specific sets of benchmarks in future SOC designs. Furthermore, our data indicate that the conclusions drawn from several significant prior architectural studies remain to be correct under FinFET geometries and wiring area considerations despite their exclusive use of active-only area models. This correctness is due to the small channel widths, around 30-60 tracks per channel, of the architectures that these studies investigate. For architectures that approach the channel width of modern commercial FPGAs with over one to two hundreds tracks per channel, our data show that wiring area models justified by detailed layout considerations are an essential addition to active area models in the correct prediction of the implementation area of FPGAs.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"4 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139082978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CSAIL2019 Crypto-Puzzle Solver Architecture
Sergey Gribok, Bogdan Pasca, Martin Langhammer
The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these puzzles requires large amounts of intrinsically sequential computation, with each iteration performing a very large (3072-bit for CSAIL2019) modular multiplication. The complexity of each iteration is several times greater than that of known FPGA implementations, and the number of iterations has been increased by about 1000x compared to LCS35. Because of the high complexity of this new puzzle, a number of intermediate, or milestone, versions of the puzzle have been specified. In this article, we present several FPGA architectures for a CSAIL2019 solver, which we implement on a medium-sized Intel Agilex device. We develop a new multi-cycle modular multiplication method that is flexible and can fit a wide variety of current FPGA sizes. We introduce a class of multi-cycle squarer-based architectures that allow better resource and area trade-offs. We also demonstrate a new approach for improving the fitting and timing closure of large, chip-filling arithmetic designs. We used the solver to compute the first 22 of the 28 milestone solutions of the puzzle, which are the first reported results for this problem.
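The sequential core of such Rivest-style time-lock puzzles is repeated modular squaring, computing 2^(2^t) mod n, where each squaring depends on the previous one and therefore cannot be parallelized. The sketch below shows this structure with a small toy modulus and illustrative milestone checkpoints; the real CSAIL2019 puzzle uses a 3072-bit modulus whose factorization is unknown and vastly more iterations, and its milestone definitions differ from this example.

```python
# Toy model of the time-lock puzzle's sequential squaring chain.
# Illustration only: tiny modulus, few iterations, made-up milestone spacing.

def timelock_solve(n, t, base=2, milestone_every=None):
    """Return base^(2^t) mod n via t sequential modular squarings."""
    w = base % n
    milestones = []
    for i in range(1, t + 1):
        w = (w * w) % n                           # one intrinsically sequential step
        if milestone_every and i % milestone_every == 0:
            milestones.append((i, w))             # intermediate "milestone" value
    return w, milestones

# Toy parameters: a small composite modulus and few iterations, so the result
# can be cross-checked directly with Python's built-in pow().
n = 1_000_003 * 1_000_033
t = 10_000
w, ms = timelock_solve(n, t, milestone_every=2_500)
assert w == pow(2, 2 ** t, n)    # direct check is only feasible because t is tiny
print(f"2^(2^{t}) mod n = {w}; milestones at iterations {[i for i, _ in ms]}")
```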
{"title":"CSAIL2019 Crypto-Puzzle Solver Architecture","authors":"Sergey Gribok, Bogdan Pasca, Martin Langhammer","doi":"10.1145/3639056","DOIUrl":"https://doi.org/10.1145/3639056","url":null,"abstract":"<p>The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these types of puzzles requires large amounts of intrinsically sequential computations, with each iteration performing a very large (3072-bit for CSAIL2019) modular multiplication operation. The complexity of each iteration is several times greater than known FPGA implementations, and the number of iterations has been increased by about 1000x compared to LCS35. Because of the high complexity of this new puzzle, a number of intermediate, or milestone versions of the puzzle have been specified. In this article, we present several FPGA architectures for the CSAIL2019 solver, which we implement on a medium-sized Intel Agilex device. We develop a new multi-cycle modular multiplication method, which is flexible and can fit on a wide variety of sizes of current FPGAs. We introduce a class of multi-cycle squarer-based architectures that allow for better resource and area trade-offs. We also demonstrate a new approach for improving the fitting and timing closure of large, chip-filling arithmetic designs. We used the solver to compute the first 22 out of the 28 milestone solutions of the puzzle, which are the first reported results for this problem.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139069353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Section on FPGA 2022","authors":"P. Ienne","doi":"10.1145/3618114","DOIUrl":"https://doi.org/10.1145/3618114","url":null,"abstract":"","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"15 6","pages":"1 - 2"},"PeriodicalIF":2.3,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139004736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AEKA: FPGA Implementation of Area-Efficient Karatsuba Accelerator for Ring-Binary-LWE-based Lightweight PQC
Tianyou Bao, Pengzhou He, Jiafeng Xie, H S. Jacinto
Lightweight PQC-related research and development have recently gained attention from the research community. The Ring-Binary-Learning-with-Errors (RBLWE)-based encryption scheme (RBLWE-ENC) is a promising lightweight PQC scheme that uses small parameter sets to fit related applications, but these parameters do not favor popular fast algorithms such as the number theoretic transform. To address this problem, we present in this paper a novel hardware acceleration of RBLWE-ENC based on the Karatsuba algorithm (KA), targeting the field-programmable gate array (FPGA) platform. In detail, we propose an area-efficient Karatsuba Accelerator (AEKA) for RBLWE-ENC, built on three layers of innovation. First, we reformulate the signal processing sequence within the major arithmetic component of the KA-based polynomial multiplication for RBLWE-ENC to obtain a new algorithm. Then, we design the proposed algorithm into a new hardware accelerator using several novel algorithm-to-architecture mapping techniques. Finally, we conduct a thorough complexity analysis and comparison to demonstrate the efficiency of the proposed accelerator: for example, it achieves 62.5% higher throughput and 60.2% lower area-delay product (ADP) than the state-of-the-art design for n = 512 (Virtex-7 device, similar setup). The proposed AEKA design strategy is highly efficient on FPGA devices, i.e., small resource usage with superior timing, and can be integrated with other systems for lightweight-oriented high-performance applications (e.g., servers). The outcome of this work is also expected to impact the advancement of lightweight PQC.
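For readers unfamiliar with the Karatsuba algorithm that AEKA builds on, the generic sketch below shows one level of the split for polynomial multiplication: a full product is assembled from three half-size products instead of four. The coefficient types, recursion depth, and test values are illustrative; the paper's contribution is the reformulated signal-processing sequence and its mapping to FPGA hardware for RBLWE-ENC's binary polynomials, which this sketch does not model.

```python
# Generic Karatsuba polynomial multiplication (coefficient lists, lowest degree
# first). Illustration of the algorithmic idea only, not the AEKA datapath.

def poly_mul_schoolbook(a, b):
    res = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            res[i + j] += ai * bj
    return res

def poly_add(a, b):
    return [x + y for x, y in zip(a, b)]

def poly_mul_karatsuba(a, b):
    n = len(a)                       # assumes len(a) == len(b), n a power of two
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    p0 = poly_mul_karatsuba(a0, b0)                              # low  x low
    p2 = poly_mul_karatsuba(a1, b1)                              # high x high
    pm = poly_mul_karatsuba(poly_add(a0, a1), poly_add(b0, b1))  # (a0+a1)(b0+b1)
    res = [0] * (2 * n - 1)
    for i in range(2 * h - 1):
        res[i] += p0[i]
        res[i + h] += pm[i] - p0[i] - p2[i]                      # middle term
        res[i + 2 * h] += p2[i]
    return res

import random
random.seed(1)
deg = 8
a = [random.randrange(2) for _ in range(deg)]     # binary polynomial (RBLWE-style)
b = [random.randrange(256) for _ in range(deg)]   # general coefficients
assert poly_mul_karatsuba(a, b) == poly_mul_schoolbook(a, b)
```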
{"title":"AEKA: FPGA Implementation of Area-Efficient Karatsuba Accelerator for Ring-Binary-LWE-based Lightweight PQC","authors":"Tianyou Bao, Pengzhou He, Jiafeng Xie, H S. Jacinto","doi":"10.1145/3637215","DOIUrl":"https://doi.org/10.1145/3637215","url":null,"abstract":"<p>Lightweight PQC-related research and development have gradually gained attention from the research community recently. Ring-Binary-Learning-with-Errors (RBLWE)-based encryption scheme (RBLWE-ENC), a promising lightweight PQC based on small parameter sets to fit related applications (but not in favor of deploying popular fast algorithms like number theoretic transform). To solve this problem, in this paper, we present a novel implementation of hardware acceleration for RBLWE-ENC based on Karatsuba algorithm, particularly on the field-programmable gate array (FPGA) platform. In detail, we have proposed an area-efficient Karatsuba Accelerator (AEKA) for RBLWE-ENC, based on three layers of innovative efforts. First of all, we reformulate the signal processing sequence within the major arithmetic component of the KA-based polynomial multiplication for RBLWE-ENC to obtain a new algorithm. Then, we have designed the proposed algorithm into a new hardware accelerator with several novel algorithm-to-architecture mapping techniques. Finally, we have conducted thorough complexity analysis and comparison to demonstrate the efficiency of the proposed accelerator, e.g., it involves 62.5% higher throughput and 60.2% less area-delay product (ADP) than the state-of-the-art design for <i>n</i> = 512 (Virtex-7 device, similar setup). The proposed AEKA design strategy is highly efficient on the FPGA devices, i.e., small resource usage with superior timing, which can be integrated with other necessary systems for lightweight-oriented high-performance applications (e.g., servers). The outcome of this work is also expected to generate impacts for lightweight PQC advancement.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"12 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138566009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design, Calibration, and Evaluation of Real-Time Waveform Matching on an FPGA-based Digitizer at 10 GS/s
Jens Trautmann, Paul Krüger, Andreas Becher, Stefan Wildermann, Jürgen Teich
Digitizing side-channel signals at high sampling rates produces huge amounts of data, while side-channel analysis techniques only need the specific trace segments containing cryptographic operations (COs). For detecting these segments, waveform-matching techniques have been established that compare the signal against a template of the CO's characteristic pattern. Real-time waveform matching requires both the high parallelism achievable with hardware designs and the reconfigurability provided by FPGAs to adapt the matching hardware to a specific CO pattern. However, currently proposed designs process the samples from analog-to-digital converters sequentially and can only handle low sampling rates due to the limited clock speed of FPGAs.
In this paper, we present a parallel waveform-matching architecture capable of performing high-speed waveform matching on a high-end FPGA-based digitizer. We also present a workflow for calibrating the waveform-matching system to the specific pattern of the CO under the restrictions imposed by the FPGA hardware. Our implementation enables waveform matching at 10 GS/s, offering a speedup of 50x over the fastest state-of-the-art implementation known to us. We demonstrate how to apply the technique to attack the widespread XTS-AES algorithm, using waveform matching to recover the encrypted tweak even in the presence of so-called systemic noise.
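For context, the sketch below is a minimal offline model of template-based waveform matching: a template of the CO's characteristic pattern is slid across the digitized trace, and positions where a similarity metric exceeds a threshold are flagged. The metric (normalized cross-correlation), the threshold, and the synthetic trace are illustrative choices; the paper's contribution is a massively parallel streaming realization of such matching at 10 GS/s, which is not modelled here.

```python
# Offline model of template matching on a digitized trace.
# Illustration only -- not the paper's streaming hardware architecture.

import numpy as np

def match_positions(trace, template, threshold=0.7):
    """Return indices where the normalized cross-correlation exceeds threshold."""
    t = (template - template.mean()) / (template.std() + 1e-12)
    windows = np.lib.stride_tricks.sliding_window_view(trace, len(template))
    w_mean = windows.mean(axis=1, keepdims=True)
    w_std = windows.std(axis=1, keepdims=True) + 1e-12
    ncc = ((windows - w_mean) / w_std * t).mean(axis=1)   # Pearson correlation
    return np.flatnonzero(ncc > threshold)

# Synthetic example: bury two noisy copies of a pattern in a longer trace.
rng = np.random.default_rng(0)
template = np.sin(np.linspace(0, 4 * np.pi, 64))
trace = rng.normal(0, 0.3, 2048)
for pos in (300, 1500):
    trace[pos:pos + 64] += template
print(match_positions(trace, template))   # indices clustered around 300 and 1500
```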
{"title":"Design, Calibration, and Evaluation of Real-Time Waveform Matching on an FPGA-based Digitizer at 10 GS/s","authors":"Jens Trautmann, Paul Krüger, Andreas Becher, Stefan Wildermann, Jürgen Teich","doi":"10.1145/3635719","DOIUrl":"https://doi.org/10.1145/3635719","url":null,"abstract":"<p>Digitizing side-channel signals at high sampling rates produces huge amounts of data, while side-channel analysis techniques only need those specific trace segments containing Cryptographic Operations (COs). For detecting these segments, waveform-matching techniques have been established comparing the signal with a template of the CO’s characteristic pattern. Real-time waveform matching requires highly parallel implementations as achieved by hardware design but also reconfigurability as provided by FPGAs to adapt the matching hardware to a specific CO pattern. However, currently proposed designs process the samples from analog-to-digital converters sequentially and can only process low sampling rates due to the limited clock speed of FPGAs. </p><p>In this paper, we present a parallel waveform-matching architecture capable of performing high-speed waveform matching on a high-end FPGA-based digitizer. We also present a workflow for calibrating the waveform-matching system to the specific pattern of the CO in the presence of hardware restrictions provided by the FPGA hardware. Our implementation enables waveform matching at 10 GS/s, offering a speedup of 50x compared to the fastest state-of-the-art implementation known to us. We demonstrate how to apply the technique for attacking the widespread XTS-AES algorithm using waveform matching to recover the encrypted tweak even in the presence of so-called systemic noise.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"5123 1 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Section on FCCM 2022","authors":"Jing Li, Martin Herbordt","doi":"10.1145/3632092","DOIUrl":"https://doi.org/10.1145/3632092","url":null,"abstract":"<p>No abstract available.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"29 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A hardware design framework for computer vision models based on reconfigurable devices
Zimeng Fan, Wei Hu, Fang Liu, Dian Xu, Hong Guo, Yanxiang He, Min Peng
In computer vision, algorithms and the computing platforms that run them develop jointly and cannot be treated separately: models and algorithms are constantly evolving, while hardware designs must adapt to new or updated algorithms. Reconfigurable devices are recognized as important platforms for computer vision applications because of their reconfigurability. There are two typical design approaches, customized design and overlay design, but existing work is unable to achieve both efficient performance and the scalability to adapt to a wide range of models. To address both considerations, we propose a design framework based on reconfigurable devices that provides unified support for computer vision models. It provides software-programmable modules while leaving unit design space for problem-specific algorithms. Based on the proposed framework, we design a model mapping method and a hardware architecture with two processor arrays to enable dynamic and static reconfiguration, thereby relieving redesign pressure. In addition, resource consumption and efficiency can be balanced by adjusting a hyperparameter. In experiments on CNN, vision Transformer, and vision MLP models, our work's throughput is improved by 18.8x-33.6x and 1.4x-2.0x compared to a CPU and a GPU, respectively. Compared to other accelerators on the same platform, accelerators based on our framework better balance resource consumption and efficiency.
{"title":"A hardware design framework for computer vision models based on reconfigurable devices","authors":"Zimeng Fan, Wei Hu, Fang Liu, Dian Xu, Hong Guo, Yanxiang He, Min Peng","doi":"10.1145/3635157","DOIUrl":"https://doi.org/10.1145/3635157","url":null,"abstract":"<p>In computer vision, the joint development of the algorithm and computing dimensions cannot be separated. Models and algorithms are constantly evolving, while hardware designs must adapt to new or updated algorithms. Reconfigurable devices are recognized as important platforms for computer vision applications because of their reconfigurability. There are two typical design approaches: customized and overlay design. However, existing work is unable to achieve both efficient performance and scalability to adapt to a wide range of models. To address both considerations, we propose a design framework based on reconfigurable devices to provide unified support for computer vision models. It provides software-programmable modules while leaving unit design space for problem-specific algorithms. Based on the proposed framework, we design a model mapping method and a hardware architecture with two processor arrays to enable dynamic and static reconfiguration, thereby relieving redesign pressure. In addition, resource consumption and efficiency can be balanced by adjusting the hyperparameter. In experiments on CNN, vision Transformer, and vision MLP models, our work’s throughput is improved by 18.8x–33.6x and 1.4x–2.0x compared to CPU and GPU. Compared to others on the same platform, accelerators based on our framework can better balance resource consumption and efficiency.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"192 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design
Anupreetham Anupreetham, Mohamed Ibrahim, Mathew Hall, Andrew Boutros, Ajay Kuzhively, Abinash Mohanty, Eriko Nurvitadhi, Vaughn Betz, Yu Cao, Jae-sun Seo
Object detection and classification are key tasks in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and the associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3× higher throughput and 5× lower latency compared to the best prior FPGA-based solution with comparable accuracy.
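For reference, the sketch below shows conventional batch NMS, the baseline behaviour that creates the latency problem described above: it must collect all boxes and scores, sort by score, and then suppress overlaps, which is exactly the sequential dependency the proposed pipelined NMS removes. This is not the paper's algorithm, only the standard variant it improves upon, with illustrative boxes and threshold.

```python
# Conventional (batch) greedy NMS -- the baseline with the wait-for-all-boxes
# dependency. Illustration only, not the paper's pipelined algorithm.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def batch_nms(boxes, scores, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:                              # needs *all* boxes before it can start
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
print(batch_nms(boxes, scores))   # -> [0, 2]; box 1 is suppressed by box 0
```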
{"title":"High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design","authors":"Anupreetham Anupreetham, Mohamed Ibrahim, Mathew Hall, Andrew Boutros, Ajay Kuzhively, Abinash Mohanty, Eriko Nurvitadhi, Vaughn Betz, Yu Cao, Jae-sun Seo","doi":"10.1145/3634919","DOIUrl":"https://doi.org/10.1145/3634919","url":null,"abstract":"<p>Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3 × higher throughput and 5 × lower latency compared to the best prior FPGA-based solution with comparable accuracy.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"8 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}