Parallel-Pipeline Fast Walsh-Hadamard Transform Implementation Using HLS
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609874
A. M. García, C. O. Quero, J. Rangel-Magdaleno, J. Martínez-Carranza, D. D. Romero
The Walsh-Hadamard Transform (WHT) is an orthogonal, symmetric, involutional, and linear operation used in data encryption, data compression, and quantum computing. The WHT belongs to a generalized class of Fourier transforms, which means that many algorithms developed for the fast Fourier transform (FFT) also work for fast WHT implementations (FWHT). This paper exploits this property and uses a well-known parallel-pipeline FFT strategy for VLSI implementation to build parallel-pipeline FWHT architectures. We apply the FFT parallel-pipeline approach to a fast WHT and use the High-Level Synthesis (HLS) tool from Xilinx Vitis to generate an FPGA solution. We also provide open-source code with the basic blocks needed to build a model with any parallelization level. The proposed parallel-pipeline solutions achieve a latency reduction of up to 3.57% compared to a pipelined approach on a 256-point signal using 32-bit floating-point numbers.
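As a hedged illustration of the structural similarity to the FFT, the sketch below implements an iterative radix-2 fast WHT in Python: it has the same butterfly dataflow as a radix-2 FFT, but every "twiddle factor" is +1 or -1, which is why FFT parallel-pipeline architectures carry over. The function name and the 256-point example are illustrative, not taken from the paper's HLS code.

```python
import numpy as np

def fwht(x):
    """Iterative radix-2 fast Walsh-Hadamard transform (natural order).

    Same butterfly structure as a radix-2 FFT, but every 'twiddle factor'
    is +1 or -1, so only additions and subtractions are needed.
    """
    x = np.asarray(x, dtype=np.float32).copy()
    n = x.size
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b   # butterfly: sum and difference
        h *= 2
    return x

# 256-point example with 32-bit floats, matching the signal length in the paper.
signal = np.random.rand(256).astype(np.float32)
spectrum = fwht(signal)
# The WHT is involutional up to a scale factor of n: fwht(fwht(x)) == n * x.
# Tolerances are loose because of float32 rounding across the 8 stages.
assert np.allclose(fwht(spectrum), 256 * signal, rtol=1e-3, atol=1e-2)
```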
{"title":"Parallel-Pipeline Fast Walsh-Hadamard Transform Implementation Using HLS","authors":"A. M. García, C. O. Quero, J. Rangel-Magdaleno, J. Martínez-Carranza, D. D. Romero","doi":"10.1109/ICFPT52863.2021.9609874","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609874","url":null,"abstract":"Walsh Hadamard Transform (WHT) is an orthogonal, symmetric, involutional, and linear operation used in data encryption, data compression, and quantum computing. The WHT belongs to a generalized class of Fourier transforms, which allows that many algorithms developed for the fast Fourier transform (FFT) work for fast WHT implementations (FWHT). This paper employs this property and uses a parallel-pipeline FFT well-known strategy for VLSI implementation to build parallel-pipeline architectures for FWHT. We apply the FFT parallel-pipeline approach on a Fast WHT and use the High-Level Synthesis (HLS) tool from Xilinx Vitis to generate an FPGA solution. We also provide an open-source code with the basic blocks to build any model with any parallelization level. The parallel-pipeline proposed solutions achieve a latency reduction of up to 3.57% compared to a pipeline approach on a 256-long signal using 32 bit floating-point numbers.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127612743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StateLink: FPGA System Debugging via Flexible Simulation/Hardware Integration
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609846
Sameh Attia, Vaughn Betz
Checkpoint-based debugging flows that allow moving the design state between an FPGA and a simulator have recently emerged. These flows combine the speed of hardware execution with the full observability and controllability of HDL simulation. However, they assume the entire system state can be moved to a simulator, limiting them to self-contained systems and precluding their use in network- or CPU-attached FPGAs. In this paper, we present StateLink, a co-simulation framework that allows a design-under-test (DUT) running in a simulator to interact with other design elements that reside in hardware. StateLink creates links between DUT interfaces in the HDL simulation and their equivalents in hardware, thereby allowing the DUT to remain connected to and active in the overall hardware system after its state is moved to a simulator. This extends the functionality of checkpoint-based debugging frameworks to designs with external I/Os such as DRAM and Ethernet, and to designs that contain components with no simulation models. It also significantly decreases the simulation time of DUTs that are part of a large system. For example, it speeds up the HDL simulation of designs that interface with DRAM by up to 25×. Incorporating StateLink in a design typically adds no timing overhead and only a modest hardware area overhead; for example, StateLink adds 916 LUTs to a 32-bit AXI memory-mapped interface and 1423 LUTs to a 32-bit AXI streaming interface.
{"title":"StateLink: FPGA System Debugging via Flexible Simulation/Hardware Integration","authors":"Sameh Attia, Vaughn Betz","doi":"10.1109/ICFPT52863.2021.9609846","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609846","url":null,"abstract":"Checkpoint-based debugging flows that allow moving the design state between an FPGA and a simulator have recently emerged. These flows combine the speed of hardware execution and the full observability and controllability of HDL simulation. However, they assume the entire system state can be moved to a simulator, limiting them to self-contained systems and precluding their use in network or CPU-attached FPGAs. In this paper, we present StateLink, a co-simulation framework that allows a design-under-test (DUT) running in a simulator to interact with other design elements that reside in hardware. StateLink creates links between DUT interfaces in the HDL simulation and their equivalents in hardware, thereby allowing the DUT to remain connected to and active in the overall hardware system after its state is moved to a simulator. This extends the functionality of checkpoint-based debugging frameworks to designs with external I/Os such as DRAM and Ethernet, and to designs that contain components with no simulation models. It also significantly decreases the simulation time of DUTs that are part of a large system. For example, it speeds up the HDL simulation of designs that interface with DRAM by up to 25 ×. Incorporating StateLink in a design typically adds no timing overhead and a modest hardware area overhead; for example, StateLink adds 916 LUTs to a 32-bit AXI memory-mapped and 1423 LUTs to a 32-bit AXI streaming interface.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"473 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131400079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterization of IOBUF-based Ring Oscillators
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609950
J. Burgiel, Daniel E. Holcomb, Ilias Giechaskiel, Shanquan Tian, Jakub Szefer
Ring oscillators (ROs) are fundamental primitives that are used as building blocks in many other types of circuits. This paper presents an in-depth characterization of ring oscillators that leverage the IOBUF primitive found in modern Xilinx FPGAs. This work first analyzes the impact of the drive strength and slew rate attributes of the IOBUFs on the ROs, and then characterizes the impact of external temperature, internal voltage, and external voltage fluctuations on the frequency of the proposed ROs. This work further demonstrates that IOBUF-based ROs can detect whether the electrical connections to the IOBUF pins have changed, including whether a DRAM module has been physically removed. Finally, the proposed ROs can be realized on cloud FPGAs, bypassing the restrictions that some cloud providers impose on combinatorial loops, and thus presenting a new security threat to remote FPGAs.
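A minimal sketch of the measurement idea follows, assuming the hardware exposes a free-running counter clocked by the RO that is sampled over a fixed gate interval; the function names, gate time, and detection threshold are illustrative assumptions, not the paper's interface.

```python
def ro_frequency_hz(counter_delta, gate_time_s):
    """Estimate RO frequency from the number of oscillations counted
    during a fixed measurement window."""
    return counter_delta / gate_time_s

def connection_changed(baseline_hz, measured_hz, threshold=0.05):
    """Flag a change in the electrical load on the IOBUF pin (for example,
    a DRAM module being removed) when the RO frequency shifts by more than
    a relative threshold from its calibrated baseline."""
    return abs(measured_hz - baseline_hz) / baseline_hz > threshold

# Example: a counter that advanced by 1_500_000 during a 10 ms gate
# corresponds to a 150 MHz ring oscillator; against a 140 MHz baseline,
# the ~7% shift exceeds the 5% threshold and is flagged.
f = ro_frequency_hz(1_500_000, 10e-3)
print(f, connection_changed(140e6, f))
```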
{"title":"Characterization of IOBUF-based Ring Oscillators","authors":"J. Burgiel, Daniel E. Holcomb, Ilias Giechaskiel, Shanquan Tian, Jakub Szefer","doi":"10.1109/ICFPT52863.2021.9609950","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609950","url":null,"abstract":"Ring Oscillators (ROs) are fundamental primitives that are used as building blocks in many other types of circuits. This paper presents an in-depth characterization of ring oscillators which leverage the IOBUF primitive found in modern Xilinx FPGAs. This work first analyzes the impact of the drive strength and slew rate attributes of the IOBUFs on the ROs, and also characterizes the impacts of external temperature, internal voltage, and external voltage fluctuations on the frequency of the proposed ROs. This work further demonstrates that IOBUF-based ROs can detect whether electrical connections to the IOBUF pins have changed, including whether the DRAM module has been physically removed. Finally, the proposed ROs can be realized on cloud FPGAs, bypassing the restrictions that some cloud providers impose on combinatorial loops, and thus presenting a new security threat to remote FPGAs.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113970725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StreamZip: Compressed Sliding-Windows for Stream Aggregation
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609952
Prajith Ramakrishnan Geethakumari, I. Sourdis
High-performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a sliding window before processing in case the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck because they put tremendous pressure on memory bandwidth and capacity. This paper addresses this problem by introducing StreamZip, a dataflow stream aggregation engine that compresses its sliding windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor into the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms, offering both lossless and lossy compression for integers as well as floating-point numbers. Compared to designs without compression, the StreamZip lossless and lossy designs achieve up to 7× and 22× higher throughput, while improving the effective memory capacity by up to 5× and 23×, respectively.
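The abstract does not give StreamZip's compression formats, so the sketch below only illustrates the general idea under a simple assumption: integer values entering a count-based sliding window are stored delta-encoded (a basic lossless scheme) and decoded on the fly when the window is read to feed an aggregation function. Names such as CompressedSlidingWindow are hypothetical.

```python
from collections import deque

class CompressedSlidingWindow:
    """Count-based sliding window that stores integer values delta-encoded.

    Deltas between consecutive values are usually small, so they compress
    well; the window is decoded only when an aggregation is requested.
    """
    def __init__(self, size):
        self.size = size
        self.first = None          # oldest value, stored in full
        self.deltas = deque()      # deltas between consecutive stored values
        self.last = None           # most recent value, for O(1) appends

    def insert(self, value):
        if self.first is None:
            self.first = self.last = value
            return
        self.deltas.append(value - self.last)
        self.last = value
        if 1 + len(self.deltas) > self.size:      # evict the oldest value
            self.first += self.deltas.popleft()

    def values(self):
        out, acc = [self.first], self.first
        for d in self.deltas:
            acc += d
            out.append(acc)
        return out

    def aggregate(self, fn):
        return fn(self.values())

w = CompressedSlidingWindow(size=4)
for v in [10, 12, 11, 15, 18, 20]:
    w.insert(v)
print(w.values(), w.aggregate(max))   # [11, 15, 18, 20] 20
```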
{"title":"StreamZip: Compressed Sliding-Windows for Stream Aggregation","authors":"Prajith Ramakrishnan Geethakumari, I. Sourdis","doi":"10.1109/ICFPT52863.2021.9609952","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609952","url":null,"abstract":"High performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a sliding-window before processing, in case the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck as they put tremendous pressure on the memory bandwidth and capacity. This paper addresses this problem by introducing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding-windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms offering both lossless and lossy compression to integers as well as floating point numbers. Compared to designs without compression, StreamZip lossless and lossy designs achieve up to 7× and 22× higher throughput, while improving the effective memory capacity by up to 5× and 23×, respectively.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"17 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113977384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A dataset generation for object recognition and a tool for generating ROS2 FPGA node
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609880
Hayato Amano, Hayato Mori, Akinobu Mizutani, Tomohiro Ono, Yuma Yoshimoto, Takeshi Ohkawa, H. Tamukoh
This paper introduces our autonomous driving system, which is equipped with camera-image recognition processing units for hazardous object / human-doll detection and driving-lane detection. In particular, this paper focuses on a dataset generation method for neural networks and a generation tool, the "FPGA Oriented Easy Synthesizer Tool (FOrEST)", for ROS2-FPGA nodes. The results show that the mAP of a neural network trained on the generated dataset is 94%, and the ROS2-FPGA communication overhead introduced by FOrEST is 2–3 ms.
{"title":"A dataset generation for object recognition and a tool for generating ROS2 FPGA node","authors":"Hayato Amano, Hayato Mori, Akinobu Mizutani, Tomohiro Ono, Yuma Yoshimoto, Takeshi Ohkawa, H. Tamukoh","doi":"10.1109/ICFPT52863.2021.9609880","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609880","url":null,"abstract":"This paper introduces our autonomous driving system equipped with recognition processing units from a camera image for hazard object / human-doll detection and drive lane detection. In particular, this paper focuses on a dataset generation method for neural networks and a generation tool “FPGA Oriented Easy Synthesizer Tool (FOrEST)” for ROS2-FPGA nodes. The results show that mAP of a neural network trained by the generated dataset is 94%, and a overhead of ROS2-FPGA communication by the FOrEST is 2–3 ms.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134160428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A High-Precision Flexible Symmetry-Aware Architecture for Element-Wise Activation Functions
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609865
Xuan Feng, Yue Li, Yu Qian, Jingbo Gao, Wei Cao, Lingli Wang
Nonlinear activation functions (NAFs) play an essential role in deep neural networks (DNNs). Since versatile DNN accelerators need to support various DNNs that contain different NAFs, flexible hardware designs supporting those NAFs have become crucial. However, there are few high-precision flexible hardware architectures, and the symmetries of different NAFs have not been fully studied. This paper proposes a high-precision symmetry-aware architecture based on piecewise linear approximation. Through a reconfigurable data path, the architecture can support various typical NAFs. An efficient non-uniform segmentation scheme is proposed to achieve high precision for each NAF. Besides, exploiting the unified symmetry of NAFs saves half the memory. To reduce the computational cost, a 25×18 DSP is shared by two INT 7×9 multipliers with two independent inputs. The architecture is implemented on a Xilinx ZC706 at a frequency of 410 MHz. Compared with the state-of-the-art flexible nonlinear core, our flexible architecture costs fewer hardware resources while achieving higher precision. Applying the design to BERT-BASE, MobileNetV3, and EfficientNet-B3 on the PyTorch platform, experimental results show that the accuracy loss is 0 for BERT-BASE and 0.002% for EfficientNet-B3, while for MobileNetV3 the accuracy is even improved by 0.01%.
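As a hedged, software-level sketch of the two ideas the abstract highlights, the code below approximates tanh with piecewise-linear segments over non-uniform breakpoints (denser where curvature is high) and stores coefficients only for x >= 0, reconstructing negative inputs from the odd symmetry tanh(-x) = -tanh(x). The breakpoint positions and segment count are illustrative, not the paper's segmentation scheme.

```python
import numpy as np

# Non-uniform breakpoints for x >= 0: denser near 0 where tanh curves most.
BREAKS = np.array([0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0])
# Per-segment slope/intercept of the chord through each segment's end points.
Y = np.tanh(BREAKS)
SLOPES = np.diff(Y) / np.diff(BREAKS)
INTERCEPTS = Y[:-1] - SLOPES * BREAKS[:-1]

def pwl_tanh(x):
    """Piecewise-linear tanh using only the x >= 0 half of the table."""
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    ax = np.abs(x)                       # odd symmetry: evaluate on |x|
    seg = np.clip(np.searchsorted(BREAKS, ax, side="right") - 1,
                  0, len(SLOPES) - 1)
    y = SLOPES[seg] * ax + INTERCEPTS[seg]
    y = np.where(ax >= BREAKS[-1], np.tanh(BREAKS[-1]), y)   # saturate
    return sign * y

xs = np.linspace(-5, 5, 1001)
print(np.max(np.abs(pwl_tanh(xs) - np.tanh(xs))))   # max absolute error
```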
{"title":"A High-Precision Flexible Symmetry-Aware Architecture for Element-Wise Activation Functions","authors":"Xuan Feng, Yue Li, Yu Qian, Jingbo Gao, Wei Cao, Lingli Wang","doi":"10.1109/ICFPT52863.2021.9609865","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609865","url":null,"abstract":"Nonlinear activation functions (NAFs) play an essential role in deep neural networks (DNNs). Since versatile DNN accelerators need to support various DNNs which contain different NAFs, the flexible hardware design supporting those NAFs has become crucial. However, there are few high-precision flexible hardware architectures, and the symmetries of different NAFs have not been fully studied. This paper proposes a high-precision symmetry-aware architecture based on piecewise linear approximation. Through the reconfigurable data path, the architecture can support various typical NAFs. The efficient non-uniform segmentation scheme is proposed to achieve high precision for each NAF. Besides, the utilization of unified symmetry for NAFs can save half the memory. To reduce the computational cost, a 25×18 DSP is shared by two INT 7×9 multipliers with two independent inputs. The architecture is implemented on Xilinx ZC706 at a frequency of 410MHz. Compared with the state-of-the-art flexible nonlinear core, our flexible architecture costs fewer hardware resources with higher precision. Applying the design to BERT-BASE, MobileNetV3, and EfficientNet-B3 on the PyTorch platform, experimental results show that the accuracy loss is either 0 for BERT-BASE, or 0.002% for EfficientNet-B3. For MobileNetV3, the accuracy is even improved by 0.01%.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123892828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient RTL buffering scheme for an FPGA-accelerated simulation of diffuse radiative transfer
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609944
Kazuki Furukawa, Ryohei Kobayashi, Tomoya Yokono, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, M. Umemura
This paper proposes an efficient buffering approach for implementing radiative transfer equations that bridges the performance gap between processing elements and HBM memory bandwidth. The radiative transfer equation originates from a fundamental physical process in astrophysics and has attracted much attention in recent years because of a wealth of applications such as medical bioimaging. However, accelerating it requires a complicated, low-latency memory access pattern, and earlier studies reveal that conventional, software-controlled memory access is ill-suited to this computation. This article therefore targets an HBM-equipped FPGA and proposes an application-specific buffering mechanism called PRISM (PRefetchable and Instantly accessible Scratchpad Memory) to efficiently bridge the computational units and the HBM. The proposed approach is evaluated on a Xilinx Alveo U280 FPGA, and the experimental results are discussed.
{"title":"An efficient RTL buffering scheme for an FPGA-accelerated simulation of diffuse radiative transfer","authors":"Kazuki Furukawa, Ryohei Kobayashi, Tomoya Yokono, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, M. Umemura","doi":"10.1109/ICFPT52863.2021.9609944","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609944","url":null,"abstract":"This paper proposes the efficient buffering approach for implementing radiative transfer equations to bridge the performance gap between processing elements and HBM memory bandwidth. The radiation transfer equation originally focuses on the fundamental physics process in astrophysics. Besides, it has become the focus of a lot of attention in recent years because of the wealth of applications such as medical bioimaging. However, the acceleration requires a complicated memory access pattern with low latency, and the earlier studies unveil conventional memory access based on software control has no aptitude for this computation. Thus, this article introduced an HBM FPGA and proposed an application-specific buffering mechanism called PRISM (PRefetchable and Instantly accessible Scratchpad Memory) to efficiently bridge the computational unit and the HBM. The proposed approach was evaluated on a XILINX Alveo U280 FPGA, and the experimental results are also discussed.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124449416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-Performance Hardware Implementation of CRYSTALS-Dilithium
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609917
Luke Beckwith, D. Nguyen, K. Gaj
Many currently deployed public-key cryptosystems are based on the difficulty of the discrete logarithm and integer factorization problems. However, given an adequately sized quantum computer, these problems can be solved in polynomial time as a function of the key size. Due to this future threat to current cryptographic standards, alternative algorithms that remain secure against quantum computers are being evaluated for future use. One such algorithm is CRYSTALS-Dilithium, a lattice-based digital signature scheme and a finalist in the NIST Post-Quantum Cryptography (PQC) competition. As part of this evaluation, high-performance implementations of these algorithms must be investigated. This work presents a high-performance implementation of CRYSTALS-Dilithium targeting FPGAs. In particular, we present a design that achieves the best latency for an FPGA implementation to date. We also compare our results with the most relevant previous work on hardware implementations of NIST Round 3 post-quantum digital signature candidates.
{"title":"High-Performance Hardware Implementation of CRYSTALS-Dilithium","authors":"Luke Beckwith, D. Nguyen, K. Gaj","doi":"10.1109/ICFPT52863.2021.9609917","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609917","url":null,"abstract":"Many currently deployed public-key cryptosystems are based on the difficulty of the discrete logarithm and integer factorization problems. However, given an adequately sized quantum computer, these problems can be solved in polynomial time as a function of the key size. Due to the future threat of quantum computing to current cryptographic standards, alternative algorithms that remain secure under quantum computing are being evaluated for future use. One such algorithm is CRYSTALS-Dilithium, a lattice-based digital signature scheme, which is a finalist in the NIST Post Quantum Cryptography (PQC) competition. As a part of this evaluation, high-performance implementations of these algorithms must be investigated. This work presents a high-performance implementation of CRYSTALS-Dilithium targeting FPGAs. In particular, we present a design that achieves the best latency for an FPGA implementation to date. We also compare our results with the most-relevant previous work on hardware implementations of NIST Round 3 post-quantum digital signature candidates.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122657831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StreamSVD: Low-rank Approximation and Streaming Accelerator Co-design
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609813
Zhewen Yu, C. Bouganis
The post-training compression of a Convolutional Neural Network (CNN) aims to produce Pareto-optimal designs on the accuracy-performance frontier when access to training data is not possible. Low-rank approximation is one of the methods often utilised in such cases. However, existing work considers the low-rank approximation of the network and the optimisation of the hardware accelerator separately, leading to systems with sub-optimal performance. This work focuses on the efficient mapping of a CNN onto an FPGA device and presents StreamSVD, a model-accelerator co-design framework. The framework simultaneously considers the compression of a CNN model through a hardware-aware low-rank approximation scheme and the optimisation of the hardware accelerator's architecture by taking into account the approximation scheme's compute structure. Our results show that the co-designed StreamSVD outperforms existing work that utilises similar low-rank approximation schemes by providing a better accuracy-throughput trade-off. The proposed framework also achieves competitive performance compared with other post-training compression methods, even outperforming them in certain cases.
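The abstract does not spell out the exact decomposition StreamSVD applies, so the sketch below only shows the generic post-training low-rank step such schemes build on: a weight matrix (for example, a flattened convolutional kernel) is factorised with an SVD, truncated to rank r, and replaced by two smaller matrices, trading approximation error for fewer parameters and multiply-accumulates. The layer shape, rank, and function name are illustrative.

```python
import numpy as np

def low_rank_factorise(W, rank):
    """Truncated-SVD factorisation W ~= A @ B with A (m x r) and B (r x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)   # hypothetical layer
A, B = low_rank_factorise(W, rank=64)

orig_params = W.size                     # 256 * 512 = 131072
lowrank_params = A.size + B.size         # 256*64 + 64*512 = 49152
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(orig_params, lowrank_params, round(rel_err, 3))
```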
{"title":"StreamSVD: Low-rank Approximation and Streaming Accelerator Co-design","authors":"Zhewen Yu, C. Bouganis","doi":"10.1109/ICFPT52863.2021.9609813","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609813","url":null,"abstract":"The post-training compression of a Convolutional Neural Network (CNN) aims to produce Pareto-optimal designs on the accuracy-performance frontier when the access to training data is not possible. Low-rank approximation is one of the methods that is often utilised in such cases. However, existing work considers the low-rank approximation of the network and the optimisation of the hardware accelerator separately, leading to systems with sub-optimal performance. This work focuses on the efficient mapping of a CNN into an FPGA device, and presents StreamSVD, a model-accelerator co-design framework1. The framework considers simultaneously the compression of a CNN model through a hardware-aware low-rank approximation scheme, and the optimisation of the hardware accelerator's architecture by taking into account the approximation scheme's compute structure. Our results show that the co-designed StreamSVD outperforms existing work that utilises similar low-rank approximation schemes by providing better accuracy-throughput trade-off. The proposed framework also achieves competitive performance compared with other post-training compression methods, even outperforming them under certain cases.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122001560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
General routing architecture modelling and exploration for modern FPGAs
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609935
Jiadong Qian, Yuhang Shen, Kaichuang Shi, Hao Zhou, Lingli Wang
Routing architecture has a significant impact on the area, critical path delay, and power consumption of modern FPGAs. The most common routing architecture for island-style FPGAs in academia is the CB-SB model, which cannot effectively model the complex routing architectures of modern FPGAs. To improve the routability and performance of the existing routing model, we propose a new routing model called the General Routing Block (GRB) to model complex commercial FPGAs. In the proposed model, all routing resources are divided into three modules: the general switch block (GSB), the input connection block (ICB), and the output connection block (OCB). The GSB and ICB extend the SB and CB with more flexible and richer connections. The OCB is a new module that provides novel connections for the LB output pins. We support a bent-wire architecture to reduce delay, and two-level MUXes with output sharing to achieve a better trade-off between area and flexibility. Moreover, to explore the trade-offs of different design spaces and find better architectures, an architecture exploration platform based on simulated annealing is proposed to efficiently explore the enormous design space specified by a set of parameters. The results of global design space exploration show that the architecture with the proposed GRB model reduces the critical path delay by 15.5% and the area-delay product by 14.8% compared to the length-4 CB-SB architecture on the VTR benchmarks. After further local subspace explorations, the best architecture achieves an 18.7% improvement in critical path delay and a 23.8% improvement in area-delay product, a significant improvement over other routing architectures.
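The exploration platform is based on simulated annealing over architecture parameters; the sketch below shows only the generic SA loop such a platform could use, with a stand-in cost function in place of a real flow that would synthesise the candidate architecture and return its area-delay product. The parameter names and evaluate() are placeholders, not the paper's tool interface.

```python
import math, random

def evaluate(params):
    """Placeholder for a real flow that implements the GRB architecture
    described by `params` and returns its area-delay product."""
    return (params["wire_length"] - 4) ** 2 + 0.1 * params["icb_fanin"]

def anneal(initial, neighbours, steps=1000, t0=10.0, alpha=0.995):
    current, best = dict(initial), dict(initial)
    cur_cost = best_cost = evaluate(current)
    t = t0
    for _ in range(steps):
        cand = neighbours(current)                  # perturb one parameter
        cost = evaluate(cand)
        # Accept improvements always, worse moves with Boltzmann probability.
        if cost < cur_cost or random.random() < math.exp((cur_cost - cost) / t):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = dict(cand), cost
        t *= alpha                                  # cool down
    return best, best_cost

def neighbours(p):
    q = dict(p)
    key = random.choice(list(q))
    q[key] = max(1, q[key] + random.choice([-1, 1]))
    return q

print(anneal({"wire_length": 8, "icb_fanin": 12}, neighbours))
```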
{"title":"General routing architecture modelling and exploration for modern FPGAs","authors":"Jiadong Qian, Yuhang Shen, Kaichuang Shi, Hao Zhou, Lingli Wang","doi":"10.1109/ICFPT52863.2021.9609935","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609935","url":null,"abstract":"Routing architecture has a significant impact on the area, critical path delay and power consumption of modern FPGAs. The most common routing architecture of island-style FPGAs in academia is the CB-SB model, which is not effective to model complex routing architectures in modern FPGAs. To improve the routability and performance of the existing routing model, we propose a new routing model called General Routing Block (GRB) to model complex commercial FPGAs. In the proposed model, all routing resources can be divided into three modules: general switch block (GSB), input connection block (ICB) and output connection block (OCB). The GSB and ICB are extended from the SB and CB with more flexible and richer connections. The OCB is a new module that provides novel connections for the LB output pins. We support bent wire architecture to reduce the delay, and two-level MUXes with output sharing to achieve a better trade-off between the area and flexibility. Moreover, to explore the trade-offs of different design spaces and find better architectures, an architecture exploration platform based on the simulated annealing algorithm is proposed to efficiently explore the enormous design space specified by a set of parameters. The results of global design space exploration show that the architecture with the proposed GRB model reduces the critical path delay by 15.5% and area-delay product by 14.8% compared to the length-4 CB-SB architecture based on the VTR benchmarks. After further local subspace explorations, the best architecture can achieve an 18.7% improvement on the critical path delay and a 23.8% improvement on the area-delay product, which represents a significant improvement over other routing architectures.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"306 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116606395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}