Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609927
Yuan Dai, Simin Liu, Yao Lu, Hao Zhou, Seyedramin Rasoulinezhad, Philip H. W. Leong, Lingli Wang
In error-tolerant applications such as low-precision DNNs and digital filters, approximate arithmetic circuits can significantly reduce hardware resource utilization. In this work we propose an embedded block for field-programmable gate arrays, called APIR-DSP, which incorporates an approximate 9×9 hard multiplier based on the PIR-DSP architecture to improve speed and reduce area. In addition, we develop a DSP unit evaluation platform based on Yosys and VPR that packs multiply-accumulate operations into DSP blocks. Using this tool we synthesize designs from Verilog implementations of matrix multiplication in DeepBench and the DoReFaNet low-precision neural network, and show that APIR-DSP significantly reduces DSP resources and improves hardware utilization and performance compared with the Xilinx DSP48E2 embedded block. Compared with exact multiplication, the accuracy loss is kept low, with the SNR of an FIR filter reduced by 1.03 dB. For DNNs, the accuracy loss for AlexNet on the CIFAR10 dataset is 0.31%, and no accuracy loss is observed for LeNet on the MNIST dataset. Synthesis results show that, compared with PIR-DSP, APIR-DSP reduces area by 21.60%, critical path delay by 4.85%, and power consumption by 2.80%.
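The trade-off between multiplier accuracy and output SNR can be illustrated with a small software model. The sketch below is not the APIR-DSP circuit, whose approximation scheme the abstract does not describe; it assumes a simple truncation-based approximate multiplier (the function name `approx_mul_9x9` and the `dropped_bits` parameter are hypothetical) and measures the resulting SNR against exact products.

```python
import math
import random
from typing import Sequence


def approx_mul_9x9(a: int, b: int, dropped_bits: int = 4) -> int:
    """Approximate signed 9x9 multiply that zeroes the low bits of the product.

    Truncation is an illustrative assumption, not the APIR-DSP design.
    """
    if not (-256 <= a <= 255 and -256 <= b <= 255):
        raise ValueError("operands must fit in 9-bit two's complement")
    exact = a * b
    mask = ~((1 << dropped_bits) - 1)
    # Python integers act like infinite two's complement, so the mask
    # rounds the product down to a multiple of 2**dropped_bits.
    return exact & mask


def snr_db(exact: Sequence[int], approx: Sequence[int]) -> float:
    """Signal-to-noise ratio in dB of `approx` relative to `exact`."""
    signal = sum(e * e for e in exact)
    noise = sum((e - p) ** 2 for e, p in zip(exact, approx))
    return math.inf if noise == 0 else 10.0 * math.log10(signal / noise)


if __name__ == "__main__":
    rng = random.Random(0)
    pairs = [(rng.randint(-256, 255), rng.randint(-256, 255)) for _ in range(1000)]
    exact = [a * b for a, b in pairs]
    approx = [approx_mul_9x9(a, b) for a, b in pairs]
    print(f"SNR with 4 dropped bits: {snr_db(exact, approx):.2f} dB")
```

Dropping more product bits would be expected to shrink the hardware and lower the SNR; the 1.03 dB figure in the abstract comes from the authors' FIR filter experiment, not from this model.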
{"title":"APIR-DSP: An approximate PIR-DSP architecture for error-tolerant applications","authors":"Yuan Dai, Simin Liu, Yao Lu, Hao Zhou, Seyedramin Rasoulinezhad, Philip H. W. Leong, Lingli Wang","doi":"10.1109/ICFPT52863.2021.9609927","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609927","journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)"},"publicationDate":"2021-12-06"}
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609833
J. Peltenburg, Á. Hadnagy, M. Brobbel, Robert Morrow, Z. Al-Ars
JSON is a popular data interchange format for many web, cloud, and IoT systems due to its simplicity, human readability, and widespread support. However, applications must first parse and convert the data to a native in-memory format before being able to perform useful computations. Many big data applications with high performance requirements convert JSON data to Apache Arrow RecordBatches, a widely used columnar in-memory format for large tabular data sets in data analytics. In this paper, we analyze the performance characteristics of such applications and show that JSON parsing represents a bottleneck in the system. Various strategies are explored to speed up JSON parsing on CPU and GPU as much as possible. Due to the performance limitations of the CPU and GPU implementations, we furthermore present an FPGA-accelerated implementation. We explain how hardware components that can parse variable-sized and nested structures can be combined to produce JSON parsers for any type of JSON document. Several fully integrated FPGA-accelerated JSON parser implementations are presented using the Intel Arria 10 GX and Xilinx VU37P devices, and compared to the performance of their respective host systems: an Intel Xeon and an IBM POWER9 system. Results show that the accelerators achieve an end-to-end throughput close to 7 GB/s with the Arria 10 GX using PCIe, and close to 20 GB/s with the VU37P using OpenCAPI 3. Depending on the complexity of the JSON data to parse, the bandwidth is limited by the host-to-accelerator interface or by the available FPGA resources. Overall, this provides a throughput increase of up to 6x compared to the baseline application. We also observe a full-system energy efficiency improvement of up to 59x more JSON data parsed per joule.
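To make the target of the conversion concrete, the following is a minimal pure-Python sketch of turning row-oriented JSON records into a columnar layout. It does not use the Arrow library and is not the authors' parser; the field names `id` and `samples` are hypothetical. The variable-length list column is stored as an offsets array plus a flat values array, which mirrors how Arrow lays out list-typed columns.

```python
import json


def json_lines_to_columns(json_lines):
    """Convert newline-delimited JSON records into column-oriented buffers.

    Each record is assumed to look like {"id": 1, "samples": [0.5, 1.5]}.
    The nested "samples" lists are flattened into one values buffer, with
    offsets[i]..offsets[i + 1] marking the slice that belongs to row i.
    """
    ids = []
    offsets = [0]
    values = []
    for line in json_lines:
        record = json.loads(line)
        ids.append(int(record["id"]))
        values.extend(float(v) for v in record["samples"])
        offsets.append(len(values))
    return {"id": ids, "samples": {"offsets": offsets, "values": values}}


if __name__ == "__main__":
    lines = [
        '{"id": 1, "samples": [0.5, 1.5]}',
        '{"id": 2, "samples": []}',
        '{"id": 3, "samples": [2.0]}',
    ]
    columns = json_lines_to_columns(lines)
    print(columns["id"])                   # [1, 2, 3]
    print(columns["samples"]["offsets"])   # [0, 2, 2, 3]
    print(columns["samples"]["values"])    # [0.5, 1.5, 2.0]
```

In this software version the cost is dominated by the character-by-character parsing inside `json.loads`, which is the step the abstract identifies as the bottleneck.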
{"title":"Tens of gigabytes per second JSON-to-Arrow conversion with FPGA accelerators","authors":"J. Peltenburg, Á. Hadnagy, M. Brobbel, Robert Morrow, Z. Al-Ars","doi":"10.1109/ICFPT52863.2021.9609833","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609833","journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)"},"publicationDate":"2021-12-06"}
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609901
Alexander Klemd, P. Nowak, Piero Rivera Benois, Etienne Gerat, U. Zölzer, B. Klauer
In this paper, a field-programmable gate array (FPGA) is considered as a digital signal processing platform for the implementation of an exponential sine sweep measurement algorithm. To minimize the required computational resources, two strategies are proposed. First, an oscillator implemented with the coordinate rotation digital computer (CORDIC) algorithm is used to generate the exponential sine sweep. Second, only those calculations are performed that lead to the linear impulse response of the system for a desired length. Furthermore, to minimize the required memory resources, the measured impulse response is stored in the memory previously allocated to the recorded signal. To validate the proposed implementation, measurements of an acoustical system are performed using a platform that is equipped with an FPGA and a processor. In this way, the results achieved by the FPGA fixed-point implementation can be compared to reference results achieved using a floating-point MATLAB implementation running on the processor. This comparison corroborates the validity of the proposed implementation.
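The sweep generation step can be sketched in software as follows. This is a generic rotation-mode CORDIC written in floating point for readability, not the authors' fixed-point oscillator; the function names and the sweep parameters in the example are assumptions. The phase follows the standard exponential sine sweep definition, phase(t) = 2*pi*f1*T / ln(f2/f1) * (exp(t/T * ln(f2/f1)) - 1).

```python
import math


def cordic_sin_cos(angle, iterations=24):
    """Return (sin, cos) of `angle` in [-pi/2, pi/2] using rotation-mode CORDIC."""
    if not -math.pi / 2 <= angle <= math.pi / 2:
        raise ValueError("angle must lie in [-pi/2, pi/2]")
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    # Pre-scale x so the accumulated CORDIC gain cancels out.
    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        shift = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)
    return y, x


def sin_via_cordic(phase):
    """Sine of an arbitrary phase, folded into the CORDIC convergence range."""
    theta = math.remainder(phase, 2.0 * math.pi)  # now in [-pi, pi]
    if theta > math.pi / 2:
        theta = math.pi - theta
    elif theta < -math.pi / 2:
        theta = -math.pi - theta
    return cordic_sin_cos(theta)[0]


def ess_phase(t, f1, f2, duration):
    """Phase of an exponential sine sweep from f1 to f2 Hz over `duration` seconds."""
    k = math.log(f2 / f1)
    return 2.0 * math.pi * f1 * duration / k * (math.exp(t / duration * k) - 1.0)


if __name__ == "__main__":
    fs, f1, f2, duration = 48000, 20.0, 20000.0, 1.0  # example values only
    sweep = [sin_via_cordic(ess_phase(n / fs, f1, f2, duration)) for n in range(8)]
    print(sweep)
```

Each CORDIC iteration needs only shifts, additions, and one table lookup, which is why the algorithm suits FPGA logic; a fixed-point version would replace `math.atan(shift)` with a precomputed table.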
{"title":"Exponential sine sweep measurement implementation targeting FPGA platforms","authors":"Alexander Klemd, P. Nowak, Piero Rivera Benois, Etienne Gerat, U. Zölzer, B. Klauer","doi":"10.1109/ICFPT52863.2021.9609901","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609901","journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)"},"publicationDate":"2021-12-06"}
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609841
Jing Tan, Gaofeng Lv, Yanni Ma, Guanjie Qiao
Packet classification is a fundamental problem in networking. With the rapid growth of network bandwidth, wire-speed packet classification has become a key challenge for next-generation network processors. In this paper, we propose a decision-tree-based, multi-pipeline architecture for a packet classification accelerator in a Data Processing Unit (DPU). Our solution is based on MBitTree, a memory-efficient decision tree algorithm for packet classification. First, we present a parallel architecture composed of multiple linear pipelines for efficiently mapping the decision tree built by MBitTree. Second, special logic is designed to quickly traverse the decision tree, reducing the logic delay of each pipeline stage. Finally, several pipeline optimization techniques are proposed to improve the performance of the architecture. The implementation results show that our architecture can achieve more than 250 Gbps throughput for 64-byte minimum-size Ethernet packets, and can store 100K rules in the on-chip memory of a single NetFPGA_SUME.
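To put the throughput figure in perspective, the sketch below converts a line rate into a packet rate for 64-byte frames. The abstract does not say whether the 250 Gbps figure counts the Ethernet preamble and inter-frame gap, so both conventions are shown; the function name and the 20-byte overhead parameter are assumptions for illustration.

```python
def packets_per_second(throughput_bps, frame_bytes, overhead_bytes=0):
    """Packet rate implied by a bit rate and a per-packet size in bytes.

    `overhead_bytes` models per-frame wire overhead (for Ethernet, an 8-byte
    preamble plus a 12-byte inter-frame gap is commonly counted as 20 bytes).
    """
    if frame_bytes + overhead_bytes <= 0:
        raise ValueError("packet size must be positive")
    return throughput_bps / ((frame_bytes + overhead_bytes) * 8)


if __name__ == "__main__":
    rate = 250e9  # 250 Gbps, from the abstract
    print(f"{packets_per_second(rate, 64) / 1e6:.1f} Mpps without wire overhead")
    print(f"{packets_per_second(rate, 64, 20) / 1e6:.1f} Mpps with 20 bytes of wire overhead")
```

Either way the classifier has to complete several hundred million lookups per second, which is consistent with the use of multiple parallel pipelines.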
{"title":"High-performance pipeline architecture for packet classification accelerator in DPU","authors":"Jing Tan, Gaofeng Lv, Yanni Ma, Guanjie Qiao","doi":"10.1109/ICFPT52863.2021.9609841","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609841","journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)"},"publicationDate":"2021-12-06"}
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609946
Hossein Borhanifar, Hamed Jani, Mohammad Mahdi Gohari, Amir Hossein Heydarian, Mostafa Lashkari, Mohammad Reza Lashkari
In this paper, a method for controlling an autonomous vehicle based on real-time image processing is presented. The system detects the road by Sobel edge detection, and it recognizes obstacles, humans, and traffic lights by heuristic techniques. For this purpose, some features are defined based on key points. One important technique is that every frame is divided into sections, which significantly affects the processing time. The system is able to analyze 30 frames per second to reach the best decision for controlling the vehicle. The resulting structure is optimized for accuracy and for the number of logic cells. The algorithms are described entirely in VHDL and then implemented on a DE1-SoC board, which uses a Cyclone V FPGA.
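The Sobel edge detection step mentioned above can be sketched in a few lines. This is a plain software reference, not the authors' VHDL design; the function name is hypothetical, and the use of |Gx| + |Gy| instead of the Euclidean magnitude is an assumption chosen because it avoids a square root.

```python
def sobel_magnitude(image):
    """Approximate Sobel gradient magnitude |Gx| + |Gy| of a grayscale image.

    `image` is a list of equally long rows of pixel intensities.
    Border pixels are left at zero.
    """
    gx_kernel = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
    gy_kernel = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = gy = 0
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    gx += gx_kernel[dr][dc] * pixel
                    gy += gy_kernel[dr][dc] * pixel
            out[r][c] = abs(gx) + abs(gy)
    return out


if __name__ == "__main__":
    # A vertical step edge: dark on the left, bright on the right.
    step = [[0, 0, 255, 255] for _ in range(4)]
    for row in sobel_magnitude(step):
        print(row)
```

Because each output pixel depends only on a 3×3 neighbourhood, sections of a frame can be processed independently, which fits the frame-splitting technique described in the abstract.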
{"title":"Fast controling autonomous vehicle based on real time image processing","authors":"Hossein Borhanifar, Hamed Jani, Mohammad Mahdi Gohari, Amir Hossein Heydarian, Mostafa Lashkari, Mohammad Reza Lashkari","doi":"10.1109/ICFPT52863.2021.9609946","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609946","journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)"},"publicationDate":"2021-12-06"}
To support the rich functions of the application layer, a robust and accurate Simultaneous Localization and Mapping (SLAM) technique is critical for robotics. However, due to the lack of sufficient computing power and storage capacity, it is challenging to deploy high-accuracy SLAM efficiently on embedded devices. In this work, we propose a complete acceleration scheme, termed ac2SLAM, based on the ORB-SLAM2 algorithm including both front and back ends, and implement it on an FPGA platform. Specifically, the proposed ac2SLAM features: 1) a scalable and parallel ORB extractor to extract sufficient keypoints and scores for throughput matching with 4% error, 2) a PingPong heapsort component (pp-heapsort) to select the significant keypoints, which achieves a single-cycle initiation interval and reduces the amount of data transferred between the accelerator and the host CPU, and 3) potential parallel acceleration strategies for the back-end optimization. Compared with running ORB-SLAM2 on the ARM processor, ac2SLAM is 2.1× and 2.7× faster on the TUM and KITTI datasets, respectively, while maintaining 10% error of SOTA eSLAM. In addition, the FPGA-accelerated front end is 4.55× and 40× faster than eSLAM and ARM, respectively. The ac2SLAM is fully open-sourced at https://github.com/SLAM-Hardware/acSLAM.
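The idea of using a heap to keep only the most significant keypoints can be shown with a short software analogue. This is not the pp-heapsort hardware component; it is a streaming top-k selection with a bounded min-heap, and the `(score, x, y)` tuple layout and function name are assumptions.

```python
import heapq


def select_top_keypoints(keypoints, k):
    """Return the k highest-scoring keypoints from a stream of (score, x, y).

    A min-heap of at most k entries is maintained, so memory stays bounded
    and only the selected keypoints need to be passed on.
    """
    if k <= 0:
        return []
    heap = []
    for index, (score, x, y) in enumerate(keypoints):
        entry = (score, index, (x, y))  # index breaks ties between equal scores
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [(score, xy) for score, _, xy in sorted(heap, reverse=True)]


if __name__ == "__main__":
    stream = [(0.1, 0, 0), (0.9, 1, 1), (0.5, 2, 2), (0.7, 3, 3)]
    print(select_top_keypoints(stream, 2))  # [(0.9, (1, 1)), (0.7, (3, 3))]
```

Filtering on the accelerator side in this way means only k entries cross the accelerator-to-host link instead of every extracted keypoint, which is the data-transfer saving the abstract refers to.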
{"title":"ac2SLAM: FPGA Accelerated High-Accuracy SLAM with Heapsort and Parallel Keypoint Extractor","authors":"Cheng Wang, Yingkun Liu, Kedai Zuo, Jianming Tong, Yan Ding, Pengju Ren","doi":"10.1109/ICFPT52863.2021.9609808","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609808","journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)"},"publicationDate":"2021-12-06"}
Quantum computers will be a revolutionary extension of the heterogeneous computing world. They consist of many quantum bits (qubits) and require a careful design of the interface between the classical computer architecture and the quantum processor. Even single-nanosecond variations in the interaction may influence the quantum state. In this paper, we present the modular design of the FPGA firmware that is part of our qubit control electronics. It features so-called digital unit cells, where each cell contains all the logic necessary to interact with a single superconducting qubit. The cell includes a custom-built RISC-V-based sequencer, as well as two signal generators and a signal recorder. Internal communication within the cell is handled using a modified Wishbone bus with a custom 2-to-N interconnect and deterministic broadcast functionality. We furthermore provide the resource utilization of our design and demonstrate its correct operation using an actual superconducting five-qubit chip.
{"title":"A modular RFSoC-based approach to interface superconducting quantum bits","authors":"R. Gebauer, N. Karcher, Mehmed Güler, O. Sander","doi":"10.1145/3571820","DOIUrl":"https://doi.org/10.1145/3571820","journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)"},"publicationDate":"2021-12-06"}
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609868
Tiago Santos, N. Paulino, João Bispo, João M. P. Cardoso, João C. Ferreira
By using Dynamic Binary Translation, instruction traces from pre-compiled applications can be offloaded, at runtime, to FPGA-based accelerators, such as Coarse-Grained Loop Accelerators, in a transparent way. However, scheduling onto coarse-grain accelerators is challenging, with two of the currently known issues being the density of computations that can be mapped and the effects of memory accesses on performance. Using an in-house framework for the analysis of instruction traces, we explore the effect of different window sizes when applying list scheduling to map the window operations to a coarse-grain loop accelerator model that has previously been experimentally validated. For all window sizes, we vary the number of ALUs and memory ports available in the model, and comment on how these parameters affect the resulting latency. For a set of benchmarks taken from the PolyBench suite, compiled for the 32-bit MicroBlaze softcore, we achieve an average iteration speedup of 5.10x for a basic block repeated 5 times and scheduled with 8 ALUs and memory ports, and an average speedup of 5.46x when not considering resource constraints. We also identify which benchmarks contribute to the difference between these two speedups, and break down their limiting factors. Finally, we reflect on the impact memory dependencies have on scheduling.
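A resource-constrained list scheduler of the kind referred to above can be sketched briefly. This is a simplified model, not the authors' framework: every operation is assumed to take one cycle, priority is plain program order, and the operation names in the example are hypothetical.

```python
def list_schedule(ops, deps, n_alus, n_mem_ports):
    """Assign each operation a start cycle under ALU and memory-port limits.

    ops:  dict mapping operation name -> "alu" or "mem", in program order.
    deps: dict mapping operation name -> iterable of operations it waits for.
    Returns a dict mapping operation name -> cycle. Unit latency is assumed.
    """
    if n_alus < 1 or n_mem_ports < 1:
        raise ValueError("at least one ALU and one memory port are required")
    capacity = {"alu": n_alus, "mem": n_mem_ports}
    scheduled = {}
    remaining = list(ops)
    cycle = 0
    while remaining:
        free = dict(capacity)
        issued = []
        for name in remaining:
            # Ready only if every predecessor finished in an earlier cycle.
            ready = all(
                p in scheduled and scheduled[p] < cycle for p in deps.get(name, ())
            )
            if ready and free[ops[name]] > 0:
                free[ops[name]] -= 1
                issued.append(name)
        if not issued:
            raise ValueError("dependency cycle or unknown predecessor")
        for name in issued:
            scheduled[name] = cycle
            remaining.remove(name)
        cycle += 1
    return scheduled


def latency(schedule):
    """Total cycles used by a schedule with unit-latency operations."""
    return max(schedule.values()) + 1 if schedule else 0


if __name__ == "__main__":
    ops = {"load_a": "mem", "load_b": "mem", "add": "alu", "store": "mem"}
    deps = {"add": ["load_a", "load_b"], "store": ["add"]}
    print(latency(list_schedule(ops, deps, n_alus=1, n_mem_ports=1)))  # 4
    print(latency(list_schedule(ops, deps, n_alus=1, n_mem_ports=2)))  # 3
```

In this toy example a second memory port shortens the schedule by one cycle, while further ports would not help because the remaining operations are serialized by their dependencies. This is the same kind of resource-versus-dependency limit the abstract examines.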
{"title":"On the Performance Effect of Loop Trace Window Size on Scheduling for Configurable Coarse Grain Loop Accelerators","authors":"Tiago Santos, N. Paulino, João Bispo, João M. P. Cardoso, João C. Ferreira","doi":"10.1109/ICFPT52863.2021.9609868","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609868","journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)"},"publicationDate":"2021-12-06"}
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609837
R. Abra, Dmitry Denisenko, Richard Allen, Tim Vanderhoek, Sarah Wolstencroft, Peter M. Gibson
Block Floating Point (BFP) is a type of quantization that combines high dynamic range with low-cost inference. BFP can be implemented efficiently on FPGA hardware and, at low precision, halves the logic footprint versus blocked FP16 while maintaining accuracy. Moving to very low precision halves the logic footprint again and retraining allows the recovery of any accuracy lost in transition. This paper describes our approach to achieving target accuracy and FPGA resource usage in a low-precision end-to-end AI solution. We go on to investigate the effects of retraining with our software model that replicates the low-level implementation of BFP on FPGA. Our solution allows efficacy testing for the quantization of custom networks and provides accuracy indications and resource usage for the final application. Using our solution, we were able to quantize ResNet 50, SSD300 and UNet to int5/4bfp precision without losing accuracy while reducing FPGA resources and improving performance.
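The basic mechanism of block floating point can be illustrated with a short sketch: all values in a block share one exponent, and each value keeps only a small signed integer mantissa. This is a generic model, not the authors' software model or their int5/4bfp formats; the function names and the choice of rounding are assumptions.

```python
import math


def bfp_quantize(block, mantissa_bits):
    """Quantize a block of floats to (shared_exponent, signed integer mantissas).

    The exponent is chosen so that the largest magnitude in the block fits
    into a signed mantissa of `mantissa_bits` bits (sign included).
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    limit = 2 ** (mantissa_bits - 1) - 1
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1.
    exponent = math.frexp(max_abs)[1] - (mantissa_bits - 1)
    scale = 2.0 ** exponent
    # round() uses round-half-to-even; clamp in case rounding overflows.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return exponent, mantissas


def bfp_dequantize(exponent, mantissas):
    """Reconstruct floating-point values from a BFP block."""
    return [m * 2.0 ** exponent for m in mantissas]


if __name__ == "__main__":
    block = [1.0, 0.5, -0.25, 0.0]
    exponent, mantissas = bfp_quantize(block, mantissa_bits=4)
    print(exponent, mantissas)                  # -2 [4, 2, -1, 0]
    print(bfp_dequantize(exponent, mantissas))  # [1.0, 0.5, -0.25, 0.0]
```

Because the exponent is stored once per block, multiply-accumulate within a block reduces to narrow integer arithmetic, which is where the logic savings described in the abstract would come from. Small values that share a block with a large value lose precision, and retraining is one way to recover the resulting accuracy loss.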
{"title":"Low Precision Networks for Efficient Inference on FPGAs","authors":"R. Abra, Dmitry Denisenko, Richard Allen, Tim Vanderhoek, Sarah Wolstencroft, Peter M. Gibson","doi":"10.1109/ICFPT52863.2021.9609837","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609837","journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)"},"publicationDate":"2021-12-06"}
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609855
A. Kojima
We are developing an FPGA-based robot car for the FPT'21 FPGA design competition. The FPGA we are using is a Xilinx UltraScale+ MPSoC, an SoC-type FPGA that mainly consists of a processor part and a programmable logic part. Among the functions of the autonomous driving system, lane keeping, localization, driving planning, and obstacle avoidance are implemented as software in the processor part. For object detection, YOLO, a machine learning algorithm, is implemented as hardware in the programmable logic part using the DPU IP provided by Xilinx. The PWM circuits for controlling the DC motor and servo motor are also implemented as original hardware.
{"title":"Autonomous Driving System implemented on Robot Car using SoC FPGA","authors":"A. Kojima","doi":"10.1109/ICFPT52863.2021.9609855","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609855","journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)"},"publicationDate":"2021-12-06"}