
2021 International Conference on Field-Programmable Technology (ICFPT) — Latest publications

APIR-DSP: An approximate PIR-DSP architecture for error-tolerant applications
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609927
Yuan Dai, Simin Liu, Yao Lu, Hao Zhou, Seyedramin Rasoulinezhad, Philip H. W. Leong, Lingli Wang
In error-tolerant applications such as low-precision DNNs and digital filters, approximate arithmetic circuits can significantly reduce hardware resource utilization. In this work we propose an embedded block for field-programmable gate arrays, called APIR-DSP, which incorporates an approximate 9×9 hard multiplier based on the PIR-DSP architecture to improve speed and reduce area. In addition, a DSP unit evaluation platform based on Yosys and VPR, which packs multiply-accumulate operations into DSP blocks, is developed. Using this tool we synthesize designs from Verilog implementations of matrix multiplication in DeepBench and of the DoReFaNet low-precision neural network, and show that APIR-DSP significantly reduces DSP resources and improves hardware utilization and performance compared with the Xilinx DSP48E2 embedded block. Compared with exact multiplication, the accuracy loss is small: the SNR of an FIR filter is reduced by only 1.03 dB. For DNNs, the accuracy loss for AlexNet on the CIFAR10 dataset is 0.31%, and no accuracy loss is observed for LeNet on the MNIST dataset. Synthesis results show that, compared with PIR-DSP, APIR-DSP achieves an area reduction of 21.60%, a critical-path reduction of 4.85%, and a power reduction of 2.80%.
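Editor's note: as an illustration of the general idea behind approximate multiplication (not the specific APIR-DSP circuit, whose internals are in the paper), the following Python sketch truncates the low-order partial-product columns of a 9×9 unsigned multiplication and measures the resulting error; the truncation width `k` is an assumed parameter.

```python
import random

def approx_mul_9x9(a: int, b: int, k: int = 4) -> int:
    """Approximate 9x9 unsigned multiply that drops the k least-significant
    partial-product columns (a common approximation strategy; illustrative
    only, not the APIR-DSP circuit)."""
    assert 0 <= a < 512 and 0 <= b < 512
    result = 0
    for i in range(9):                        # iterate over bits of b
        if (b >> i) & 1:
            partial = a << i                  # shifted partial product
            result += partial & ~((1 << k) - 1)  # zero the low-k columns
    return result

if __name__ == "__main__":
    errs = [abs(approx_mul_9x9(a, b) - a * b)
            for a, b in ((random.randrange(512), random.randrange(512))
                         for _ in range(10000))]
    print("mean absolute error:", sum(errs) / len(errs))
```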
Citations: 0
Tens of gigabytes per second JSON-to-Arrow conversion with FPGA accelerators
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609833
J. Peltenburg, Á. Hadnagy, M. Brobbel, Robert Morrow, Z. Al-Ars
JSON is a popular data interchange format for many web, cloud, and IoT systems due to its simplicity, human readability, and widespread support. However, applications must first parse and convert the data to a native in-memory format before being able to perform useful computations. Many big data applications with high performance requirements convert JSON data to Apache Arrow RecordBatches, the latter being a widely used columnar in-memory format for the large tabular data sets used in data analytics. In this paper, we analyze the performance characteristics of such applications and show that JSON parsing represents a bottleneck in the system. Various strategies are explored to speed up JSON parsing on CPU and GPU as much as possible. Due to the performance limitations of the CPU and GPU implementations, we furthermore present an FPGA-accelerated implementation. We explain how hardware components that can parse variable-sized and nested structures can be combined to produce JSON parsers for any type of JSON document. Several fully integrated FPGA-accelerated JSON parser implementations are presented using the Intel Arria 10 GX and Xilinx VU37P devices and compared to the performance of their respective host systems: an Intel Xeon and an IBM POWER9 system. Results show the accelerators achieve an end-to-end throughput close to 7 GB/s with the Arria 10 GX using PCIe, and close to 20 GB/s with the VU37P using OpenCAPI 3. Depending on the complexity of the JSON data to parse, the bandwidth is limited by the host-to-accelerator interface or by the available FPGA resources. Overall, this provides a throughput increase of up to 6x compared to the baseline application. Also, we observe a full-system energy efficiency improvement of up to 59x more JSON data parsed per joule.
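Editor's note: as a point of reference for the software path the accelerators replace, a minimal JSON-to-Arrow conversion can be written with the standard `json` module and `pyarrow`; the field names and schema below are illustrative assumptions, not the paper's benchmark data.

```python
import json
import pyarrow as pa

def json_lines_to_record_batch(lines):
    """Parse newline-delimited JSON records and pack them into an Apache
    Arrow RecordBatch (columnar in-memory format). Fields are assumed."""
    ids, names, values = [], [], []
    for line in lines:
        obj = json.loads(line)          # CPU-side parsing: the bottleneck
        ids.append(obj["id"])
        names.append(obj["name"])
        values.append(obj["value"])
    return pa.RecordBatch.from_arrays(
        [pa.array(ids, pa.int64()),
         pa.array(names, pa.string()),
         pa.array(values, pa.float64())],
        names=["id", "name", "value"])

batch = json_lines_to_record_batch([
    '{"id": 1, "name": "a", "value": 0.5}',
    '{"id": 2, "name": "b", "value": 1.5}',
])
print(batch.num_rows, batch.schema)
```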
Citations: 6
Exponential sine sweep measurement implementation targeting FPGA platforms
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609901
Alexander Klemd, P. Nowak, Piero Rivera Benois, Etienne Gerat, U. Zölzer, B. Klauer
In this paper a field-programmable gate array (FPGA) is considered as a digital signal processing platform for the implementation of an exponential sine sweep measurement algorithm. Aiming at minimizing the required computational resources, two strategies are proposed. Firstly, an oscillator implemented with the coordinate rotation digital computer (CORDIC) algorithm is used to generate the exponential sine sweep. Secondly, only those calculations are performed that yield the linear impulse response of the system up to a desired length. Furthermore, aiming at minimizing the required memory resources, the measured impulse response is stored in the memory previously allocated to the recorded signal. In order to validate the proposed implementation, measurements of an acoustical system are performed using a platform that is equipped with an FPGA and a processor. In this way, the results achieved by the FPGA fixed-point implementation can be compared to reference results achieved using a floating-point MATLAB implementation running on the processor. This comparison corroborates the validity of the proposed implementation.
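Editor's note: the exponential sine sweep stimulus has a well-known closed form; a minimal NumPy sketch of its generation is shown below. Sample rate, sweep length, and frequency range are arbitrary illustrative values, and the paper generates the sweep on the FPGA with a CORDIC oscillator rather than this floating-point formula.

```python
import numpy as np

def exponential_sine_sweep(f1, f2, duration, fs):
    """Exponential sine sweep x(t) = sin(K * (exp(t/L) - 1)) with
    L = duration / ln(f2/f1) and K = 2*pi*f1*L (standard Farina sweep)."""
    t = np.arange(int(duration * fs)) / fs
    L = duration / np.log(f2 / f1)
    K = 2 * np.pi * f1 * L
    return np.sin(K * (np.exp(t / L) - 1.0))

# Illustrative parameters: 20 Hz .. 20 kHz sweep, 5 s at 48 kHz.
sweep = exponential_sine_sweep(20.0, 20000.0, 5.0, 48000)
print(sweep.shape, sweep.dtype)
```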
Citations: 0
High-performance pipeline architecture for packet classification accelerator in DPU
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609841
Jing Tan, Gaofeng Lv, Yanni Ma, Guanjie Qiao
Packet classification is a fundamental problem in networking. With the rapid growth of network bandwidth, wire-speed packet classification has become a key challenge for next-generation network processors. In this paper, we propose a decision-tree-based, multi-pipeline architecture for a packet classification accelerator in a Data Processing Unit (DPU). Our solution is based on MBitTree, a memory-efficient decision tree algorithm for packet classification. First, we present a parallel architecture composed of multiple linear pipelines for efficiently mapping the decision tree built by MBitTree. Second, a special logic is designed to quickly traverse the decision tree, reducing the logic delay of the pipeline stages. Finally, several pipeline optimization techniques are proposed to improve the performance of the architecture. The implementation results show that our architecture can achieve more than 250 Gbps throughput for 64-byte minimum-size Ethernet packets, and can store 100K rules in the on-chip memory of a single NetFPGA_SUME.
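Editor's note: to illustrate the kind of per-stage work a decision-tree classifier pipeline performs, the sketch below walks a generic bit-selecting decision tree over a packet header. The node layout and bit-selection scheme are simplified assumptions and do not reproduce MBitTree itself.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    """Internal nodes select a group of header bits; leaves hold rule IDs.
    Simplified illustration of decision-tree packet classification."""
    bit_positions: Optional[List[int]] = None   # bits used to index children
    children: Optional[List["Node"]] = None
    rules: Optional[List[int]] = None           # candidate rule IDs at a leaf

def classify(header: int, root: Node) -> List[int]:
    node = root
    while node.rules is None:                   # descend until a leaf
        index = 0
        for pos in node.bit_positions:          # build child index from bits
            index = (index << 1) | ((header >> pos) & 1)
        node = node.children[index]
    return node.rules                           # few candidates remain at a leaf

# Tiny example tree: split on header bits 31 and 15.
leaves = [Node(rules=[i]) for i in range(4)]
root = Node(bit_positions=[31, 15], children=leaves)
print(classify(0x8000_8000, root))              # -> [3]
```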
Citations: 5
Fast controling autonomous vehicle based on real time image processing
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609946
Hossein Borhanifar, Hamed Jani, Mohammad Mahdi Gohari, Amir Hossein Heydarian, Mostafa Lashkari, Mohammad Reza Lashkari
In this paper, a method for autonomous vehicles based on real-time image processing is presented. The system detects the road by Sobel edge detection, and it recognizes obstacles, humans, and traffic lights by heuristic techniques. For this aim, some features are defined based on key points. One important technique is that every frame is divided into sections, which significantly affects the processing time. The system is able to analyze 30 frames per second to make the best decision for controlling the vehicle. The resulting architecture is optimized for accuracy and the number of logic cells. The algorithms are completely described in hardware in VHDL and then implemented on a DE1-SoC board which uses a Cyclone V FPGA.
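Editor's note: the road-detection step relies on standard Sobel edge detection; a minimal NumPy version of the Sobel gradient magnitude is sketched below as a software reference for what the VHDL pipeline computes. The kernels and loop structure are the usual textbook form, not necessarily the authors' exact hardware datapath.

```python
import numpy as np

def sobel_magnitude(gray: np.ndarray) -> np.ndarray:
    """Gradient magnitude with the 3x3 Sobel kernels (software reference)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
    ky = kx.T                                   # vertical-gradient kernel
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.float32)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            win = gray[y - 1:y + 2, x - 1:x + 2].astype(np.float32)
            gx = np.sum(win * kx)
            gy = np.sum(win * ky)
            out[y, x] = np.hypot(gx, gy)        # edge strength
    return out

edges = sobel_magnitude(np.random.randint(0, 256, (64, 64)).astype(np.uint8))
print(edges.max())
```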
Citations: 0
ac2SLAM: FPGA Accelerated High-Accuracy SLAM with Heapsort and Parallel Keypoint Extractor
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609808
Cheng Wang, Yingkun Liu, Kedai Zuo, Jianming Tong, Yan Ding, Pengju Ren
In order to fulfill the rich functions of the application layer, robust and accurate Simultaneous Localization and Mapping (SLAM) is critical for robotics. However, due to the lack of sufficient computing power and storage capacity, it is challenging to deploy high-accuracy SLAM efficiently on embedded devices. In this work, we propose a complete acceleration scheme, termed ac2SLAM, based on the ORB-SLAM2 algorithm, covering both the front and back ends, and implement it on an FPGA platform. Specifically, the proposed ac2SLAM features: 1) a scalable and parallel ORB extractor that extracts sufficient keypoints and scores for throughput matching with 4% error, 2) a PingPong heapsort component (pp-heapsort) that selects the significant keypoints and achieves a single-cycle initiation interval to reduce the amount of data transferred between the accelerator and the host CPU, and 3) potential parallel acceleration strategies for back-end optimization. Compared with running ORB-SLAM2 on the ARM processor, ac2SLAM runs 2.1× and 2.7× faster on the TUM and KITTI datasets, while maintaining 10% error of the SOTA eSLAM. In addition, the FPGA-accelerated front end is 4.55× and 40× faster than eSLAM and ARM, respectively. ac2SLAM is fully open-sourced at https://github.com/SLAM-Hardware/acSLAM.
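Editor's note: the role of the pp-heapsort component, keeping only the highest-scoring keypoints so that less data crosses to the host, can be mimicked in software with a size-bounded min-heap. The sketch below uses Python's heapq and an assumed (score, x, y) keypoint layout; it is only a functional stand-in for the single-cycle hardware pipeline.

```python
import heapq
from typing import Iterable, List, Tuple

Keypoint = Tuple[float, int, int]   # (score, x, y) -- assumed layout

def select_top_keypoints(keypoints: Iterable[Keypoint], n: int) -> List[Keypoint]:
    """Keep the n highest-scoring keypoints using a size-bounded min-heap,
    a software analogue of the hardware keypoint selector."""
    heap: List[Keypoint] = []
    for kp in keypoints:
        if len(heap) < n:
            heapq.heappush(heap, kp)        # heap ordered by score (first field)
        elif kp > heap[0]:                  # better than the current worst kept
            heapq.heapreplace(heap, kp)
    return sorted(heap, reverse=True)       # best first

pts = [(0.9, 10, 20), (0.1, 3, 4), (0.5, 7, 7), (0.8, 1, 1)]
print(select_top_keypoints(pts, 2))         # -> [(0.9, 10, 20), (0.8, 1, 1)]
```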
Citations: 6
A modular RFSoC-based approach to interface superconducting quantum bits
R. Gebauer, N. Karcher, Mehmed Güler, O. Sander
Quantum computers will be a revolutionary extension of the heterogeneous computing world. They consist of many quantum bits (qubits) and require a careful design of the interface between the classical computer architecture and the quantum processor. Even single-nanosecond variations of the interaction may have an influence on the quantum state. In this paper, we present the modular design of the FPGA firmware which is part of our qubit control electronics. It features so-called digital unit cells where each cell contains all the logic necessary to interact with a single superconducting qubit. The cell includes a custom-built RISC-V-based sequencer, as well as two signal generators and a signal recorder. Internal communication within the cell is handled using a modified Wishbone bus with custom 2-to-N interconnect and deterministic broadcast functionality. We furthermore provide the resource utilization of our design and demonstrate its correct operation using an actual superconducting five-qubit chip.
Citations: 2
On the Performance Effect of Loop Trace Window Size on Scheduling for Configurable Coarse Grain Loop Accelerators
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609868
Tiago Santos, N. Paulino, João Bispo, João M. P. Cardoso, João C. Ferreira
By using Dynamic Binary Translation, instruction traces from pre-compiled applications can be offloaded, at runtime, to FPGA-based accelerators, such as Coarse-Grained Loop Accelerators, in a transparent way. However, scheduling onto coarse-grain accelerators is challenging, two of the currently known issues being the density of computations that can be mapped and the effect of memory accesses on performance. Using an in-house framework for the analysis of instruction traces, we explore the effect of different window sizes when applying list scheduling to map the window operations onto a coarse-grain loop accelerator model that has been previously experimentally validated. For all window sizes, we vary the number of ALUs and memory ports available in the model, and comment on how these parameters affect the resulting latency. For a set of benchmarks taken from the PolyBench suite, compiled for the 32-bit MicroBlaze softcore, we achieved an average iteration speedup of 5.10x for a basic block repeated 5 times and scheduled with 8 ALUs and memory ports, and an average speedup of 5.46x when not considering resource constraints. We also identify which benchmarks contribute to the difference between these two speedups and break down their limiting factors. Finally, we reflect on the impact memory dependencies have on scheduling.
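Editor's note: resource-constrained list scheduling is the core of the mapping step; the sketch below schedules a small dependence DAG under a unit-latency, k-ALU constraint as a simplified model of what is done per trace window. Latencies, priorities, and the example graph are assumptions for illustration, not the framework's actual heuristics.

```python
from typing import Dict, List

def list_schedule(deps: Dict[str, List[str]], num_alus: int) -> Dict[str, int]:
    """Greedy list scheduling of a dependence DAG onto num_alus unit-latency
    ALUs: each cycle, issue up to num_alus operations whose predecessors
    have already finished. Returns op -> issue cycle."""
    schedule: Dict[str, int] = {}
    remaining = set(deps)
    cycle = 0
    while remaining:
        ready = [op for op in sorted(remaining)
                 if all(p in schedule and schedule[p] < cycle for p in deps[op])]
        for op in ready[:num_alus]:          # resource constraint per cycle
            schedule[op] = cycle
            remaining.discard(op)
        cycle += 1
    return schedule

# a, b feed c; c and d feed e (illustrative basic-block DAG).
dag = {"a": [], "b": [], "c": ["a", "b"], "d": [], "e": ["c", "d"]}
print(list_schedule(dag, num_alus=2))        # e.g. {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 2}
```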
Citations: 0
Low Precision Networks for Efficient Inference on FPGAs
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609837
R. Abra, Dmitry Denisenko, Richard Allen, Tim Vanderhoek, Sarah Wolstencroft, Peter M. Gibson
Block Floating Point (BFP) is a type of quantization that combines high dynamic range with low-cost inference. BFP can be implemented efficiently on FPGA hardware and, at low precision, halves the logic footprint versus blocked FP16 while maintaining accuracy. Moving to very low precision halves the logic footprint again and retraining allows the recovery of any accuracy lost in transition. This paper describes our approach to achieving target accuracy and FPGA resource usage in a low-precision end-to-end AI solution. We go on to investigate the effects of retraining with our software model that replicates the low-level implementation of BFP on FPGA. Our solution allows efficacy testing for the quantization of custom networks and provides accuracy indications and resource usage for the final application. Using our solution, we were able to quantize ResNet 50, SSD300 and UNet to int5/4bfp precision without losing accuracy while reducing FPGA resources and improving performance.
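Editor's note: the essence of block floating point, one shared exponent per block plus short integer mantissas, can be captured in a few lines of NumPy. The block size and 5-bit mantissa below mirror the int5 configuration mentioned in the abstract, while the rounding and exponent choices are simplified assumptions rather than the vendor implementation.

```python
import numpy as np

def bfp_quantize(x: np.ndarray, block_size: int = 32, mant_bits: int = 5):
    """Quantize a 1-D array to block floating point: each block shares one
    exponent and stores signed mantissas of mant_bits bits (simplified)."""
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size).astype(np.float64)
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    # Shared exponent chosen so the largest value fits the mantissa range.
    exps = np.where(max_abs > 0, np.ceil(np.log2(max_abs + 1e-30)), 0)
    scale = 2.0 ** (exps - (mant_bits - 1))
    mants = np.clip(np.round(blocks / scale),
                    -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1)
    return (mants * scale).reshape(-1)[:len(x)]   # dequantized approximation

x = np.random.randn(100).astype(np.float32)
xq = bfp_quantize(x)
print("max abs error:", np.max(np.abs(x - xq)))
```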
Citations: 1
Autonomous Driving System implemented on Robot Car using SoC FPGA
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609855
A. Kojima
We are developing an FPGA-based robot car for the FPT'21 FPGA design competition. The FPGA we are using is a Xilinx UltraScale+ MPSoC, an SoC-type FPGA that mainly consists of a processor part and a programmable logic part. Among the functions of the autonomous driving system, lane keeping, localization, driving planning, and obstacle avoidance are implemented as software in the processor part. For object detection, Yolo, a machine-learning algorithm, is implemented as hardware in the programmable logic part using the DPU IP provided by Xilinx. The PWM circuits for controlling the DC motor and the servo motor are also implemented as custom hardware.
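Editor's note: the servo PWM circuit boils down to translating a steering command into counter values; a minimal sketch of that mapping is below, assuming a typical hobby-servo pulse range (1.0–2.0 ms at 50 Hz) and a 100 MHz PWM clock. These parameters are illustrative and not the team's actual settings.

```python
def servo_pwm_compare(angle_deg: float, clk_hz: int = 100_000_000,
                      period_ms: float = 20.0):
    """Map a steering angle in [-90, 90] degrees to (period, compare) counts
    for a PWM counter driving a typical hobby servo (assumed parameters)."""
    angle_deg = max(-90.0, min(90.0, angle_deg))
    pulse_ms = 1.5 + (angle_deg / 90.0) * 0.5        # 1.0 .. 2.0 ms pulse
    period_counts = int(clk_hz * period_ms / 1000.0)
    compare_counts = int(clk_hz * pulse_ms / 1000.0)
    return period_counts, compare_counts

print(servo_pwm_compare(30.0))   # counts the PWM counter would use
```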
Citations: 4