
2021 International Conference on Field-Programmable Technology (ICFPT): Latest Papers

FLOWER: A comprehensive dataflow compiler for high-level synthesis
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609930
Puya Amiri, Arsène Pérard-Gayot, Richard Membarth, P. Slusallek, Roland Leißa, Sebastian Hack
FPGAs have found their way into data centers as accelerator cards, making reconfigurable computing more accessible for high-performance applications. At the same time, new high-level synthesis compilers like Xilinx Vitis and runtime libraries such as XRT attract software programmers into the reconfigurable domain. While software programmers are familiar with task-level and data-parallel programming, FPGAs often require different types of parallelism. For example, data-driven parallelism is mandatory to obtain satisfactory hardware designs for pipelined dataflow architectures. However, software programmers are often not acquainted with dataflow architectures, which results in poor hardware designs. In this work, we present FLOWER, a comprehensive compiler infrastructure that provides automatic canonical transformations for high-level synthesis from a domain-specific library. This allows programmers to focus on algorithm implementations rather than low-level optimizations for dataflow architectures. We show that FLOWER makes it possible to synthesize efficient implementations for high-performance streaming applications targeting System-on-Chip and FPGA accelerator cards, in the context of image processing and computer vision.
Citations: 2
Energy-efficient FPGA-accelerated LiDAR-based SLAM for embedded robotics
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609934
M. Flottmann, Marc Eisoldt, Julian Gaal, Marc Rothmann, M. Tassemeier, T. Wiemann, Mario Porrmann
SLAM (Simultaneous Localization and Mapping) is one of the fundamental problems in autonomous robotics, and SLAM algorithms have gained a lot of attention. Although numerous approaches have been presented for determining 6D poses in 3D environments, one of the main remaining challenges is the required combination of real-time processing and high energy efficiency. In this paper, a combination of CPU and FPGA processing on a reconfigurable SoC is used to tackle this problem. We present a complete solution for embedded LiDAR-based SLAM that uses a global Truncated Signed Distance Function (TSDF) as map representation. A hardware-in-the-loop environment with ROS integration enables efficient evaluation of new variants of algorithms and implementations. Based on benchmark data sets and real-world environments, we show that our approach compares well to established SLAM algorithms. Compared to a software implementation on a state-of-the-art PC, the proposed implementation achieves a 7-fold speed-up and requires 18 times less energy when using a Xilinx UltraScale+ XCZU15EG.
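As a rough illustration of the map representation named in this abstract, the sketch below computes a truncated signed distance for sample points along a single ray and merges two observations with a running weighted average. The function names, the truncation distance `tau`, and the weighted-average update are assumptions chosen for illustration; the abstract does not describe how the paper's TSDF is computed or updated.

```python
def tsdf_value(voxel_depth, measured_depth, tau):
    """Signed distance from a voxel to the measured surface along one ray,
    clamped to [-tau, tau]. Positive values lie in front of the surface."""
    sdf = measured_depth - voxel_depth
    return max(-tau, min(tau, sdf))


def fuse(old_value, old_weight, new_value, new_weight=1.0):
    """Merge a new observation into a voxel with a running weighted average
    (an assumed update rule, not taken from the paper)."""
    weight = old_weight + new_weight
    value = (old_value * old_weight + new_value * new_weight) / weight
    return value, weight


# Hypothetical sample depths along one ray; surface measured at 1.0 m, tau = 0.3 m.
voxels = [0.5, 0.8, 1.0, 1.1, 1.6]
values = [tsdf_value(v, 1.0, 0.3) for v in voxels]
```

Voxels far in front of or behind the surface saturate at `tau` and `-tau`, so only a narrow band around the surface carries distance information.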
Citations: 8
An autonomous driving system utilizing image processing accelerated by FPGA
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609937
Kazunari Takasaki, Kota Hisafuru, Ryotaro Negishi, Kazuki Yamashita, Keisuke Fukada, Tomoya Wakaizumi, N. Togawa
This paper presents an autonomous driving system utilizing FPGA-based image processing. We develop a robot whose system is implemented on an Ultra96-V2, a board with programmable logic and a processing system. We use ROS, a middleware framework for developing robots, to manage the system, including controlling hardware devices, localization, and determining the direction of travel. We implement a neural network on the board's programmable logic to detect road markings. The robot drives autonomously along the specified route on a miniature road, recognizing edge lines and road markings.
Citations: 0
Dataflow Systolic Array Implementations of Exploring Dual-Triangular Structure in QR Decomposition Using High-Level Synthesis
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609814
Siyang Jiang, Hsi-Wen Chen, Ming-Syan Chen
Tall-and-skinny QR (TSQR) decomposition is an essential matrix operation with various applications in edge computing, including data compression, subspace projection, and dimension reduction. Dual-Triangular QR (DTQR) decomposition is a critical component of TSQR, yet most works solve it with the normal QR method without utilizing the dual-triangular structure. Therefore, we propose a novel DTQR accelerator that recursively explores the DT structure, together with three systolic-array acceleration strategies that achieve higher parallelism. Experimental results show that our algorithm achieves an average speedup of 21.55x over the baselines.
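To make the benefit of the dual-triangular structure concrete, here is a minimal sketch that computes the R factor of two stacked upper-triangular blocks with Givens rotations, touching only entries that can be nonzero. This is a textbook-style illustration under my own assumptions; it is not the recursive algorithm or the systolic-array design from the paper, and the function name `dtqr_r` is hypothetical.

```python
import math


def dtqr_r(r1, r2):
    """Return the R factor of the stacked matrix [r1; r2], where r1 and r2
    are n x n upper-triangular matrices given as lists of rows."""
    n = len(r1)
    top = [row[:] for row in r1]
    bottom = [row[:] for row in r2]
    for j in range(n):              # column to clear in the bottom block
        for i in range(j + 1):      # only rows 0..j of the bottom block can be nonzero here
            a, b = top[j][j], bottom[i][j]
            if b == 0.0:
                continue
            h = math.hypot(a, b)
            c, s = a / h, b / h
            for k in range(j, n):   # columns left of j are already zero in both rows
                t, u = top[j][k], bottom[i][k]
                top[j][k] = c * t + s * u
                bottom[i][k] = -s * t + c * u
    return top
```

Because both rows involved in each rotation are already zero to the left of column `j`, no fill-in occurs and the top block stays upper triangular. A general-purpose QR routine applied to the stacked matrix would not exploit those known zeros.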
Citations: 2
Development of Autonomous Driving System based on Image Recognition using Programmable SoCs
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609811
Ryohei Yamamoto, Yuki Izumi, Ryo Aono, Takumi Nagahara, Tomonari Tanaka, Wang Liao, Y. Mitsuyama
We design and implement an autonomous driving system based on image recognition using programmable SoCs. The proposed system is equipped with two FPGA boards and three cameras. One FPGA board implements a driving control system, and the other implements object detection and recognition using machine learning algorithms. Driving control is based on road edge line detection and road marking recognition using Canny edge detection. Traffic lights are detected and recognized using the random forest method with HOG features. Within the development framework of the Zynq 7000 programmable SoC, we adopt hardware/software co-design to balance design time against the system performance required for real-time processing.
Citations: 2
Zytlebot : FPGA integrated ros-based autonomous mobile robot
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609883
Ryota Miyagi, N. Takagi, Sho Kinoshista, M. Oda, Hideki Takase
The FPT 2021 Design Competition aims to improve the technology of utilizing FPGAs and to achieve level-5 autonomous driving. For the competition, we developed ZytleBot, an FPGA-integrated, ROS-based autonomous mobile robot. ZytleBot collects environmental information with CMOS cameras, recognizes its environment, decides its actions on a programmable SoC, and controls its actuators. As a result, ZytleBot can drive on model road courses and detect and appropriately handle traffic lights and obstacles. We used the robot development platform TurtleBot3 and the robot middleware ROS to develop the robot system quickly. In addition, we use the FPGA to accelerate road image processing and traffic light recognition based on HOG features and an SVM classifier. As a result, traffic light recognition with the FPGA is 270 times faster than with the CPU alone.
Citations: 2
Real-time Implementation of Cyclostationary Analysis using FPGAs
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609911
Jingyi Li
Cyclostationary analysis is an important tool for understanding periodic phenomena, and the spectral correlation density (SCD) function is commonly used in its characterisation. Due to its high computational requirements, it is not commonly applied to real-time signals, despite the fact that efficient FFT-based techniques for estimating the SCD exist. In this research, we aim to address this issue by developing high-performance cyclostationary analysis techniques through FPGA acceleration, and to apply them to enable new applications. We will first explore the tradeoff between arithmetic precision and implementation area, applying statistics-based analysis techniques to understand how the signal-to-quantisation-noise ratio is affected by wordlength in fixed-point and floating-point implementations. Next, high-speed FPGA-based systolic architectures for estimating the SCD will be studied. Finally, we apply our optimised arithmetic and architectures to real-time radio frequency applications.
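The wordlength tradeoff mentioned in this abstract can be illustrated with a small experiment: round a full-scale sine wave to a fixed-point grid and measure the resulting signal-to-quantisation-noise ratio. The sketch below is only an illustration; the test signal, its frequency, the sample count, and the function name `sqnr_db` are assumptions, and saturation is not modelled. For a full-scale sinusoid with uniform rounding, the familiar rule of thumb predicts roughly 6.02 dB per bit plus 1.76 dB.

```python
import math


def sqnr_db(wordlength, samples=4096):
    """Measure the SQNR in dB of a full-scale sine rounded to a uniform grid
    whose step matches a signed fixed-point format covering [-1, 1)."""
    step = 2.0 / (2 ** wordlength)
    signal_power = 0.0
    noise_power = 0.0
    for n in range(samples):
        # Normalised frequency picked arbitrarily so samples do not repeat.
        x = math.sin(2 * math.pi * 0.1234567 * n)
        q = round(x / step) * step      # round to the nearest grid point
        signal_power += x * x
        noise_power += (x - q) ** 2
    return 10 * math.log10(signal_power / noise_power)


def rule_of_thumb_db(wordlength):
    """Textbook estimate for a full-scale sinusoid."""
    return 6.02 * wordlength + 1.76
```

Comparing `sqnr_db(b)` with `rule_of_thumb_db(b)` for a few wordlengths shows how each extra bit buys about 6 dB, which is the kind of relationship that has to be weighed against implementation area.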
Citations: 0
A High-Performance and Flexible FPGA Inference Accelerator for Decision Forests Based on Prior Feature Space Partitioning
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609699
Thiem Van Chu, Ryuichi Kitajima, Kazushi Kawamura, Jaehoon Yu, M. Motomura
Recent studies have demonstrated the potential of FPGAs for accelerating the inference computation of decision forests (DFs). However, designing a high-performance architecture that is flexible enough to be adopted in various scenarios of FPGA resource requirements remains a challenge. To address this, we propose a DF inference method that transforms tree traversal into feature-space traversal. Specifically, as a preprocessing step, we partition each feature space into multiple regions based on thresholds. The inference task for an input data point is then conducted by (1) determining which region in each feature space the data point belongs to and (2) combining the inference information in these regions. The regularity of the computation allows us to design a DF inference architecture, called FT-DFP (Feature-space Traversing Decision Forest Processor), that can be flexibly configured for different performance and FPGA resource usage requirements. We prototype FT-DFP on a low-end FPGA (Artix-7) board and evaluate it using four real-world datasets. The evaluation results show that (1) the flexibility of FT-DFP allows us to fit a wide variety of DF models into low-end FPGA devices with limited resources; (2) FT-DFP's performance is comparable to the best of existing accelerators implemented on high-end FPGA devices and 3.04 × higher than Hummingbird, a state-of-the-art GPU-optimized implementation, running on a high-end GPU; and (3) FT-DFP is 130.96 × more energy-efficient than Hummingbird.
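The two inference steps described in this abstract can be sketched in software as follows. The tuple encoding of tree nodes, the function names, and in particular the way step (2) is carried out here (resolving each split from integer region indices and summing leaf values) are assumptions made for illustration; the abstract does not specify how FT-DFP combines per-region information.

```python
from bisect import bisect_left

# Hypothetical node encoding: ("leaf", value) or
# ("split", feature, threshold, left, right), going left when x[feature] <= threshold.


def collect_thresholds(forest, num_features):
    """Preprocessing: sorted, de-duplicated split thresholds per feature."""
    found = [set() for _ in range(num_features)]

    def walk(node):
        if node[0] == "split":
            _, feature, threshold, left, right = node
            found[feature].add(threshold)
            walk(left)
            walk(right)

    for tree in forest:
        walk(tree)
    return [sorted(s) for s in found]


def regions_of(x, thresholds):
    """Step (1): index of the region each feature value falls into."""
    return [bisect_left(ts, v) for ts, v in zip(thresholds, x)]


def predict(forest, thresholds, x):
    """Step (2), simplified: decide every split from region indices alone
    and sum the leaf values reached in each tree."""
    region = regions_of(x, thresholds)
    total = 0.0
    for node in forest:
        while node[0] == "split":
            _, feature, threshold, left, right = node
            rank = bisect_left(thresholds[feature], threshold)  # could be precomputed per node
            node = left if region[feature] <= rank else right
        total += node[1]
    return total
```

Once the region indices are known, no floating-point comparison against the input is needed again: `x[f] <= t` holds exactly when the region index of feature `f` is at most the rank of `t` among that feature's sorted thresholds.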
Citations: 2
Journal Track Papers
Pub Date : 2021-12-06 DOI: 10.1109/icfpt52863.2021.9609949
Citations: 0
High performance lattice regression on FPGAs via a high level hardware description language
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609893
Nathan Zhang, Matthew Feldman, K. Olukotun
Lattice regression-based models are highly constrainable and interpretable machine learning models used in applications such as query classification and path length prediction for maps. To improve their performance and better serve these models to millions of consumers, we accelerate them using field programmable gate arrays. We adopt a library-based approach using a high level hardware description language (HLHDL) to support the broad family of lattice models. HLHDLs improve productivity by providing control abstractions such as looping, reductions, and memory hierarchies, as well as by automatically handling low-level tasks such as retiming. However, these abstractions can lead to performance bottlenecks if not carefully used. We characterize these bottlenecks and implement a lattice regression library using a streaming tensor abstraction which avoids them. On a pair of models trained for network anomaly detection, we achieve a $166\text{--}256\times$ speedup over CPUs even with large batch sizes.
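For readers unfamiliar with lattice models, the core inference step is commonly described as multilinear interpolation over a grid of learned vertex values. The sketch below shows that step in plain Python; it is a generic illustration under that assumption, not the paper's HLHDL library or its streaming tensor abstraction, and the function name and the dictionary-based parameter layout are hypothetical.

```python
from itertools import product


def lattice_predict(params, sizes, x):
    """Multilinear interpolation of a lookup table at point x.

    params maps a vertex index tuple to its learned value, sizes[d] is the
    number of vertices along dimension d (at least 2), and x[d] is assumed
    to lie in [0, sizes[d] - 1].
    """
    base, frac = [], []
    for value, size in zip(x, sizes):
        cell = min(int(value), size - 2)   # lower corner of the enclosing cell
        base.append(cell)
        frac.append(value - cell)
    result = 0.0
    for corner in product((0, 1), repeat=len(x)):   # the 2^D surrounding vertices
        weight = 1.0
        for bit, f in zip(corner, frac):
            weight *= f if bit else 1.0 - f
        vertex = tuple(b + bit for b, bit in zip(base, corner))
        result += weight * params[vertex]
    return result
```

Each prediction reads 2^D vertices and performs a fixed pattern of multiplies and adds, which hints at why such models are regular enough to map onto hardware pipelines.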
Citations: 0