Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609930
Puya Amiri, Arsène Pérard-Gayot, Richard Membarth, P. Slusallek, Roland Leißa, Sebastian Hack
FPGAs have found their way into data centers as accelerator cards, making reconfigurable computing more accessible for high-performance applications. At the same time, new high-level synthesis compilers like Xilinx Vitis and runtime libraries such as XRT attract software programmers into the reconfigurable domain. While software programmers are familiar with task-level and data-parallel programming, FPGAs often require different types of parallelism. For example, data-driven parallelism is mandatory to obtain satisfactory hardware designs for pipelined dataflow architectures. However, software programmers are often not acquainted with dataflow architectures, which results in poor hardware designs. In this work we present FLOWER, a comprehensive compiler infrastructure that provides automatic canonical transformations for high-level synthesis from a domain-specific library. This allows programmers to focus on algorithm implementations rather than low-level optimizations for dataflow architectures. We show that FLOWER makes it possible to synthesize efficient implementations for high-performance streaming applications targeting System-on-Chip and FPGA accelerator cards, in the context of image processing and computer vision.
Title : FLOWER: A comprehensive dataflow compiler for high-level synthesis
Venue : 2021 International Conference on Field-Programmable Technology (ICFPT)
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609934
M. Flottmann, Marc Eisoldt, Julian Gaal, Marc Rothmann, M. Tassemeier, T. Wiemann, Mario Porrmann
SLAM (Simultaneous Localization and Mapping) is one of the fundamental problems in autonomous robotics, and SLAM algorithms have gained a lot of attention. Although numerous approaches have been presented for determining 6D poses in 3D environments, one of the main challenges that remains is the required combination of real-time processing and high energy efficiency. In this paper, a combination of CPU and FPGA processing is used to tackle this problem, utilizing a reconfigurable SoC. We present a complete solution for embedded LiDAR-based SLAM that uses a global Truncated Signed Distance Function (TSDF) as map representation. A hardware-in-the-loop environment with ROS integration enables efficient evaluation of new variants of algorithms and implementations. Based on benchmark data sets and real-world environments, we show that our approach compares well to established SLAM algorithms. Compared to a software implementation on a state-of-the-art PC, the proposed implementation achieves a 7-fold speed-up and requires 18 times less energy when using a Xilinx UltraScale+ XCZU15EG.
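To make the TSDF map representation concrete, here is a minimal sketch of how a truncated signed distance value could be integrated along a single LiDAR ray. It is a generic, textbook-style TSDF update on a 1-D grid and not the implementation described in the paper; the function name `integrate_ray`, the running-average weighting, and all parameters are assumptions chosen for illustration.

```python
def integrate_ray(tsdf, weight, cell_size, measured_range, tau):
    """Fuse one range measurement into a 1-D TSDF grid laid out along the ray.

    tsdf, weight   : per-cell lists, updated in place (hypothetical layout)
    cell_size      : edge length of one cell
    measured_range : distance from the sensor to the surface hit by the ray
    tau            : truncation distance
    """
    for i in range(len(tsdf)):
        dist = (i + 0.5) * cell_size      # distance of the cell centre from the sensor
        sdf = measured_range - dist       # positive in front of the surface, negative behind
        if sdf < -tau:
            break                         # too far behind the surface: treated as unobserved
        value = min(sdf, tau)             # truncate distances in front of the surface
        # Running average over all measurements that touched this cell.
        tsdf[i] = (tsdf[i] * weight[i] + value) / (weight[i] + 1)
        weight[i] += 1


tsdf = [0.0] * 6
weight = [0] * 6
integrate_ray(tsdf, weight, cell_size=1.0, measured_range=3.0, tau=1.0)
print(tsdf)  # the sign change between cells 2 and 3 marks the surface
```

The surface is recovered as the zero crossing of the stored values, which is why only cells within the truncation band around a measurement need to be touched.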
Title : Energy-efficient FPGA-accelerated LiDAR-based SLAM for embedded robotics
Venue : 2021 International Conference on Field-Programmable Technology (ICFPT)
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609937
Kazunari Takasaki, Kota Hisafuru, Ryotaro Negishi, Kazuki Yamashita, Keisuke Fukada, Tomoya Wakaizumi, N. Togawa
This paper presents an autonomous driving system utilizing FPGA-based image processing. We develop a robot on which our system is implemented using the Ultra96-V2, a board with programmable logic and a processing system. We use ROS, a middleware framework for developing robots, to manage the system, including controlling hardware devices, localization, and determining the direction of travel. We implement a neural network that detects road markings in the programmable logic on the board. The robot running our system drives autonomously along a specified route on a miniature road, recognizing edge lines and road markings.
Title : An autonomous driving system utilizing image processing accelerated by FPGA
Venue : 2021 International Conference on Field-Programmable Technology (ICFPT)
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609814
Siyang Jiang, Hsi-Wen Chen, Ming-Syan Chen
Tall and skinny QR (TSQR) decomposition is an essential matrix operation with various applications in edge computing, including data compression, subspace projection, and dimension reduction. As a critical component in TSQR, Dual-Triangular QR (DTQR) decomposition is solved by the Normal QR method in most works without utilizing the dual-triangular structure. Therefore, we propose a novel DTQR accelerator by recursively exploring the DT structure and propose three acceleration strategies with the systolic array to achieve higher parallelism. Experimental results show that our algorithm achieves an average speedup of 21.55x over the baselines.
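The relationship between TSQR and the dual-triangular step can be sketched in a few lines. The example below splits a tall matrix into row blocks, factors each block, and then factors the stacked triangular factors; with two blocks, that stacked matrix is exactly the dual-triangular input mentioned above. This is a plain software illustration under stated assumptions, not the accelerator from the paper: the function names are hypothetical, modified Gram-Schmidt stands in for whatever QR kernel the hardware uses, and the second step deliberately does not exploit the triangular structure.

```python
from math import sqrt


def qr_r_factor(a):
    """Return the upper-triangular R of a = QR using modified Gram-Schmidt.

    `a` is a list of rows with full column rank; R has a positive diagonal.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]  # work column-wise
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = sqrt(sum(v * v for v in cols[j]))
        q = [v / r[j][j] for v in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(qv * cv for qv, cv in zip(q, cols[k]))
            cols[k] = [cv - r[j][k] * qv for qv, cv in zip(q, cols[k])]
    return r


def tsqr_r_factor(a, block_rows):
    """R factor of a tall matrix via per-block QR followed by QR of the stacked Rs."""
    blocks = [a[i:i + block_rows] for i in range(0, len(a), block_rows)]
    # Each block yields a small upper-triangular factor; stacking two of them
    # gives a dual-triangular matrix.
    stacked = [row for block in blocks for row in qr_r_factor(block)]
    return qr_r_factor(stacked)


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0],
     [2.0, 1.0], [0.0, 3.0], [1.0, 1.0], [4.0, 0.0]]
print(tsqr_r_factor(a, block_rows=4))
print(qr_r_factor(a))  # same R up to rounding
```

Because the per-block factorizations are independent, they can run in parallel; only the final dual-triangular step combines their results.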
Title : Dataflow Systolic Array Implementations of Exploring Dual-Triangular Structure in QR Decomposition Using High-Level Synthesis
Venue : 2021 International Conference on Field-Programmable Technology (ICFPT)
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609811
Ryohei Yamamoto, Yuki Izumi, Ryo Aono, Takumi Nagahara, Tomonari Tanaka, Wang Liao, Y. Mitsuyama
We design and implement an autonomous driving system based on image recognition using programmable SoCs. The proposed system is equipped with two FPGA boards and three cameras. One FPGA board implements a driving control system, and the other FPGA board implements object detection and recognition using machine learning algorithms. Driving control is performed based on road edge line detection and road marking recognition using Canny edge detection. Detection and recognition of traffic lights, in turn, are implemented using the random forest method with HOG features. Within the development framework of the Zynq 7000 programmable SoC, we adopt hardware/software co-design to balance the design period against the system performance required for real-time processing.
Title : Development of Autonomous Driving System based on Image Recognition using Programmable SoCs
Venue : 2021 International Conference on Field-Programmable Technology (ICFPT)
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609883
Ryota Miyagi, N. Takagi, Sho Kinoshista, M. Oda, Hideki Takase
The FPT 2021 Design Competition aims to improve FPGA utilization technology and achieve level-5 autonomous driving. We developed ZytleBot, an FPGA-integrated, ROS-based autonomous mobile robot, for the competition. ZytleBot collects environmental information with CMOS cameras, recognizes its environment, decides its actions on a programmable SoC, and controls its actuator. As a result, ZytleBot can run road model courses and detect and adequately deal with traffic lights and obstacles. We used the robot development platform TurtleBot3 and the robot middleware ROS to develop the robot system quickly. In addition, we use the FPGA to accelerate road-image processing and traffic light recognition using HOG features and an SVM classifier. As a result, traffic light recognition with the FPGA is 270 times faster than with the CPU alone.
Title : Zytlebot : FPGA integrated ros-based autonomous mobile robot
Venue : 2021 International Conference on Field-Programmable Technology (ICFPT)
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609911
Jingyi Li
Cyclostationary analysis is an important tool for understanding periodic phenomena, and the spectral correlation density (SCD) function is commonly used to characterise them. Due to its high computational requirements, it is not commonly applied to real-time signals, despite the fact that efficient FFT-based techniques for estimating the SCD exist. In this research, we aim to address this issue by developing high-performance cyclostationary analysis techniques through FPGA acceleration, and to apply them to enable new applications. We will first explore the tradeoff between arithmetic precision and implementation area, applying statistics-based analysis techniques to understand how the signal-to-quantisation-noise ratio is affected by wordlength in fixed and floating-point implementations. Next, high-speed FPGA-based systolic architectures for estimating the SCD will be studied. Finally, we apply our optimised arithmetic and architectures to real-time radio frequency applications.
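As a rough illustration of what an SCD estimate is built from, the sketch below computes an unsmoothed cyclic periodogram: the correlation between pairs of spectral bins separated by a cycle-frequency offset. It is not the estimator studied in this research; practical estimators add averaging or smoothing, and an FFT would replace the naive DFT used here to keep the example dependency-free. The function names and the asymmetric bin indexing are assumptions for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def cyclic_periodogram(x):
    """Return scd[a][k] = X[k] * conj(X[(k - a) mod N]) / N.

    Row a = 0 is the ordinary periodogram; other rows measure correlation
    between spectral bins that lie `a` bins apart.
    """
    n = len(x)
    spec = dft(x)
    return [[spec[k] * spec[(k - a) % n].conjugate() / n for k in range(n)]
            for a in range(n)]


n = 16
x = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
scd = cyclic_periodogram(x)
print(abs(scd[0][3]))    # power at bin 3
print(abs(scd[10][13]))  # correlation between bins 13 and 3
```

The full table has N x N entries, each needing a complex multiply, which gives a sense of why the computation is costly for real-time signals.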
Title : Real-time Implementation of Cyclostationary Analysis using FPGAs
Venue : 2021 International Conference on Field-Programmable Technology (ICFPT)
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609699
Thiem Van Chu, Ryuichi Kitajima, Kazushi Kawamura, Jaehoon Yu, M. Motomura
Recent studies have demonstrated the potential of FPGAs for accelerating the inference computation of decision forests (DFs). However, designing a high-performance architecture that is flexible enough to be adopted in various scenarios of FPGA resource requirements remains a challenge. To address this, we propose a DF inference method that makes a transformation from traversing trees into traversing feature spaces. Specifically, as a preprocessing step, we partition each feature space into multiple regions based on thresholds. The inference task for an input data point is then conducted by (1) determining which region in each feature space the data point belongs to and (2) combining the inference information in these regions. The regularity of the computation allows us to design a DF inference architecture, called FT-DFP (Feature-space Traversing Decision Forest Processor), that can be flexibly configured for different performance and FPGA resource usage requirements. We prototype FT-DFP on a low-end FPGA (Artix-7) board and evaluate it using four real-world datasets. The evaluation results show that (1) the flexibility of FT-DFP allows us to fit a wide variety of DF models into low-end FPGA devices with limited resources; (2) FT-DFP's performance is comparable to the best of existing accelerators implemented on high-end FPGA devices and 3.04 × higher than Hummingbird, a state-of-the-art GPU-optimized implementation, running on a high-end GPU; and (3) FT-DFP is 130.96 × more energy-efficient than Hummingbird.
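The two inference steps described above can be sketched in software as follows. Thresholds are collected per feature as a preprocessing step, each region of a feature space gets a bitmask of the leaves that remain reachable inside it, and inference intersects one mask per feature. The bitmask encoding, the tuple-based tree format, and all function names are assumptions made for this illustration; the abstract does not state how FT-DFP stores or combines the per-region information.

```python
from bisect import bisect_left

# Hypothetical tree format: ("leaf", value) or ("split", feature, threshold, left, right).
# An input goes left when x[feature] <= threshold.
FOREST = [
    ("split", 0, 0.5, ("leaf", 1.0), ("split", 1, 2.0, ("leaf", 2.0), ("leaf", 3.0))),
    ("split", 1, 1.0, ("leaf", -1.0), ("split", 0, 1.5, ("leaf", 0.5), ("leaf", 4.0))),
]


def collect_thresholds(forest, n_features):
    """Preprocessing: sorted unique thresholds per feature define its regions."""
    found = [set() for _ in range(n_features)]
    stack = list(forest)
    while stack:
        node = stack.pop()
        if node[0] == "split":
            found[node[1]].add(node[2])
            stack += [node[3], node[4]]
    return [sorted(s) for s in found]


def build_region_masks(forest, thresholds):
    """For every (feature, region), record which leaves are compatible with it."""
    n_regions = [len(ts) + 1 for ts in thresholds]
    masks = [[0] * n for n in n_regions]
    leaf_values = []
    stack = [(tree, [(0, n - 1) for n in n_regions]) for tree in forest]
    while stack:
        node, bounds = stack.pop()
        if node[0] == "leaf":
            bit = 1 << len(leaf_values)
            leaf_values.append(node[1])
            for f, (lo, hi) in enumerate(bounds):
                for r in range(lo, hi + 1):
                    masks[f][r] |= bit
            continue
        _, f, t, left, right = node
        k = bisect_left(thresholds[f], t)  # region index where x[f] <= t stops holding
        lo, hi = bounds[f]
        left_bounds, right_bounds = list(bounds), list(bounds)
        left_bounds[f] = (lo, min(hi, k))
        right_bounds[f] = (max(lo, k + 1), hi)
        stack += [(left, left_bounds), (right, right_bounds)]
    return masks, leaf_values


def predict(masks, leaf_values, thresholds, x):
    alive = (1 << len(leaf_values)) - 1
    for f, ts in enumerate(thresholds):
        region = bisect_left(ts, x[f])  # step 1: locate the region in this feature space
        alive &= masks[f][region]       # step 2: combine the per-region information
    return sum(v for i, v in enumerate(leaf_values) if (alive >> i) & 1)


THRESHOLDS = collect_thresholds(FOREST, n_features=2)
MASKS, LEAF_VALUES = build_region_masks(FOREST, THRESHOLDS)
print(predict(MASKS, LEAF_VALUES, THRESHOLDS, (1.0, 1.5)))
```

Every input performs the same sequence of lookups regardless of tree shape, which is the kind of regularity that makes a computation easy to pipeline.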
Title : A High-Performance and Flexible FPGA Inference Accelerator for Decision Forests Based on Prior Feature Space Partitioning
Venue : 2021 International Conference on Field-Programmable Technology (ICFPT)
Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609893
Nathan Zhang, Matthew Feldman, K. Olukotun
Lattice regression-based models are highly-constrainable and interpretable machine learning models used in applications such as query classification and path length prediction for maps. To improve their performance and better serve these models to millions of consumers, we accelerate them using field programmable gate arrays. We adopt a library-based approach using a high-level hardware description language (HLHDL) to support the broad family of lattice models. HLHDLs improve productivity by providing both control abstraction such as looping, reductions, and memory hierarchies, as well as automatically handling low-level tasks such as retiming. However, these abstractions can lead to performance bottlenecks if not carefully used. We characterize these bottlenecks and implement a lattice regression library using a streaming tensor abstraction which avoids them. On a pair of models trained for network anomaly detection, we achieve a $166$-$256\times$ speedup over CPUs even with large batch sizes.
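For readers unfamiliar with lattice models, the core evaluation step is a look-up table with multilinear interpolation between learned vertex values. The sketch below shows that step for an arbitrary number of dimensions. It illustrates the general technique only, not the library described in the paper; the function name, the dictionary-based lattice layout, and the clamping behaviour are assumptions.

```python
from itertools import product
from math import floor


def lattice_interpolate(lattice, sizes, x):
    """Multilinear interpolation over a regular lattice.

    lattice : dict mapping a vertex index tuple to its (learned) value
    sizes   : number of vertices along each dimension (each at least 2)
    x       : input coordinates, expressed in lattice units
    """
    base, frac = [], []
    for xi, n in zip(x, sizes):
        xi = min(max(xi, 0.0), n - 1)   # clamp the input into the lattice
        b = min(int(floor(xi)), n - 2)  # lower corner of the enclosing cell
        base.append(b)
        frac.append(xi - b)
    out = 0.0
    # Visit the 2^d corners of the enclosing cell and blend their values.
    for corner in product((0, 1), repeat=len(x)):
        w = 1.0
        for c, fr in zip(corner, frac):
            w *= fr if c else 1.0 - fr
        out += w * lattice[tuple(b + c for b, c in zip(base, corner))]
    return out


lattice = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 3.0}
print(lattice_interpolate(lattice, sizes=(2, 2), x=(0.5, 0.5)))
```

Each prediction touches 2^d vertices for d input dimensions, so the work per input is fixed and known in advance.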
Title : High performance lattice regression on FPGAs via a high level hardware description language
Venue : 2021 International Conference on Field-Programmable Technology (ICFPT)