2021 International Conference on Field-Programmable Technology (ICFPT): Latest Articles

Efficient Stride 2 Winograd Convolution Method Using Unified Transformation Matrices on FPGA

2021 International Conference on Field-Programmable Technology (ICFPT)

Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609907

Chengcheng Huang, Xiaoxiao Dong, Zhao Li, Tengteng Song, Zhenguo Liu, Lele Dong

The Winograd algorithm can effectively reduce the computational complexity of convolution, and exploiting the parallelism of Winograd convolution can improve the performance of accelerator architectures on FPGA. The stride is the number of elements by which the window slides as the filter scans the input feature map. Previous stride-2 Winograd implementations divided the input feature maps into multiple groups, each handled by its own Winograd algorithm, which results in additional precomputation and hardware resource overhead. In this paper, we propose a new Winograd convolution algorithm with a stride of 2. The method uses unified Winograd transformation matrices instead of the grouping method to complete the calculation. As a result, it can realize 2D and 3D Winograd convolution by nesting 1D Winograd convolution, just like the stride-1 Winograd convolution algorithm. We provide Winograd transformation matrices for kernel sizes of 3, 5, and 7. In particular, for convolution with a kernel size of 3, the method reduces the addition operations of the Winograd algorithm by 30.0%-31.5% and completely removes unnecessary shift operations. In addition, we implement the stride-2 Winograd convolution algorithm through template design, with pipelining and data reuse. Compared to the state-of-the-art implementation, the proposed method achieves a speedup of 1.24x and reduces resource usage.
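
For readers unfamiliar with the nesting idea mentioned in the abstract, the following is a minimal sketch of the standard stride-1 F(2x2, 3x3) Winograd tile, built by applying the 1D F(2,3) transforms along rows and columns. It uses the widely published stride-1 matrices, not the paper's unified stride-2 matrices, which the abstract does not give. The helper names (`matmul`, `winograd_2d`, `direct_2d`) are illustrative assumptions. The direct routine also shows what a stride of 2 means for the sliding window.

```python
from fractions import Fraction

H = Fraction(1, 2)

# Standard stride-1 F(2,3) Winograd matrices (assumption: NOT the paper's
# unified stride-2 matrices, which are not listed in the abstract).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [H, H, H], [H, -H, H], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def winograd_2d(d, g):
    """4x4 input tile d and 3x3 kernel g -> 2x2 output tile (stride 1)."""
    u = matmul(matmul(G, g), transpose(G))      # kernel transform  G g G^T
    v = matmul(matmul(BT, d), transpose(BT))    # input transform   B^T d B
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]  # 16 multiplies
    return matmul(matmul(AT, m), transpose(AT))  # output transform A^T m A


def direct_2d(d, g, stride=1):
    """Reference sliding-window convolution (CNN-style, no kernel flip)."""
    k = len(g)
    n = (len(d) - k) // stride + 1
    return [[sum(d[i * stride + a][j * stride + b] * g[a][b]
                 for a in range(k) for b in range(k))
             for j in range(n)] for i in range(n)]


tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(winograd_2d(tile, kernel) == direct_2d(tile, kernel))
```

The tile above needs 16 element-wise multiplications where the direct method needs 36, which is the saving the Winograd approach is known for at stride 1. Calling `direct_2d` with `stride=2` on a 5x5 input yields a 2x2 output, since the window moves two elements per step.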

Citations: 2

FPGAs as General-Purpose Accelerators for Non-Experts via HLS: The Graph Analysis Example

2021 International Conference on Field-Programmable Technology (ICFPT)

Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609832

P. Silva, João Bispo, N. Paulino

We discuss the concept of FPGA-unfriendliness, the property of certain algorithms, programs, or domains which may limit their applicability to FPGAs. Specifically, we look at graph analysis, which has recently seen increased interest in combination with High-Level Synthesis, but has yet to find great success compared to established acceleration mechanisms. To this end, we make use of Xilinx's Vitis Graph Library to implement Single-Source Shortest Paths (SSSP) and PageRank (PR), and present a custom kernel written from the ground up for Distinctiveness Centrality (DC, a novel graph centrality measure). We use public datasets to test these implementations, and analyse power consumption and execution time. Our comparisons against published data for GPU and CPU execution show FPGA slowdowns in execution time between around 18.5x and 328x for SSSP, and around 1.8x and 195x for PR, respectively. In some instances, we obtained FPGA speedups versus CPU of up to 2.5x for PR. Regarding DC, results show speedups from 0.1x to 3.5x, and energy efficiency increases from 0.8x to 6x. Lastly, we provide some insights regarding the applicability of FPGAs in FPGA-unfriendly domains, and comment on the future as FPGA and HLS technology advances.
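
As background for the SSSP workload named in the abstract, here is a minimal textbook sketch of Single-Source Shortest Paths using Dijkstra's algorithm on the CPU. It is not the Vitis Graph Library kernel and not the authors' code. The function name `sssp` and the adjacency-dictionary format are assumptions made for illustration.

```python
import heapq


def sssp(adj, source):
    """Dijkstra's algorithm; adj maps a node to a list of (neighbor, weight).

    Assumes non-negative edge weights. Returns shortest distances for all
    nodes reachable from source.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(sssp(graph, "a"))
```

Note how each step's memory accesses depend on the data: which neighbors are visited next is only known once the current node has been popped from the queue.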

Citations: 1

AMAH-Flex: A Modular and Highly Flexible Tool for Generating Relocatable Systems on FPGAs

2021 International Conference on Field-Programmable Technology (ICFPT)

Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609948

Najdet Charaf, C. Tietz, Michael Raitza, Akash Kumar, D. Göhringer

In this work, we present a solution to a common problem encountered when using FPGAs in dynamic, ever-changing environments. Even when dynamic function exchange is used to accommodate changing workloads, partial bitstreams are typically not relocatable, so the runtime environment needs to store every combination of reconfigurable partition and reconfigurable module as a separate bitstream. We present a modular and highly flexible tool (AMAH-Flex) that converts any static and reconfigurable system into a two-dimensional dynamically relocatable system. It also features a fully automated floorplanning phase, closing the automation gap between synthesis and bitstream relocation. It integrates with the Xilinx Vivado toolchain and supports both the 7-Series and UltraScale+ FPGA architectures. In addition, AMAH-Flex can be ported to any Xilinx FPGA family, starting with the 7-Series. We demonstrate the functionality of our tool in several reconfiguration scenarios on four different FPGA families and show that AMAH-Flex saves up to 80% of partial bitstreams.
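
The storage argument in the abstract can be made concrete with a small counting sketch. It assumes an idealized model that the abstract does not spell out: every module fits every partition, a non-relocatable flow stores one bitstream per (partition, module) pair, and a relocatable flow stores one bitstream per module. The function name and the example numbers are hypothetical.

```python
def bitstream_counts(partitions, modules):
    """Count stored partial bitstreams under an idealized model.

    Assumption: all partitions are compatible with all modules, so a
    relocatable flow needs one bitstream per module.
    """
    non_relocatable = partitions * modules  # one per (partition, module) pair
    relocatable = modules                   # one per module, moved at runtime
    saving = 1 - relocatable / non_relocatable
    return non_relocatable, relocatable, saving


total, reduced, saving = bitstream_counts(partitions=5, modules=4)
print(total, reduced, f"{saving:.0%}")
```

Under this model the saving is 1 - 1/P for P partitions, so five compatible partitions would correspond to an 80% reduction. Whether the paper's reported 80% arises from such a configuration is not stated in the abstract.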

Citations: 2

Efficient Queue-Balancing Switch for FPGAs

2021 International Conference on Field-Programmable Technology (ICFPT)

Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609867

Philippos Papaphilippou, K. Sano, B. Adhi, W. Luk

This paper presents a novel FPGA-based switch design that achieves high algorithmic performance and an efficient FPGA implementation. Crossbar switches based on virtual output queues (VOQs) and variations have been rather popular for implementing switches on FPGAs, with applications to network-on-chip (NoC) routers and network switches. The efficiency of VOQs is well-documented on ASICs, though we show that their disadvantages can outweigh their advantages on FPGAs. Our proposed design uses an output-queued switch internally for simplifying scheduling, and a queue balancing technique to avoid queue fragmentation and reduce the need for memory-sharing VOQs. Our implementation approaches the scheduling performance of the state-of-the-art, while requiring considerably fewer FPGA resources.

Citations: 1

A Hexagon-Based Honeycomb Routing Architecture for FPGA

2021 International Conference on Field-Programmable Technology (ICFPT)

Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609805

Kaichuang Shi, Hao Zhou, Lingli Wang

Field Programmable Gate Arrays (FPGAs) are widely used for their flexibility and short time to market. Routing architecture design is a key problem in FPGA design because routing plays a dominant role in area, delay, and power. Most modern FPGAs are island-style and provide abundant vertical and horizontal tracks to guarantee that circuit designs can be routed successfully. However, most connections in placed netlists are diagonal, so they may pass through extra turning switches, resulting in increased delay and high routing density. In this paper, we propose a hexagon-based honeycomb FPGA routing architecture to improve routability and performance. The honeycomb architecture has three kinds of routing channels, which provide more freedom to reduce the number of turning switches on routing paths. In addition, the router lookahead algorithm is enhanced to support the honeycomb architecture, which is then evaluated using the enhanced VTR with the provided benchmarks. The experimental results show that the honeycomb architecture can improve the minimum routing channel width by 7.7% compared with the traditional rectangular architecture with length-1 wires. In addition, the honeycomb architecture achieves a 9.9% improvement in routed wirelength, 11.5% in critical path delay, and 12.4% in area-delay product.

Citations: 1

Algorithm-Hardware Co-Optimization for Energy-Efficient Drone Detection on Resource-Constrained FPGA

2021 International Conference on Field-Programmable Technology (ICFPT)

Pub Date : 2021-12-06 DOI: 10.1145/3583074

Han-Sok Suh, Jian Meng, Ty Nguyen, S. Venkataramanaiah, Vijay Kumar, Yu Cao, Jae-sun Seo

Convolutional neural network (CNN) based object detection has achieved very high accuracy; for example, single-shot multi-box detectors (SSD) can efficiently detect and localize various objects in an input image. However, such detectors require a large amount of computation and memory storage, which makes it difficult to perform efficient inference on resource-constrained hardware devices such as drones or unmanned aerial vehicles (UAVs). Drone/UAV detection is an important task for applications including surveillance, defense, and multi-drone self-localization and formation control. In this paper, we design and co-optimize the algorithm and hardware for energy-efficient drone detection on resource-constrained FPGA devices. We train the SSD object detection algorithm with a custom drone dataset. For inference, we employ low-precision quantization and adapt the width of the SSD CNN model. To improve throughput, we use dual-data-rate operations for DSPs to effectively double the throughput with limited DSP counts. For different SSD algorithm models, we analyze accuracy, measured as mean average precision (mAP), and evaluate the corresponding FPGA hardware utilization, DRAM communication, and throughput optimization. Our proposed design achieves a high mAP of 88.42% on the multi-drone dataset, with a high energy efficiency of 79 GOPS/W and a throughput of 158 GOPS using a Xilinx Zynq ZU3EG FPGA device on the Open Vision Computer version 3 (OVC3) platform. Our design achieves 2.7X higher energy efficiency than prior works using the same FPGA device, at a low power consumption of 1.98 W.
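
The headline figures in the abstract are related by simple arithmetic, which the short sketch below checks. The throughput, power, and 2.7X ratio are quoted from the abstract. The implied efficiency of prior work is a derived estimate under the assumption that the ratio applies to the reported 79 GOPS/W, and it is not a number stated in the abstract.

```python
def energy_efficiency(throughput_gops, power_w):
    """Energy efficiency in GOPS/W."""
    return throughput_gops / power_w


# Figures quoted in the abstract.
reported = energy_efficiency(throughput_gops=158.0, power_w=1.98)

# Derived estimate only: prior-work efficiency implied by the 2.7X claim.
implied_prior = 79.0 / 2.7

print(f"{reported:.1f} GOPS/W from 158 GOPS at 1.98 W")
print(f"~{implied_prior:.1f} GOPS/W implied for prior work")
```

The computed value of roughly 79.8 GOPS/W is consistent with the 79 GOPS/W reported in the abstract.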

Citations: 2

An FPGA-Based Image Recognition with Remote Update Functions for Autonomous Driving on “ad-refkit”

2021 International Conference on Field-Programmable Technology (ICFPT)

Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609852

Hyuga Hashimoto, Ryoko Naka, Yasutaka Wada

This paper explains the ongoing development of an FPGA-based image recognition system for an autonomous driving robot car for the FPT'21 Design Competition. We are developing an FPGA-based image recognition system on “ad-refkit”, equipped with a Zybo-Z7 board, to realize two main functionalities: 1) providing recognition results from the actual contest environment over the Internet and 2) being updated remotely based on those results. With these functionalities, we can realize high-performance and low-power systems using an FPGA.

Citations: 0

LETA: A lightweight exchangeable-track accelerator for efficientnet based on FPGA

2021 International Conference on Field-Programmable Technology (ICFPT)

Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609919

Jingbo Gao, Yu Qian, Yihan Hu, Xitian Fan, W. Luk, Wei Cao, Lingli Wang

Lightweight convolutional neural networks (CNNs) have become increasingly popular due to their lower computational complexity and fewer memory accesses at equivalent accuracy compared to previous CNN models. However, the newly proposed networks bring new challenges to efficient hardware design, such as the depthwise convolution, squeeze-and-excitation (SE) module, and swish/sigmoid functions in EfficientNet. Although an individual engine architecture can achieve high computing efficiency for either standard convolution or depthwise convolution, it is still not efficient for EfficientNet because the workload imbalance between the two types of convolutional engines causes inevitable idling. To overcome this problem, we present a lightweight reconfigurable computational kernel based on FPGA with an exchangeable-track datapath scheme. In addition, a low-accuracy-loss function replacement strategy is proposed for the swish/sigmoid functions, and a low-cost hardware architecture is designed to implement the replaced functions. The proposed accelerator (LETA) can implement EfficientNet on a Xilinx XCVU37P with a 300 MHz system clock and a 600 MHz kernel clock. The linear growth of resource usage in the 4-kernel implementation in 1 super logic region (SLR) with the same clock frequencies justifies the scalability of LETA. The experimental results show that LETA can achieve 2× throughput/DSP compared to the latest FPGA-based accelerator with 1.6% (0.7%) top-1 (top-5) accuracy loss on EfficientNet-B3.

Citations: 7

SoC FPGA implementation of an unmanned mobile vehicle with an image transmission system over VNC

2021 International Conference on Field-Programmable Technology (ICFPT)

Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609904

Keigo Motoyoshi, Yuta Imamura, Taichi Saikai, Koki Fujita, Daiki Furukawa, Masatomo Matsuda, Tatsuma Mori, Yasutoshi Araki, Takehiro Miura, Keizo Yamashita, Haruto Ikehara, Kaito Ohira, Katsuaki Kamimae, Takuho Kawazu, Masahiro Nishimura, Shintaro Matsui, Koki Tomonaga, Taito Manabe, Yuichiro Shibata

We are developing an unmanned mobile vehicle implemented on an SoC FPGA for the FPGA design competition. For highly productive development of image-based self-driving mobile vehicles, a remote verification and debugging environment with real-time image transmission is important. This paper presents an image transmission system with which we can monitor the vehicle's onboard camera images and feature detection results over a VNC Wi-Fi connection. We implemented the whole system on a Xilinx Zynq-7000 with a maximum operating frequency of 125 MHz. The evaluation of the system showed that, in this experiment, a resolution of 640×720 is the most beneficial for VNC in terms of the performance ratio of VNC to SSH X11 forwarding. We also briefly describe other components to be used to develop the autonomous driving system.

Citations: 0

Total-ionizing-dose tolerance evaluation of an optoelectronic field programmable gate array VLSI during operation

2021 International Conference on Field-Programmable Technology (ICFPT)

Pub Date : 2021-12-06 DOI: 10.1109/ICFPT52863.2021.9609910

Hirotoshi Ito, Minoru Watanabe

This paper presents the total-ionizing-dose tolerance evaluation of an optoelectronic field programmable gate array (FPGA) during operation. The optoelectronic FPGA was fabricated using 0.18 $\mu$m standard complementary metal oxide semiconductor (CMOS) process technology. An experiment assessing the total-ionizing-dose tolerance of the optoelectronic FPGA was conducted at a 2.27–2.28 kGy/h dose rate using a $^{60}$Co gamma radiation source. Results clarified that the optoelectronic FPGA can function correctly under a 2.27–2.28 kGy/h dose rate and that the total-ionizing-dose tolerance of the optoelectronic FPGA is greater than 80 Mrad during operation. The total-ionizing-dose tolerance result is 80 times higher than that of typical radiation-hardened very large scale integrated circuits (VLSIs) and typical radiation-hardened FPGAs.
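
The abstract mixes two dose units (kGy for the rate, Mrad for the total), so a small conversion sketch may help relate them. It relies on the standard definition 1 Gy = 100 rad. The exposure time it computes is a derived estimate that assumes a constant dose rate over the whole test, and the abstract does not state the actual duration. The function names are illustrative.

```python
RAD_PER_GRAY = 100.0  # standard definition: 1 Gy = 100 rad


def mrad_to_kgy(mrad):
    """Convert a dose in Mrad to kGy."""
    return mrad * 1e6 / RAD_PER_GRAY / 1e3


def hours_to_reach(total_mrad, rate_kgy_per_h):
    """Hours needed to accumulate total_mrad at a constant dose rate."""
    return mrad_to_kgy(total_mrad) / rate_kgy_per_h


total_kgy = mrad_to_kgy(80)          # tolerance quoted in the abstract
slow = hours_to_reach(80, 2.27)      # lower end of the quoted dose rate
fast = hours_to_reach(80, 2.28)      # upper end of the quoted dose rate
print(f"{total_kgy:.0f} kGy, roughly {fast:.0f}-{slow:.0f} h of irradiation")
```

So 80 Mrad corresponds to 800 kGy, which at the quoted dose rate would take on the order of 350 hours of continuous irradiation.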

Citations: 0
