The Winograd algorithm can effectively reduce the computational complexity of the convolution operation, and exploiting its parallelism can improve the performance of accelerator architectures on FPGA. The stride is the number of elements by which the window slides as the filter is scanned over the input feature map. Previous implementations of the Winograd algorithm with a stride of 2 divided the input feature maps into multiple groups, each processed by a separate Winograd computation, which incurs additional precomputation and hardware resource overhead. In this paper, we propose a new Winograd convolution algorithm with a stride of 2 that uses unified Winograd transformation matrices instead of the grouping method. As a result, the proposed method can realize 2D and 3D Winograd convolution by nesting 1D Winograd convolution, just like the Winograd convolution algorithm with a stride of 1. We provide Winograd transformation matrices for kernel sizes of 3, 5, and 7. In particular, for convolution with a kernel size of 3, the method reduces the addition operations of the Winograd algorithm by 30.0%-31.5% and completely removes unnecessary shift operations. In addition, we implement the stride-2 Winograd convolution algorithm through a template design that realizes pipelining and data reuse. Compared to the state-of-the-art implementation, the proposed method achieves a speedup of 1.24× and reduces resource usage.
{"title":"Efficient Stride 2 Winograd Convolution Method Using Unified Transformation Matrices on FPGA","authors":"Chengcheng Huang, Xiaoxiao Dong, Zhao Li, Tengteng Song, Zhenguo Liu, Lele Dong","doi":"10.1109/ICFPT52863.2021.9609907","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609907","url":null,"abstract":"Winograd algorithm can effectively reduce the computational complexity of convolution operation. Effectively using the parallelism of Winograd convolution algorithm can effectively improve the performance of accelerator architectures on FPGA. The stride represents the number of elements that the window slides when filter is scanned on the input feature map. The Winograd algorithm with the stride of 2 implemented in previous studies divided the input feature maps into multiple groups of Winograd algorithms to complete the operations, resulting in additional precomputation and hardware resource overhead. In this paper, we propose a new Winograd convolution algorithm with the stride of 2. This method uses the unified Winograd transformation matrices instead of the grouping method to complete the calculation. Therefore, the method proposed in this paper can realize 2D Winograd convolution and 3D Winograd convolution by nested 1D Winograd convolution, just like the Winograd convolution algorithm with the stride of 1. In this paper, Winograd transformation matrices with kernel size of 3, 5, and 7 are provided. In particular, for convolution with the kernel of 3, this method reduces the addition operations of Winograd algorithm by 30.0%-31.5% and removes unnecessary shift operations completely. In addition, we implement Winograd convolution algorithm with the stride of 2 through template design, and realize pipeline and data reuse. Compared to the state-of-the-art implementation, the proposed method results in a speedup of 1.24 and reduces resource usage.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"262 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134072165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-06, DOI: 10.1109/ICFPT52863.2021.9609832
P. Silva, João Bispo, N. Paulino
We discuss the concept of FPGA-unfriendliness, the property of certain algorithms, programs, or domains which may limit their applicability to FPGAs. Specifically, we look at graph analysis, which has recently seen increased interest in combination with High-Level Synthesis, but has yet to find great success compared to established acceleration mechanisms. To this end, we make use of Xilinx's Vitis Graph Library to implement Single-Source Shortest Paths (SSSP) and PageRank (PR), and present a custom kernel written from the ground up for Distinctiveness Centrality (DC, a novel graph centrality measure). We use public datasets to test these implementations, and analyse power consumption and execution time. Our comparisons against published data for GPU and CPU execution show FPGA slowdowns in execution time between around 18.5x and 328x for SSSP, and around 1.8x and 195x for PR, respectively. In some instances, we obtained FPGA speedups versus CPU of up to 2.5x for PR. Regarding DC, results show speedups from 0.1x to 3.5x, and energy efficiency increases from 0.8x to 6x. Lastly, we provide some insights regarding the applicability of FPGAs in FPGA-unfriendly domains, and comment on the future as FPGA and HLS technology advances.
{"title":"FPGAs as General-Purpose Accelerators for Non-Experts via HLS: The Graph Analysis Example","authors":"P. Silva, João Bispo, N. Paulino","doi":"10.1109/ICFPT52863.2021.9609832","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609832","url":null,"abstract":"We discuss the concept of FPGA-unfriendliness, the property of certain algorithms, programs, or domains which may limit their applicability to FPGAs. Specifically, we look at graph analysis, which has recently seen increased interest in combination with High-Level Synthesis, but has yet to find great success compared to established acceleration mechanisms. To this end, we make use of Xilinx's Vitis Graph Library to implement Single-Source Shortest Paths (SSSP) and PageRank (PR), and present a custom kernel written from the ground up for Distinctiveness Centrality (DC, a novel graph centrality measure). We use public datasets to test these implementations, and analyse power consumption and execution time. Our comparisons against published data for GPU and CPU execution show FPGA slowdowns in execution time between around 18.5x and 328x for SSSP, and around 1.8x and 195x for PR, respectively. In some instances, we obtained FPGA speedups versus CPU of up to 2.5x for PR. Regarding DC, results show speedups from 0.1x to 3.5x, and energy efficiency increases from 0.8x to 6x. Lastly, we provide some insights regarding the applicability of FPGAs in FPGA-unfriendly domains, and comment on the future as FPGA and HLS technology advances.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133458510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-06, DOI: 10.1109/ICFPT52863.2021.9609948
Najdet Charaf, C. Tietz, Michael Raitza, Akash Kumar, D. Göhringer
In this work, we present a solution to a common problem encountered when using FPGAs in dynamic, ever-changing environments. Even when dynamic function exchange is used to accommodate changing workloads, partial bitstreams are typically not relocatable, so the runtime environment must store every reconfigurable partition/reconfigurable module combination as a separate bitstream. We present a modular and highly flexible tool (AMAH-Flex) that converts any static and reconfigurable system into a two-dimensional, dynamically relocatable system. It also features a fully automated floorplanning phase, closing the automation gap between synthesis and bitstream relocation. It integrates with the Xilinx Vivado toolchain and supports both the 7-Series and the UltraScale+ FPGA architectures; in addition, AMAH-Flex can be ported to any Xilinx FPGA family from the 7-Series onward. We demonstrate the functionality of our tool in several reconfiguration scenarios on four different FPGA families and show that AMAH-Flex saves up to 80% of partial bitstreams.
{"title":"AMAH-Flex: A Modular and Highly Flexible Tool for Generating Relocatable Systems on FPGAs","authors":"Najdet Charaf, C. Tietz, Michael Raitza, Akash Kumar, D. Göhringer","doi":"10.1109/ICFPT52863.2021.9609948","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609948","url":null,"abstract":"In this work, we present a solution to a common problem encountered when using FPGAs in dynamic, ever-changing environments. Even when using dynamic function exchange to accommodate changing workloads, partial bitstreams are typically not relocatable. So the runtime environment needs to store all reconfigurable partition/reconfigurable module combinations as separate bitstreams. We present a modular and highly flexible tool (AMAH-Flex) that converts any static and reconfigurable system into a 2 dimensional dynamically relocatable system. It also features a fully automated floorplanning phase, closing the automation gap between synthesis and bitstream relocation. It integrates with the Xilinx Vivado toolchain and supports both FPGA architectures, the 7-Series and the UltraScale+. In addition, AMAH-Flex can be ported to any Xilinx FPGA family, starting with the 7-Series. We demonstrate the functionality of our tool in several reconfiguration scenarios on four different FPGA families and show that AMAH-Flex saves up to 80% of partial bitstreams.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133913788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-06, DOI: 10.1109/ICFPT52863.2021.9609867
Philippos Papaphilippou, K. Sano, B. Adhi, W. Luk
This paper presents a novel FPGA-based switch design that achieves high algorithmic performance and an efficient FPGA implementation. Crossbar switches based on virtual output queues (VOQs) and variations have been rather popular for implementing switches on FPGAs, with applications to network-on-chip (NoC) routers and network switches. The efficiency of VOQs is well-documented on ASICs, though we show that their disadvantages can outweigh their advantages on FPGAs. Our proposed design uses an output-queued switch internally for simplifying scheduling, and a queue balancing technique to avoid queue fragmentation and reduce the need for memory-sharing VOQs. Our implementation approaches the scheduling performance of the state-of-the-art, while requiring considerably fewer FPGA resources.
{"title":"Efficient Queue-Balancing Switch for FPGAs","authors":"Philippos Papaphilippou, K. Sano, B. Adhi, W. Luk","doi":"10.1109/ICFPT52863.2021.9609867","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609867","url":null,"abstract":"This paper presents a novel FPGA-based switch design that achieves high algorithmic performance and an efficient FPGA implementation. Crossbar switches based on virtual output queues (VOQs) and variations have been rather popular for implementing switches on FPGAs, with applications to network-on-chip (NoC) routers and network switches. The efficiency of VOQs is well-documented on ASICs, though we show that their disadvantages can outweigh their advantages on FPGAs. Our proposed design uses an output-queued switch internally for simplifying scheduling, and a queue balancing technique to avoid queue fragmentation and reduce the need for memory-sharing VOQs. Our implementation approaches the scheduling performance of the state-of-the-art, while requiring considerably fewer FPGA resources.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114842910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-06, DOI: 10.1109/ICFPT52863.2021.9609805
Kaichuang Shi, Hao Zhou, Lingli Wang
Field Programmable Gate Arrays (FPGAs) are widely used for their flexibility and short time to market. FPGA routing architecture design is a key problem because it plays a dominant role in area, delay, and power. Most modern FPGAs are island-style, providing abundant vertical and horizontal tracks to guarantee that circuit designs can be routed successfully. However, most connections in placed netlists are diagonal, which may force routes through extra turning switches, resulting in increased delay and high routing density. In this paper, we propose a hexagon-based honeycomb FPGA routing architecture to improve routability and performance. The honeycomb architecture provides three kinds of routing channels, giving more freedom to reduce the number of turning switches on routing paths. In addition, the router lookahead algorithm is enhanced to support the honeycomb architecture, which is then evaluated with the enhanced VTR on the provided benchmarks. The experimental results show that the honeycomb architecture can improve the minimum routing channel width by 7.7% compared with a traditional rectangular architecture with length-1 wires. In addition, the honeycomb architecture achieves a 9.9% improvement in routed wirelength, 11.5% in critical path delay, and 12.4% in area-delay product.
{"title":"A Hexagon-Based Honeycomb Routing Architecture for FPGA","authors":"Kaichuang Shi, Hao Zhou, Lingli Wang","doi":"10.1109/ICFPT52863.2021.9609805","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609805","url":null,"abstract":"Field Programmable Gate Arrays (FPGAs) are widely used for their flexibility and short time to market. FPGA routing architecture design is the key problem due to the fact that it plays a dominant role in the area, delay and power. Most of modern FPGAs are island-style which provide abundant vertical and horizontal tracks to guarantee the circuit designs can be routed successfully. Most connections in placed netlists are diagonal which may lead to passing through extra turning switches, resulting in increased delay cost and high routing density. In this paper, we propose a hexagon-based honeycomb FPGA routing architecture to improve the routability and performance. In honeycomb architecture, there are three kinds of routing channels which can provide more freedom to decrease the turning switches on the routing paths. In addition, the router lookahead algorithm is enhanced to support the honeycomb architecture which is then evaluated by the enhanced VTR with provided benchmarks. The experimental results show that the honeycomb architecture can improve the minimum routing channel width by 7.7% compared with traditional rectangular architecture with length-1 wires. In addition, the honeycomb architecture can achieve 9.9% improvement on the routed wirelength, 11.5% on the critical path delay and 12.4% on the area-delay product.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127795738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Han-Sok Suh, Jian Meng, Ty Nguyen, S. Venkataramanaiah, Vijay Kumar, Yu Cao, Jae-sun Seo
Convolutional neural network (CNN) based object detection has achieved very high accuracy; e.g., single-shot multi-box detectors (SSD) can efficiently detect and localize various objects in an input image. However, they require a large amount of computation and memory storage, which makes it difficult to perform efficient inference on resource-constrained hardware devices such as drones or unmanned aerial vehicles (UAVs). Drone/UAV detection is an important task for applications including surveillance, defense, and multi-drone self-localization and formation control. In this paper, we designed and co-optimized the algorithm and hardware for energy-efficient drone detection on resource-constrained FPGA devices. We trained the SSD object detection algorithm with a custom drone dataset. For inference, we employed low-precision quantization and adapted the width of the SSD CNN model. To improve throughput, we use dual-data-rate operation of the DSPs, effectively doubling throughput with a limited DSP count. For different SSD models, we analyze accuracy (mean average precision, mAP) and evaluate the corresponding FPGA hardware utilization, DRAM communication, and throughput optimization. Our proposed design achieves a high mAP of 88.42% on the multi-drone dataset, with a high energy efficiency of 79 GOPS/W and a throughput of 158 GOPS, using the Xilinx Zynq ZU3EG FPGA device on the Open Vision Computer version 3 (OVC3) platform. Our design achieves 2.7× higher energy efficiency than prior works using the same FPGA device, at a low power consumption of 1.98 W.
{"title":"Algorithm-Hardware Co-Optimization for Energy-Efficient Drone Detection on Resource-Constrained FPGA","authors":"Han-Sok Suh, Jian Meng, Ty Nguyen, S. Venkataramanaiah, Vijay Kumar, Yu Cao, Jae-sun Seo","doi":"10.1145/3583074","DOIUrl":"https://doi.org/10.1145/3583074","url":null,"abstract":"Convolutional neural network (CNN) based object detection has achieved very high accuracy, e.g. single-shot multi-box detectors (SSD) can efficiently detect and localize various objects in an input image. However, they require a high amount of computation and memory storage, which makes it difficult to perform efficient inference on resource-constrained hardware devices such as drones or unmanned aerial vehicles (UAVs). Drone/UAV detection is an important task for applications including surveillance, defense, and multi-drone self-localization and formation control. In this paper, we designed and co-optimized algorithm and hardware for energy-efficient drone detection on resource-constrained FPGA devices. We trained SSD object detection algorithm with a custom drone dataset. For inference, we employed low-precision quantization and adapted the width of the SSD CNN model. To improve throughput, we use dual-data rate operations for DSPs to effectively double the throughput with limited DSP counts. For different SSD algorithm models, we analyze accuracy or mean average precision (mAP) and evaluate the corresponding FPGA hardware utilization, DRAM communication, throughput optimization. Our proposed design achieves a high mAP of 88.42% on the multi-drone dataset, with a high energy-efficiency of 79 GOPS/W and throughput of 158 GOPS using Xilinx Zynq ZU3EG FPGA device on the Open Vision Computer version 3 (OVC3) platform. Our design achieves 2.7X higher energy efficiency than prior works using the same FPGA device, at a low-power consumption of 1.98 W.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125218993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-06, DOI: 10.1109/ICFPT52863.2021.9609852
Hyuga Hashimoto, Ryoko Naka, Yasutaka Wada
This paper describes the ongoing development of an FPGA-based image recognition system for an autonomous driving robot car for the FPT'21 Design Competition. We are developing FPGA-based image recognition on the “ad-refkit” platform equipped with a Zybo-Z7 board to realize two main functionalities: 1) providing recognition results from the actual environment of the contest over the Internet, and 2) updating the system remotely based on those results. With these functionalities, we can realize a high-performance, low-power system using an FPGA.
{"title":"An FPGA-Based Image Recognition with Remote Update Functions for Autonomous Driving on “ad-refkit”","authors":"Hyuga Hashimoto, Ryoko Naka, Yasutaka Wada","doi":"10.1109/ICFPT52863.2021.9609852","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609852","url":null,"abstract":"This paper explains the ongoing development of an FPGA-based image recognition system for an autonomous driving robot car for the FPT'21 Design Competition. We are developing an FPGA-based image recognition on “ad-refkit” equipped with a Zybo-Z7 board to realize two main functionalities: 1) to provide recognition results in the actual environment of the contest through the Internet and 2) to be updated remotely based on the results. With these functionalities, we can realize high-performance and low-power systems using an FPGA.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114922114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-06, DOI: 10.1109/ICFPT52863.2021.9609919
Jingbo Gao, Yu Qian, Yihan Hu, Xitian Fan, W. Luk, Wei Cao, Lingli Wang
Lightweight convolutional neural networks (CNNs) have become increasingly popular because they offer lower computational complexity and fewer memory accesses at equivalent accuracy compared to previous CNN models. However, these newly proposed networks bring new challenges to efficient hardware design; EfficientNet, for example, relies on depthwise convolution, the squeeze-and-excitation (SE) module, and swish/sigmoid functions. Although an individual engine architecture can achieve high computing efficiency for either standard convolution or depthwise convolution, it is still not efficient for EfficientNet, because the workload imbalance between the two types of convolution engines causes inevitable idling. To overcome this problem, we present a lightweight reconfigurable computational kernel on FPGA with an exchangeable-track datapath scheme. In addition, a low-accuracy-loss replacement strategy is proposed for the swish/sigmoid functions, together with a low-cost hardware architecture to implement the replaced functions. The proposed accelerator (LETA) implements EfficientNet on a Xilinx XCVU37P with a 300 MHz system clock and a 600 MHz kernel clock. The linear growth of resource usage in the 4-kernel implementation within one super logic region (SLR) at the same clock frequencies demonstrates the scalability of LETA. The experimental results show that LETA achieves 2× throughput/DSP compared to the latest FPGA-based accelerator, with 1.6% (0.7%) top-1 (top-5) accuracy loss on EfficientNet-B3.
{"title":"LETA: A lightweight exchangeable-track accelerator for efficientnet based on FPGA","authors":"Jingbo Gao, Yu Qian, Yihan Hu, Xitian Fan, W. Luk, Wei Cao, Lingli Wang","doi":"10.1109/ICFPT52863.2021.9609919","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609919","url":null,"abstract":"Lightweight convolutional neural networks (CNNs) have become increasingly popular due to their lower computational complexity and fewer memory accesses with equivalent accuracy compared to previous CNN models. However, the newly proposed networks bring new challenges to efficient hardware design, such as, in EfficientNet, depthwise convolution, squeeze-and-excitation (SE) module, and swish/sigmoid functions. Although individual engine architecture could achieve a high computing efficiency for the standard convolution or the depth-wise convolution, it is still not efficient for EfficientNet because the workload imbalance between two types of convolutional engines causes inevitable idling. To overcome this problem, we present a lightweight reconfigurable computational kernel based on FPGA with an exchangeable-track datapath scheme. In addition, a low-accuracy-loss function replacement strategy is proposed for swish/sigmoid functions. Furthermore, the low-cost hardware architecture to implement the replaced functions is designed. The proposed accelerator (LETA) can implement EfficientNet on Xilinx XCVU37P with a 300 MHz system clock and a 600 MHz kernel clock. The linear growth of resource usage in the 4-kernel implementation in 1 super logic region (SLR) with the same clock frequencies justifies the scalability of LETA. The experimental results show that LETA can achieve 2× throughput/DSP compared to the latest FPGA-based accelerator with 1.6% (0.7%) top-1 (top-5) accuracy loss on EfficientNet-B3.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"603 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116327636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We are developing an unmanned mobile vehicle implemented on an SoC FPGA for the FPGA design competition. For highly productive development of image-based self-driving mobile vehicles, a remote verification and debugging environment with real-time image transmission is important. This paper presents an image transmission system with which we can monitor the vehicle's onboard camera images and feature detection results over a VNC Wi-Fi connection. We implemented the whole system on a Xilinx Zynq-7000 with a maximum operating frequency of 125 MHz. The evaluation showed that a resolution of 640×720 is the most beneficial for VNC in this experiment in terms of the performance ratio of VNC to SSH X11 forwarding. We also briefly describe the other components used to develop the autonomous driving system.
{"title":"SoC FPGA implementation of an unmanned mobile vehicle with an image transmission system over VNC","authors":"Keigo Motoyoshi, Yuta Imamura, Taichi Saikai, Koki Fujita, Daiki Furukawa, Masatomo Matsuda, Tatsuma Mori, Yasutoshi Araki, Takehiro Miura, Keizo Yamashita, Haruto Ikehara, Kaito Ohira, Katsuaki Kamimae, Takuho Kawazu, Masahiro Nishimura, Shintaro Matsui, Koki Tomonaga, Taito Manabe, Yuichiro Shibata","doi":"10.1109/ICFPT52863.2021.9609904","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609904","url":null,"abstract":"We are developing the unmanned mobile vehicle implemented on SoC FPGA for the FPGA design competition. For highly productive development of image-based self-driving mobile vehicles, a remote verification and debugging environment with real-time image transmission is important. This paper presents an image transmission system with which we can monitor onboard camera images of the vehicle and feature detection results over VNC Wi-Fi connection. We implemented the whole system on a Xilinx Zynq-7000 with a maximum operating frequency of 125 MHz. The evaluation of the system showed that the resolution of 640×720 is the most beneficial for VNC in this experiment in terms of the performance ratio of VNC to SSH X11 forwarding. We also shortly describe other components to be used to develop the autonomous driving system in this paper.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114448970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-06, DOI: 10.1109/ICFPT52863.2021.9609910
Hirotoshi Ito, Minoru Watanabe
This paper presents the total-ionizing-dose tolerance evaluation of an optoelectronic field programmable gate array (FPGA) during operation. The optoelectronic FPGA was fabricated using a 0.18 µm standard complementary metal oxide semiconductor (CMOS) process technology. An experiment assessing the total-ionizing-dose tolerance of the optoelectronic FPGA was conducted at a 2.27–2.28 kGy/h dose rate using a 60Co gamma radiation source. The results clarified that the optoelectronic FPGA functions correctly under a 2.27–2.28 kGy/h dose rate and that its total-ionizing-dose tolerance during operation is greater than 80 Mrad. This total-ionizing-dose tolerance is 80 times higher than that of typical radiation-hardened very large scale integrated circuits (VLSIs) and typical radiation-hardened FPGAs.
{"title":"Total-ionizing-dose tolerance evaluation of an optoelectronic field programmable gate array VLSI during operation","authors":"Hirotoshi Ito, Minoru Watanabe","doi":"10.1109/ICFPT52863.2021.9609910","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609910","url":null,"abstract":"This paper presents the total-ionizing-dose tolerance evaluation of an optoelectronic field programmable gate array (FPGA) during operation. The optoelectronic FPGA was fabricated using 0.18 ${mu m}$ standard complementary metal oxide semiconductor (CMOS) process technology. An experiment assessing the total-ionizing-dose tolerance of the optoelectronic FPGA was conducted at a 2.27–2.28 kGy/h dose rate using a60 Co gamma radiation source. Results clarified that the optoelectronic FPGA can function correctly under a 2.27–2.28 kGy/h dose rate and that the total-ionizing-dose tolerance of the optoelectronic FPGA is greater than 80 Mrad during operation. The total-ionizing-dose tolerance result is 80 times higher than that of typical radiation-hardened very large scale integrated circuits (VLSIs) and typical radiation-hardened FPGAs.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134462933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}