On-Chip Instruction Generation for Cross-Layer CNN Accelerator on FPGA

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). Pub Date: 2019-07-15. DOI: 10.1109/ISVLSI.2019.00011

Yiming Hu, Shuang Liang, Jincheng Yu, Yu Wang, Huazhong Yang

Pages: 7-12

Citations: 3

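Abstract

Convolutional neural networks (CNNs) are gaining popularity in computer vision. CNN-based methods are computationally intensive and resource-consuming, which makes them hard to integrate into embedded systems and to apply in real-time scenarios. Many FPGA-based CNN accelerators have been proposed to achieve higher performance. A cross-layer CNN accelerator reduces data transfer by fusing several layers. However, the instruction size that needs to be transferred is usually considerable, which degrades the performance of cross-layer accelerators. In this study, we develop an on-chip instruction generation method based on the cross-layer accelerator to reduce the total instruction size transferred to the chip. We design the corresponding hardware module and modify existing object detection models according to the hardware structure to improve the accuracy of object detection tasks. The evaluation results show that, for the same calculation process, our accelerator achieves a 35% reduction in data transfer on the VGG16 network. The average instruction size and compilation time are reduced by 95% using our instruction generation method. The performance of the accelerator reaches 1414 GOP/s.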
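The idea of generating instructions on chip can be illustrated with a minimal Python sketch. It is not the paper's instruction set or hardware module: the descriptor fields, the tile-based expansion, and both byte sizes are assumptions made only for illustration. The sketch compares sending every expanded tile instruction with sending one compact descriptor per layer and expanding it on the accelerator.

```python
from dataclasses import dataclass

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
INSTR_BYTES = 16       # assumed size of one expanded tile instruction
DESCRIPTOR_BYTES = 16  # assumed size of one compact layer descriptor


@dataclass(frozen=True)
class LayerDescriptor:
    """Compact per-layer description sent to the chip (hypothetical format)."""
    height: int
    width: int
    tile_h: int
    tile_w: int


def expand(desc):
    """Yield one (row, col, rows, cols) tile instruction per output tile."""
    for r in range(0, desc.height, desc.tile_h):
        for c in range(0, desc.width, desc.tile_w):
            yield (r, c,
                   min(desc.tile_h, desc.height - r),
                   min(desc.tile_w, desc.width - c))


def transferred_bytes(layers, on_chip):
    """Bytes of instruction data sent to the chip for the given layers."""
    if on_chip:
        # Only the descriptors cross the link; expansion happens on chip.
        return DESCRIPTOR_BYTES * len(layers)
    # Every expanded tile instruction is generated off chip and transferred.
    return sum(INSTR_BYTES * sum(1 for _ in expand(d)) for d in layers)
```

With these assumed sizes, the transferred instruction data grows with the number of layers rather than the number of tiles when expansion is done on chip. The actual savings depend on the real instruction format and tiling, which this listing does not describe.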


