{"title":"Efficient Stride 2 Winograd Convolution Method Using Unified Transformation Matrices on FPGA","authors":"Chengcheng Huang, Xiaoxiao Dong, Zhao Li, Tengteng Song, Zhenguo Liu, Lele Dong","doi":"10.1109/ICFPT52863.2021.9609907","DOIUrl":null,"url":null,"abstract":"Winograd algorithm can effectively reduce the computational complexity of convolution operation. Effectively using the parallelism of Winograd convolution algorithm can effectively improve the performance of accelerator architectures on FPGA. The stride represents the number of elements that the window slides when filter is scanned on the input feature map. The Winograd algorithm with the stride of 2 implemented in previous studies divided the input feature maps into multiple groups of Winograd algorithms to complete the operations, resulting in additional precomputation and hardware resource overhead. In this paper, we propose a new Winograd convolution algorithm with the stride of 2. This method uses the unified Winograd transformation matrices instead of the grouping method to complete the calculation. Therefore, the method proposed in this paper can realize 2D Winograd convolution and 3D Winograd convolution by nested 1D Winograd convolution, just like the Winograd convolution algorithm with the stride of 1. In this paper, Winograd transformation matrices with kernel size of 3, 5, and 7 are provided. In particular, for convolution with the kernel of 3, this method reduces the addition operations of Winograd algorithm by 30.0%-31.5% and removes unnecessary shift operations completely. In addition, we implement Winograd convolution algorithm with the stride of 2 through template design, and realize pipeline and data reuse. Compared to the state-of-the-art implementation, the proposed method results in a speedup of 1.24 and reduces resource usage.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"262 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Field-Programmable Technology (ICFPT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICFPT52863.2021.9609907","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
The Winograd algorithm can effectively reduce the computational complexity of convolution operations, and exploiting its parallelism can significantly improve the performance of FPGA accelerator architectures. The stride is the number of elements by which the window slides as the filter is scanned across the input feature map. Previous implementations of the Winograd algorithm with a stride of 2 decomposed the input feature map into multiple groups, each processed by a separate Winograd computation, incurring additional precomputation and hardware resource overhead. In this paper, we propose a new Winograd convolution algorithm with a stride of 2. The method uses unified Winograd transformation matrices instead of the grouping approach to complete the calculation. Consequently, it can realize 2D and 3D Winograd convolution by nesting 1D Winograd convolutions, just like the Winograd convolution algorithm with a stride of 1. We provide Winograd transformation matrices for kernel sizes of 3, 5, and 7. In particular, for convolution with a kernel size of 3, the method reduces the addition operations of the Winograd algorithm by 30.0%-31.5% and completely removes unnecessary shift operations. In addition, we implement the stride-2 Winograd convolution algorithm through template design, realizing pipelining and data reuse. Compared to the state-of-the-art implementation, the proposed method achieves a speedup of 1.24x and reduces resource usage.
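As background for the nesting property the abstract refers to, the sketch below shows the standard stride-1 Winograd F(2,3) transform, Y = A^T[(Gg) ⊙ (B^T d)], and how 2D convolution follows by nesting the 1D transform; the proposed method preserves this nesting for stride 2. This is a minimal illustrative sketch using the well-known textbook F(2,3) matrices, not the paper's unified stride-2 transformation matrices, and the function names are hypothetical.

import numpy as np

# Textbook Winograd F(2,3) transformation matrices for stride-1 convolution
# (illustrative only; the paper's unified stride-2 matrices are not reproduced here).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23_1d(d, g):
    """1D F(2,3): 2 outputs from a 4-element input tile d and a 3-tap filter g."""
    return AT @ ((G @ g) * (BT @ d))                      # Y = A^T [(G g) .* (B^T d)]

def winograd_f23_2d(d, g):
    """2D F(2x2,3x3) by nesting the 1D transform: Y = A^T [(G g G^T) .* (B^T d B)] A."""
    return AT @ ((G @ g @ G.T) * (BT @ d @ BT.T)) @ AT.T

# Quick check against direct (valid, stride-1) convolution on one 4x4 tile.
d = np.random.rand(4, 4)
g = np.random.rand(3, 3)
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
assert np.allclose(winograd_f23_2d(d, g), direct)

Because the 2D (and 3D) transforms are obtained by applying the same 1D transform along each dimension, a single set of transformation matrices serves as the building block for all dimensionalities; the paper's contribution is a unified set of such matrices for stride 2, avoiding the grouped sub-convolutions used in prior stride-2 designs.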