A Self-adaptation Method of Fitting Convolutional Neural Network into FPGA (Abstract Only)
Ning Mao, Zhihong Huang, Xing Wei, He Zhao, Xinkai Di, Le Yu, Haigang Yang
Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Published: 2018-02-15. DOI: 10.1145/3174243.3175003
Abstract
In recent years, Convolutional Neural Networks (CNNs) have been widely used in many artificial intelligence (AI) related fields. Among the many implementation platforms for CNNs, the FPGA is regarded as an optimal platform because of its high power efficiency and flexibility. Although various FPGA accelerators have been proposed to realize CNNs, some of them are implemented through High-Level Synthesis, such as OpenCL, which may result in inefficiency in both operation performance and resource utilization. Therefore, we propose to parameterize the RTL design at both the algorithm and hardware implementation levels. Four types of parallelism are considered to model the parameterized design, in terms of the input feature map, the output feature map, the layer, and the convolution kernel. Meanwhile, a library covering the convolution layer, fully-connected layer, pooling layer, and control module is established to cater for various CNN models. Further, an algorithm is proposed to find an optimal level of parallelism under limited resources. As a case study, four typical CNNs are implemented on a Stratix III EP3SL110, using on-chip memory. Compared with some existing works using an automated design flow, the implementations obtained by the proposed approach achieve up to a 17.13× improvement in GOPS. By our best estimate, our design also achieves 1.33× resource efficiency and 3.61× power efficiency.
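The abstract describes an algorithm that selects parallelism factors subject to limited FPGA resources. The paper's actual model is not given here, but the idea can be sketched as a brute-force search over three of the four parallelism dimensions (input feature map, output feature map, convolution kernel), minimizing estimated cycles under a DSP budget. All numbers, names, and the cost model below are illustrative assumptions, not the paper's method.

```python
from itertools import product

# Illustrative resource budget (assumed, not the paper's numbers)
DSP_BUDGET = 288          # parallel multipliers available on the device

# Per-layer MAC counts for a small example CNN (assumed values)
LAYER_MACS = [1_000_000, 4_000_000, 2_000_000]

def search_parallelism(candidates=(1, 2, 4, 8, 16)):
    """Brute-force the (p_in, p_out, p_k) parallelism factors that
    minimize total cycles while staying within the DSP budget."""
    best = None
    for p_in, p_out, p_k in product(candidates, repeat=3):
        dsps = p_in * p_out * p_k          # one DSP per concurrent MAC
        if dsps > DSP_BUDGET:
            continue
        # Cycle estimate: each layer's MACs divided by MACs done per cycle,
        # rounded up (ceiling division via negation trick)
        cycles = sum(-(-m // dsps) for m in LAYER_MACS)
        if best is None or cycles < best[0]:
            best = (cycles, (p_in, p_out, p_k))
    return best

cycles, factors = search_parallelism()
print(factors, cycles)
```

With this toy cost model, throughput depends only on the product of the factors, so any combination saturating the DSP budget is equally good; the paper's fourth dimension (layer-level parallelism) and its memory constraints would break such ties in a real design-space exploration.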