Many Universal Convolution Cores for Ensemble Sparse Convolutional Neural Networks

Ryosuke Kuramochi, Youki Sada, Masayuki Shimoda, Shimpei Sato, Hiroki Nakahara
{"title":"集合稀疏卷积神经网络的通用卷积核","authors":"Ryosuke Kuramochi, Youki Sada, Masayuki Shimoda, Shimpei Sato, Hiroki Nakahara","doi":"10.1109/MCSoC.2019.00021","DOIUrl":null,"url":null,"abstract":"A convolutional neural network~(CNN) is one of the most successfully used neural networks and it is widely used for many embedded computer vision tasks. However, it requires a massive number of multiplication and accumulation (MAC) computations with high-power consumption to realize it, and higher recognition accuracy is desired for modern tasks. In the paper, we apply a sparseness technique to generate a weak classifier to build an ensemble CNN. There is a trade-off between recognition accuracy and inference speed, and we control sparse (zero weight) ratio to make an excellent performance and better recognition accuracy. We use P sparse weight CNNs with a dataflow pipeline architecture that hides the performance overhead for multiple CNN evaluation on the ensemble CNN. We set an adequate sparse ratio to adjust the number of operation cycles in each stage. The proposed ensemble CNN depends on the dataset quality and it has different layer configurations. We propose a universal convolution core to realize variations of modern convolutional operations, and extend it to many cores with pipelining architecture to achieve high-throughput operation. Therefore, while computing efficiency is poor on GPUs which is unsuitable for a sparseness convolution, on our universal convolution cores can realize an architecture with excellent pipeline efficiency. We measure the trade-off between recognition accuracy and inference speed using existing benchmark datasets and CNN models. By setting the sparsity ratio and the number of predictors appropriately, high-speed architectures are realized on the many universal covers while the recognition accuracy is improved compared to the conventional single CNN realization. We implemented the prototype of many universal convolution cores on the Xilinx Kintex UltraScale+ FPGA, and compared with the desktop GPU realization of the ensembling, the proposed many core based accelerator for the ensemble sparse CNN is 3.09 times faster, 4.20 times lower power, and 13.33 times better as for the performance per power. Therefore, by realizing the proposed ensemble method with many of universal convolution cores, a high-speed inference could be achieved while improving the recognition accuracy compared with the conventional dense weight CNN on the desktop GPU.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Many Universal Convolution Cores for Ensemble Sparse Convolutional Neural Networks\",\"authors\":\"Ryosuke Kuramochi, Youki Sada, Masayuki Shimoda, Shimpei Sato, Hiroki Nakahara\",\"doi\":\"10.1109/MCSoC.2019.00021\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A convolutional neural network~(CNN) is one of the most successfully used neural networks and it is widely used for many embedded computer vision tasks. However, it requires a massive number of multiplication and accumulation (MAC) computations with high-power consumption to realize it, and higher recognition accuracy is desired for modern tasks. 
In the paper, we apply a sparseness technique to generate a weak classifier to build an ensemble CNN. There is a trade-off between recognition accuracy and inference speed, and we control sparse (zero weight) ratio to make an excellent performance and better recognition accuracy. We use P sparse weight CNNs with a dataflow pipeline architecture that hides the performance overhead for multiple CNN evaluation on the ensemble CNN. We set an adequate sparse ratio to adjust the number of operation cycles in each stage. The proposed ensemble CNN depends on the dataset quality and it has different layer configurations. We propose a universal convolution core to realize variations of modern convolutional operations, and extend it to many cores with pipelining architecture to achieve high-throughput operation. Therefore, while computing efficiency is poor on GPUs which is unsuitable for a sparseness convolution, on our universal convolution cores can realize an architecture with excellent pipeline efficiency. We measure the trade-off between recognition accuracy and inference speed using existing benchmark datasets and CNN models. By setting the sparsity ratio and the number of predictors appropriately, high-speed architectures are realized on the many universal covers while the recognition accuracy is improved compared to the conventional single CNN realization. We implemented the prototype of many universal convolution cores on the Xilinx Kintex UltraScale+ FPGA, and compared with the desktop GPU realization of the ensembling, the proposed many core based accelerator for the ensemble sparse CNN is 3.09 times faster, 4.20 times lower power, and 13.33 times better as for the performance per power. Therefore, by realizing the proposed ensemble method with many of universal convolution cores, a high-speed inference could be achieved while improving the recognition accuracy compared with the conventional dense weight CNN on the desktop GPU.\",\"PeriodicalId\":104240,\"journal\":{\"name\":\"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MCSoC.2019.00021\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MCSoC.2019.00021","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

A convolutional neural network (CNN) is one of the most successful neural network architectures and is widely used for embedded computer vision tasks. However, it requires a massive number of multiply-accumulate (MAC) operations and therefore consumes considerable power, while modern tasks also demand higher recognition accuracy. In this paper, we apply a sparsification technique to generate weak classifiers and combine them into an ensemble CNN. There is a trade-off between recognition accuracy and inference speed, and we control the sparsity (zero-weight) ratio to obtain both high performance and better recognition accuracy. We use P sparse-weight CNNs with a dataflow pipeline architecture that hides the overhead of evaluating multiple CNNs in the ensemble, and we set an adequate sparsity ratio to balance the number of operation cycles in each pipeline stage. Because the layer configuration of the proposed ensemble CNN depends on the dataset, we propose a universal convolution core that realizes the variations of modern convolutional operations and extend it to many cores with a pipelined architecture to achieve high throughput. Whereas sparse convolution maps poorly onto GPUs and yields low computing efficiency there, our universal convolution cores realize an architecture with excellent pipeline efficiency. We measure the trade-off between recognition accuracy and inference speed on existing benchmark datasets and CNN models. By setting the sparsity ratio and the number of predictors appropriately, a high-speed architecture is realized on the many universal convolution cores while recognition accuracy improves over a conventional single-CNN realization. We implemented a prototype of the many universal convolution cores on a Xilinx Kintex UltraScale+ FPGA; compared with a desktop GPU realization of the same ensemble, the proposed many-core accelerator for the ensemble sparse CNN is 3.09 times faster, consumes 4.20 times less power, and delivers 13.33 times better performance per watt. Thus, realizing the proposed ensemble method on many universal convolution cores achieves high-speed inference while improving recognition accuracy compared with a conventional dense-weight CNN on a desktop GPU.
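To make the ensemble construction concrete, the following sketch models the abstract's idea in software: each weak classifier is a copy of a trained CNN whose weights are magnitude-pruned to a target sparsity ratio, and the P predictors' outputs are combined at inference time. The pruning criterion and the plain-averaging combination rule are illustrative assumptions; the abstract does not specify either.

```python
import numpy as np

def prune_to_sparsity(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries so that roughly the given
    fraction of weights becomes zero (magnitude pruning; an assumed criterion)."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def ensemble_predict(models, x):
    """Combine P weak (sparse) CNNs by averaging their class-probability
    outputs and taking the arg-max; averaging is an assumed rule."""
    probs = np.mean([m(x) for m in models], axis=0)
    return int(np.argmax(probs))
```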
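The claim that the sparsity ratio is set to balance the number of operation cycles in each pipeline stage follows from a simple cost model: with zero-skipping hardware, a stage issues MACs only for nonzero weights, and a dataflow pipeline runs at the pace of its slowest stage. The estimate below is an illustrative model under assumed parameters, not the paper's exact cost function.

```python
import math

def conv_stage_cycles(out_h, out_w, out_ch, in_ch, k, sparsity, mac_units):
    """Rough cycle count for one zero-skipping convolution stage: only the
    (1 - sparsity) fraction of weights issues MACs, spread over the stage's
    parallel MAC units (an assumed cost model)."""
    total_macs = out_h * out_w * out_ch * in_ch * k * k
    nonzero_macs = total_macs * (1.0 - sparsity)
    return math.ceil(nonzero_macs / mac_units)

# A heavier layer is pruned harder so both stages take about the same
# number of cycles and neither stalls the pipeline.
light = conv_stage_cycles(28, 28, 64, 64, 3, sparsity=0.5, mac_units=256)
heavy = conv_stage_cycles(28, 28, 128, 128, 3, sparsity=0.875, mac_units=256)
print(light, heavy)  # both 56448: balanced stage latencies
```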
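The universal convolution core is described only at the level of realizing the variations of modern convolutional operations. As a behavioral model of that idea, the sketch below uses one parameterized loop nest that covers standard, pointwise (1x1), and depthwise/grouped convolutions, and skips zero weights the way a sparse core avoids issuing MACs. The parameterization is an assumption for illustration, not the paper's hardware design.

```python
import numpy as np

def universal_conv(x, w, stride=1, pad=0, groups=1):
    """One loop nest covering standard (groups=1, k>1), pointwise (k=1),
    and depthwise (groups=in_ch) convolutions, with zero weights skipped.
    x: (in_ch, H, W), w: (out_ch, in_ch // groups, k, k)."""
    in_ch, H, W = x.shape
    out_ch, icg, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    oh = (H + 2 * pad - k) // stride + 1
    ow = (W + 2 * pad - k) // stride + 1
    y = np.zeros((out_ch, oh, ow))
    for oc in range(out_ch):
        g = oc // (out_ch // groups)       # channel group of this filter
        for ic in range(icg):
            for kh in range(k):
                for kw in range(k):
                    wv = w[oc, ic, kh, kw]
                    if wv == 0.0:          # zero-skipping: no MAC issued
                        continue
                    patch = xp[g * icg + ic,
                               kh:kh + stride * oh:stride,
                               kw:kw + stride * ow:stride]
                    y[oc] += wv * patch
    return y

# Standard 3x3, pointwise 1x1, and depthwise 3x3 all map onto the same core.
x = np.random.randn(8, 16, 16)
y_std = universal_conv(x, np.random.randn(4, 8, 3, 3), pad=1)
y_pw  = universal_conv(x, np.random.randn(4, 8, 1, 1))
y_dw  = universal_conv(x, np.random.randn(8, 1, 3, 3), pad=1, groups=8)
```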