Many Universal Convolution Cores for Ensemble Sparse Convolutional Neural Networks

Ryosuke Kuramochi, Youki Sada, Masayuki Shimoda, Shimpei Sato, Hiroki Nakahara
{"title":"集合稀疏卷积神经网络的通用卷积核","authors":"Ryosuke Kuramochi, Youki Sada, Masayuki Shimoda, Shimpei Sato, Hiroki Nakahara","doi":"10.1109/MCSoC.2019.00021","DOIUrl":null,"url":null,"abstract":"A convolutional neural network~(CNN) is one of the most successfully used neural networks and it is widely used for many embedded computer vision tasks. However, it requires a massive number of multiplication and accumulation (MAC) computations with high-power consumption to realize it, and higher recognition accuracy is desired for modern tasks. In the paper, we apply a sparseness technique to generate a weak classifier to build an ensemble CNN. There is a trade-off between recognition accuracy and inference speed, and we control sparse (zero weight) ratio to make an excellent performance and better recognition accuracy. We use P sparse weight CNNs with a dataflow pipeline architecture that hides the performance overhead for multiple CNN evaluation on the ensemble CNN. We set an adequate sparse ratio to adjust the number of operation cycles in each stage. The proposed ensemble CNN depends on the dataset quality and it has different layer configurations. We propose a universal convolution core to realize variations of modern convolutional operations, and extend it to many cores with pipelining architecture to achieve high-throughput operation. Therefore, while computing efficiency is poor on GPUs which is unsuitable for a sparseness convolution, on our universal convolution cores can realize an architecture with excellent pipeline efficiency. We measure the trade-off between recognition accuracy and inference speed using existing benchmark datasets and CNN models. By setting the sparsity ratio and the number of predictors appropriately, high-speed architectures are realized on the many universal covers while the recognition accuracy is improved compared to the conventional single CNN realization. We implemented the prototype of many universal convolution cores on the Xilinx Kintex UltraScale+ FPGA, and compared with the desktop GPU realization of the ensembling, the proposed many core based accelerator for the ensemble sparse CNN is 3.09 times faster, 4.20 times lower power, and 13.33 times better as for the performance per power. Therefore, by realizing the proposed ensemble method with many of universal convolution cores, a high-speed inference could be achieved while improving the recognition accuracy compared with the conventional dense weight CNN on the desktop GPU.","PeriodicalId":104240,"journal":{"name":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Many Universal Convolution Cores for Ensemble Sparse Convolutional Neural Networks\",\"authors\":\"Ryosuke Kuramochi, Youki Sada, Masayuki Shimoda, Shimpei Sato, Hiroki Nakahara\",\"doi\":\"10.1109/MCSoC.2019.00021\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A convolutional neural network~(CNN) is one of the most successfully used neural networks and it is widely used for many embedded computer vision tasks. However, it requires a massive number of multiplication and accumulation (MAC) computations with high-power consumption to realize it, and higher recognition accuracy is desired for modern tasks. 
In the paper, we apply a sparseness technique to generate a weak classifier to build an ensemble CNN. There is a trade-off between recognition accuracy and inference speed, and we control sparse (zero weight) ratio to make an excellent performance and better recognition accuracy. We use P sparse weight CNNs with a dataflow pipeline architecture that hides the performance overhead for multiple CNN evaluation on the ensemble CNN. We set an adequate sparse ratio to adjust the number of operation cycles in each stage. The proposed ensemble CNN depends on the dataset quality and it has different layer configurations. We propose a universal convolution core to realize variations of modern convolutional operations, and extend it to many cores with pipelining architecture to achieve high-throughput operation. Therefore, while computing efficiency is poor on GPUs which is unsuitable for a sparseness convolution, on our universal convolution cores can realize an architecture with excellent pipeline efficiency. We measure the trade-off between recognition accuracy and inference speed using existing benchmark datasets and CNN models. By setting the sparsity ratio and the number of predictors appropriately, high-speed architectures are realized on the many universal covers while the recognition accuracy is improved compared to the conventional single CNN realization. We implemented the prototype of many universal convolution cores on the Xilinx Kintex UltraScale+ FPGA, and compared with the desktop GPU realization of the ensembling, the proposed many core based accelerator for the ensemble sparse CNN is 3.09 times faster, 4.20 times lower power, and 13.33 times better as for the performance per power. Therefore, by realizing the proposed ensemble method with many of universal convolution cores, a high-speed inference could be achieved while improving the recognition accuracy compared with the conventional dense weight CNN on the desktop GPU.\",\"PeriodicalId\":104240,\"journal\":{\"name\":\"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MCSoC.2019.00021\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MCSoC.2019.00021","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

A convolutional neural network (CNN) is one of the most successful neural network architectures and is widely used for embedded computer vision tasks. However, it requires a massive number of multiply-accumulate (MAC) operations and therefore consumes considerable power, while modern tasks also demand higher recognition accuracy. In this paper, we apply a sparsification technique to generate weak classifiers and combine them into an ensemble CNN. There is a trade-off between recognition accuracy and inference speed, and we control the sparsity (zero-weight) ratio to obtain both high performance and better recognition accuracy. We use P sparse-weight CNNs with a dataflow pipeline architecture that hides the overhead of evaluating multiple CNNs in the ensemble, and we set an adequate sparsity ratio to balance the number of operation cycles in each pipeline stage. Because the layer configuration of the proposed ensemble CNN depends on the dataset, we propose a universal convolution core that realizes the variations of modern convolutional operations and extend it to many cores with a pipelined architecture to achieve high throughput. Whereas sparse convolution maps poorly onto GPUs and yields low computing efficiency there, our universal convolution cores realize an architecture with excellent pipeline efficiency. We measure the trade-off between recognition accuracy and inference speed on existing benchmark datasets and CNN models. By setting the sparsity ratio and the number of predictors appropriately, a high-speed architecture is realized on the many universal convolution cores while recognition accuracy improves over a conventional single-CNN realization. We implemented a prototype of the many universal convolution cores on a Xilinx Kintex UltraScale+ FPGA; compared with a desktop GPU realization of the same ensemble, the proposed many-core accelerator for the ensemble sparse CNN is 3.09 times faster, consumes 4.20 times less power, and delivers 13.33 times better performance per watt. Thus, realizing the proposed ensemble method on many universal convolution cores achieves high-speed inference while improving recognition accuracy compared with a conventional dense-weight CNN on a desktop GPU.
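To make the ensemble construction concrete, the following sketch models the abstract's idea in software: each weak classifier is a copy of a trained CNN whose weights are magnitude-pruned to a target sparsity ratio, and the P predictors' outputs are combined at inference time. The pruning criterion and the plain-averaging combination rule are illustrative assumptions; the abstract does not specify either.

```python
import numpy as np

def prune_to_sparsity(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries so that roughly the given
    fraction of weights becomes zero (magnitude pruning; an assumed criterion)."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def ensemble_predict(models, x):
    """Combine P weak (sparse) CNNs by averaging their class-probability
    outputs and taking the arg-max; averaging is an assumed rule."""
    probs = np.mean([m(x) for m in models], axis=0)
    return int(np.argmax(probs))
```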
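The claim that the sparsity ratio is set to balance the number of operation cycles in each pipeline stage follows from a simple cost model: with zero-skipping hardware, a stage issues MACs only for nonzero weights, and a dataflow pipeline runs at the pace of its slowest stage. The estimate below is an illustrative model under assumed parameters, not the paper's exact cost function.

```python
import math

def conv_stage_cycles(out_h, out_w, out_ch, in_ch, k, sparsity, mac_units):
    """Rough cycle count for one zero-skipping convolution stage: only the
    (1 - sparsity) fraction of weights issues MACs, spread over the stage's
    parallel MAC units (an assumed cost model)."""
    total_macs = out_h * out_w * out_ch * in_ch * k * k
    nonzero_macs = total_macs * (1.0 - sparsity)
    return math.ceil(nonzero_macs / mac_units)

# A heavier layer is pruned harder so both stages take about the same
# number of cycles and neither stalls the pipeline.
light = conv_stage_cycles(28, 28, 64, 64, 3, sparsity=0.5, mac_units=256)
heavy = conv_stage_cycles(28, 28, 128, 128, 3, sparsity=0.875, mac_units=256)
print(light, heavy)  # both 56448: balanced stage latencies
```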
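The universal convolution core is described only at the level of realizing the variations of modern convolutional operations. As a behavioral model of that idea, the sketch below uses one parameterized loop nest that covers standard, pointwise (1x1), and depthwise/grouped convolutions, and skips zero weights the way a sparse core avoids issuing MACs. The parameterization is an assumption for illustration, not the paper's hardware design.

```python
import numpy as np

def universal_conv(x, w, stride=1, pad=0, groups=1):
    """One loop nest covering standard (groups=1, k>1), pointwise (k=1),
    and depthwise (groups=in_ch) convolutions, with zero weights skipped.
    x: (in_ch, H, W), w: (out_ch, in_ch // groups, k, k)."""
    in_ch, H, W = x.shape
    out_ch, icg, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    oh = (H + 2 * pad - k) // stride + 1
    ow = (W + 2 * pad - k) // stride + 1
    y = np.zeros((out_ch, oh, ow))
    for oc in range(out_ch):
        g = oc // (out_ch // groups)       # channel group of this filter
        for ic in range(icg):
            for kh in range(k):
                for kw in range(k):
                    wv = w[oc, ic, kh, kw]
                    if wv == 0.0:          # zero-skipping: no MAC issued
                        continue
                    patch = xp[g * icg + ic,
                               kh:kh + stride * oh:stride,
                               kw:kw + stride * ow:stride]
                    y[oc] += wv * patch
    return y

# Standard 3x3, pointwise 1x1, and depthwise 3x3 all map onto the same core.
x = np.random.randn(8, 16, 16)
y_std = universal_conv(x, np.random.randn(4, 8, 3, 3), pad=1)
y_pw  = universal_conv(x, np.random.randn(4, 8, 1, 1))
y_dw  = universal_conv(x, np.random.randn(8, 1, 3, 3), pad=1, groups=8)
```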