Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network Layers (Abstract Only)

Yongming Shen, M. Ferdman, Peter Milder
{"title":"Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network Layers (Abstract Only)","authors":"Yongming Shen, M. Ferdman, Peter Milder","doi":"10.1145/3020078.3021795","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. These networks typically use convolutional layers for feature extraction and fully-connected layers to perform classification using those features. Significant interest in improving the performance of CNNs has led to the design of CNN accelerators to improve their evaluation throughput and efficiency. However, work on CNN accelerators has mostly concentrated on accelerating the computationally-intensive convolutional layers, while a major bottleneck of the existing designs arises due to the data-intensive fully-connected layers. Unfortunately, the leading approaches to reducing bandwidth of the fully-connected layers are limited by the storage capacity of the on-chip buffers. We observe that, in addition to the possibility of reducing CNN weight transfer bandwidth by adding more on-chip buffers, it is also possible to reduce the size of the on-chip buffers at the cost of CNN input transfer. Paradoxically, shrinking the size of the on-chip buffers costs significantly less input bandwidth than the weight bandwidth saved by adding more buffers. Leveraging these observations, we develop a design methodology for fully-connected layer accelerators that require substantially less off-chip bandwidth by balancing between the input and weight transfers. Using 160KB of BRAM enables the prior work to reduce off-chip bandwidth by 5x on the most bandwidth-intensive fully-connected layers of the popular AlexNet and VGGNet-E networks. With our newly proposed methodology, using the same 160KB of BRAM produces a design with 71x bandwidth reduction on the same networks.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3020078.3021795","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. These networks typically use convolutional layers for feature extraction and fully-connected layers to perform classification using those features. Significant interest in improving the performance of CNNs has led to the design of CNN accelerators that improve evaluation throughput and efficiency. However, work on CNN accelerators has mostly concentrated on accelerating the computationally intensive convolutional layers, while a major bottleneck of existing designs arises from the data-intensive fully-connected layers. Unfortunately, the leading approaches to reducing the bandwidth of the fully-connected layers are limited by the storage capacity of the on-chip buffers. We observe that, in addition to reducing CNN weight-transfer bandwidth by adding more on-chip buffers, it is also possible to shrink the on-chip buffers at the cost of extra CNN input transfers. Paradoxically, shrinking the on-chip buffers costs significantly less input bandwidth than the weight bandwidth saved by adding more buffers. Leveraging these observations, we develop a design methodology for fully-connected-layer accelerators that requires substantially less off-chip bandwidth by balancing input and weight transfers. Using 160KB of BRAM, the prior work reduces off-chip bandwidth by 5x on the most bandwidth-intensive fully-connected layers of the popular AlexNet and VGGNet-E networks. With our newly proposed methodology, the same 160KB of BRAM yields a design with a 71x bandwidth reduction on the same networks.
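To make the buffer-versus-bandwidth tradeoff concrete, the sketch below is a minimal back-of-the-envelope traffic model, not the paper's actual methodology. It assumes a simple batching scheme (illustrative, not from the source): weights are streamed from off-chip once per batch, the on-chip buffer holds one input vector per batched image, and values are 16 bits. The function name, layer dimensions, and buffering policy are all assumptions for illustration.

```python
# Simplified off-chip traffic model for a batched fully-connected layer.
# Illustrative assumptions (not the paper's exact scheme): weights are
# streamed from off-chip once per batch, and the on-chip buffer must hold
# one input vector per batched image, so buffer size caps the batch size.

def fc_traffic_per_image(n_in, n_out, buffer_bytes, bytes_per_value=2):
    """Return (batch size, off-chip bytes moved per image)."""
    # Largest batch whose input vectors fit in the on-chip buffer.
    batch = max(1, buffer_bytes // (n_in * bytes_per_value))
    weight_bytes = n_in * n_out * bytes_per_value  # streamed once per batch
    input_bytes = n_in * bytes_per_value           # fetched once per image
    output_bytes = n_out * bytes_per_value         # written once per image
    per_image = weight_bytes / batch + input_bytes + output_bytes
    return batch, per_image

# Example: AlexNet's FC6 layer (9216 inputs, 4096 outputs), 160 KB buffer.
batch, per_image = fc_traffic_per_image(9216, 4096, 160 * 1024)
_, unbatched = fc_traffic_per_image(9216, 4096, 0)
print(f"batch={batch}, per-image traffic={per_image / 1e6:.2f} MB "
      f"({unbatched / per_image:.1f}x less than unbatched)")
```

In this simple model the weight traffic falls as 1/batch, so buffer capacity directly limits the achievable bandwidth reduction. The abstract's observation is that this balance can be shifted: spending a small amount of cheap input bandwidth to shrink per-image buffer requirements permits much larger effective batches, and hence much larger weight-traffic savings, from the same 160KB of BRAM.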