Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network Layers (Abstract Only)

Yongming Shen, M. Ferdman, Peter Milder
{"title":"Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network Layers (Abstract Only)","authors":"Yongming Shen, M. Ferdman, Peter Milder","doi":"10.1145/3020078.3021795","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. These networks typically use convolutional layers for feature extraction and fully-connected layers to perform classification using those features. Significant interest in improving the performance of CNNs has led to the design of CNN accelerators to improve their evaluation throughput and efficiency. However, work on CNN accelerators has mostly concentrated on accelerating the computationally-intensive convolutional layers, while a major bottleneck of the existing designs arises due to the data-intensive fully-connected layers. Unfortunately, the leading approaches to reducing bandwidth of the fully-connected layers are limited by the storage capacity of the on-chip buffers. We observe that, in addition to the possibility of reducing CNN weight transfer bandwidth by adding more on-chip buffers, it is also possible to reduce the size of the on-chip buffers at the cost of CNN input transfer. Paradoxically, shrinking the size of the on-chip buffers costs significantly less input bandwidth than the weight bandwidth saved by adding more buffers. Leveraging these observations, we develop a design methodology for fully-connected layer accelerators that require substantially less off-chip bandwidth by balancing between the input and weight transfers. Using 160KB of BRAM enables the prior work to reduce off-chip bandwidth by 5x on the most bandwidth-intensive fully-connected layers of the popular AlexNet and VGGNet-E networks. With our newly proposed methodology, using the same 160KB of BRAM produces a design with 71x bandwidth reduction on the same networks.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3020078.3021795","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. These networks typically use convolutional layers for feature extraction and fully-connected layers to perform classification using those features. Significant interest in improving the performance of CNNs has led to the design of CNN accelerators that improve evaluation throughput and efficiency. However, work on CNN accelerators has mostly concentrated on accelerating the computationally intensive convolutional layers, while a major bottleneck of existing designs arises from the data-intensive fully-connected layers. Unfortunately, the leading approaches to reducing the bandwidth of the fully-connected layers are limited by the storage capacity of the on-chip buffers. We observe that, in addition to reducing CNN weight-transfer bandwidth by adding more on-chip buffers, it is also possible to shrink the on-chip buffers at the cost of extra CNN input transfers. Paradoxically, shrinking the on-chip buffers costs significantly less input bandwidth than the weight bandwidth saved by adding more buffers. Leveraging these observations, we develop a design methodology for fully-connected-layer accelerators that requires substantially less off-chip bandwidth by balancing input and weight transfers. Using 160KB of BRAM, the prior work reduces off-chip bandwidth by 5x on the most bandwidth-intensive fully-connected layers of the popular AlexNet and VGGNet-E networks. With our newly proposed methodology, the same 160KB of BRAM yields a design with a 71x bandwidth reduction on the same networks.
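To make the buffer-versus-bandwidth tradeoff concrete, the sketch below is a minimal back-of-the-envelope traffic model, not the paper's actual methodology. It assumes a simple batching scheme (illustrative, not from the source): weights are streamed from off-chip once per batch, the on-chip buffer holds one input vector per batched image, and values are 16 bits. The function name, layer dimensions, and buffering policy are all assumptions for illustration.

```python
# Simplified off-chip traffic model for a batched fully-connected layer.
# Illustrative assumptions (not the paper's exact scheme): weights are
# streamed from off-chip once per batch, and the on-chip buffer must hold
# one input vector per batched image, so buffer size caps the batch size.

def fc_traffic_per_image(n_in, n_out, buffer_bytes, bytes_per_value=2):
    """Return (batch size, off-chip bytes moved per image)."""
    # Largest batch whose input vectors fit in the on-chip buffer.
    batch = max(1, buffer_bytes // (n_in * bytes_per_value))
    weight_bytes = n_in * n_out * bytes_per_value  # streamed once per batch
    input_bytes = n_in * bytes_per_value           # fetched once per image
    output_bytes = n_out * bytes_per_value         # written once per image
    per_image = weight_bytes / batch + input_bytes + output_bytes
    return batch, per_image

# Example: AlexNet's FC6 layer (9216 inputs, 4096 outputs), 160 KB buffer.
batch, per_image = fc_traffic_per_image(9216, 4096, 160 * 1024)
_, unbatched = fc_traffic_per_image(9216, 4096, 0)
print(f"batch={batch}, per-image traffic={per_image / 1e6:.2f} MB "
      f"({unbatched / per_image:.1f}x less than unbatched)")
```

In this simple model the weight traffic falls as 1/batch, so buffer capacity directly limits the achievable bandwidth reduction. The abstract's observation is that this balance can be shifted: spending a small amount of cheap input bandwidth to shrink per-image buffer requirements permits much larger effective batches, and hence much larger weight-traffic savings, from the same 160KB of BRAM.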