Cnvlutin:无效-无神经元的深度神经网络计算

Jorge Albericio, Patrick Judd, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos
{"title":"Cnvlutin:无效-无神经元的深度神经网络计算","authors":"Jorge Albericio, Patrick Judd, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos","doi":"10.1145/3007787.3001138","DOIUrl":null,"url":null,"abstract":"This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvolutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions taking them off the critical path while avoiding control divergence in the data parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55× and by 1.37× on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvements increase to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"50 1","pages":"1-13"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"629","resultStr":"{\"title\":\"Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing\",\"authors\":\"Jorge Albericio, Patrick Judd, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos\",\"doi\":\"10.1145/3007787.3001138\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvolutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions taking them off the critical path while avoiding control divergence in the data parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55× and by 1.37× on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvements increase to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.\",\"PeriodicalId\":6634,\"journal\":{\"name\":\"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)\",\"volume\":\"50 1\",\"pages\":\"1-13\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"629\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3007787.3001138\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3007787.3001138","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 629

摘要

这项工作观察到,深度神经网络(dnn)执行的大部分计算本质上是无效的,因为它们涉及其中一个输入为零的乘法。这一观察结果激发了Cnvolutin (CNV)的灵感,这是一种基于价值的硬件加速方法,消除了大多数无效操作,在不损失精度的情况下,提高了最先进的加速器的性能和能量。CNV使用分层数据并行单元,允许车道组独立进行,使它们能够跳过无效的计算。协同设计的数据存储格式对计算消除决策进行编码,使其远离关键路径,同时避免了数据并行单元中的控制发散。这些单元和数据存储格式结合在一起,形成了一个数据并行架构,该架构保持了对其内存层次结构的宽、对齐访问,并使其数据通道保持繁忙。通过放宽无效计算识别标准,CNV可以进一步提高性能和能效,如果精度损失是可以接受的,则效果会更好。对一组最先进的dnn进行图像分类的实验测量表明,通过去除零值操作数乘法,CNV将最先进的加速器的性能从1.24倍提高到1.55倍,平均提高1.37倍,而精度没有任何损失。虽然CNV会产生4.49%的面积开销,但它将整体EDP(能量延迟积)和ED2P(能量延迟平方积)平均分别提高了1.47倍和2.01倍。使用更广泛的无效识别策略时,平均性能提高到1.52倍,而精度没有任何损失。进一步的改进证明了准确性的损失。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing
This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvolutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions taking them off the critical path while avoiding control divergence in the data parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55× and by 1.37× on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvements increase to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
RelaxFault Memory Repair Boosting Access Parallelism to PCM-Based Main Memory Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems Energy Efficient Architecture for Graph Analytics Accelerators
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1