SumMerge:权重重复感知深度神经网络推理的有效算法和实现

Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher
{"title":"SumMerge:权重重复感知深度神经网络推理的有效算法和实现","authors":"Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher","doi":"10.1145/3447818.3460375","DOIUrl":null,"url":null,"abstract":"Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains now relying on Deep Learning. A recent promising direction to speed-up inference is to exploit \\emph{weight repetition}. The key observation is that due to DNN quantization schemes---which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight---the same weight is bound to repeat many times within and across filters. This enables a weight-repetition aware inference kernel to factorize and memoize out common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity in the inference operation and hence (up to this point) has required custom hardware accelerators to derive net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques to make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs \\emph{with weight-dependent structure}. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized library oneDNN and the prior-art weight repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves a speedup of between 1.09x-2.05x and 1.04x-1.51x relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7x to 15.4x.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"47 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference\",\"authors\":\"Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher\",\"doi\":\"10.1145/3447818.3460375\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains now relying on Deep Learning. A recent promising direction to speed-up inference is to exploit \\\\emph{weight repetition}. The key observation is that due to DNN quantization schemes---which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight---the same weight is bound to repeat many times within and across filters. This enables a weight-repetition aware inference kernel to factorize and memoize out common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity in the inference operation and hence (up to this point) has required custom hardware accelerators to derive net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques to make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs \\\\emph{with weight-dependent structure}. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized library oneDNN and the prior-art weight repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves a speedup of between 1.09x-2.05x and 1.04x-1.51x relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7x to 15.4x.\",\"PeriodicalId\":73273,\"journal\":{\"name\":\"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing\",\"volume\":\"47 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-06-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3447818.3460375\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447818.3460375","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

摘要

深度神经网络(DNN)的推理效率是目前依赖深度学习的众多领域的一个关键问题。最近一个有希望的加速推理的方向是利用\emph{权重重复}。关键的观察结果是,由于DNN量化方案——试图通过减少表示每个权重所需的比特数来减少DNN存储需求——相同的权重必然会在过滤器内部和跨过滤器重复多次。这使得权重重复感知推理内核能够分解和记忆常见的子计算,减少每个推理的算术,同时仍然保持量化的压缩优势。然而,重大挑战依然存在。例如,权重重复在推理操作中引入了明显的不规则性,因此(到目前为止)需要定制硬件加速器来获得净收益。本文提出了SumMerge:一种新的算法和一套实现技术,使权重重复在cpu等通用设备上实现。关键思想是将推理表述为遍历\emph{具有权重相关结构}的数据流图序列。我们开发了一种离线启发式方法来选择一个数据流图结构,该结构可以最大限度地减少每个推理(给定训练过的权重值)的算术运算,并使用一个有效的在线过程来遍历每个数据流图并计算给定DNN输入的推理结果。我们将上述实现为一个优化的c++例程,该例程运行在带有矢量扩展的商用多核处理器上,并相对于英特尔优化的库oneDNN和现有技术的权重重复算法(AGR)评估性能。当应用于六种不同的量化方案时,SumMerge相对于oneDNN和AGR分别实现了1.09 -2.05倍和1.04 -1.51倍的加速,同时将DNN模型压缩了8.7倍至15.4倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference
Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains now relying on Deep Learning. A recent promising direction to speed-up inference is to exploit \emph{weight repetition}. The key observation is that due to DNN quantization schemes---which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight---the same weight is bound to repeat many times within and across filters. This enables a weight-repetition aware inference kernel to factorize and memoize out common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity in the inference operation and hence (up to this point) has required custom hardware accelerators to derive net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques to make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs \emph{with weight-dependent structure}. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized library oneDNN and the prior-art weight repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves a speedup of between 1.09x-2.05x and 1.04x-1.51x relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7x to 15.4x.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Accelerating BWA-MEM Read Mapping on GPUs. Dynamic Memory Management in Massively Parallel Systems: A Case on GPUs. Priority Algorithms with Advice for Disjoint Path Allocation Problems From Data of Internet of Things to Domain Knowledge: A Case Study of Exploration in Smart Agriculture On Two Variants of Induced Matchings
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1