Across the Stack Opportunities for Deep Learning Acceleration

V. Srinivasan, B. Fleischer, Sunil Shukla, M. Ziegler, J. Silberman, Jinwook Oh, Jungwook Choi, S. Mueller, A. Agrawal, Tina Babinsky, N. Cao, Chia-Yu Chen, P. Chuang, T. Fox, G. Gristede, Michael Guillorn, Howard Haynie, M. Klaiber, Dongsoo Lee, S. Lo, G. Maier, M. Scheuermann, Swagath Venkataramani, Christos Vezyrtzis, Naigang Wang, F. Yee, Ching Zhou, P. Lu, B. Curran, Leland Chang, K. Gopalakrishnan
{"title":"Across the Stack Opportunities for Deep Learning Acceleration","authors":"V. Srinivasan, B. Fleischer, Sunil Shukla, M. Ziegler, J. Silberman, Jinwook Oh, Jungwook Choi, S. Mueller, A. Agrawal, Tina Babinsky, N. Cao, Chia-Yu Chen, P. Chuang, T. Fox, G. Gristede, Michael Guillorn, Howard Haynie, M. Klaiber, Dongsoo Lee, S. Lo, G. Maier, M. Scheuermann, Swagath Venkataramani, Christos Vezyrtzis, Naigang Wang, F. Yee, Ching Zhou, P. Lu, B. Curran, Leland Chang, K. Gopalakrishnan","doi":"10.1145/3218603.3241339","DOIUrl":null,"url":null,"abstract":"The combination of growth in compute capabilities and availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become state-of-the-art in a variety of machine learning tasks spanning domains across vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the expense of 100s of ExaOps of computation; posing significant challenges to efficient large-scale deployment in both resource-constrained environments and data centers. One of the key enablers to improve operational efficiency of DNNs is the observation that when extracting deep insight from vast quantities of structured and unstructured data the exactness imposed by traditional computing is not required. Relaxing the \"exactness\" constraint enables exploiting opportunities for approximate computing across all layers of the system stack. In this talk we present a multi-TOPS AI core [3] for acceleration of deep learning training and inference in systems from edge devices to data centers. We demonstrate that to derive high sustained utilization and energy efficiency from the AI core requires ground-up re-thinking to exploit approximate computing across the stack including algorithms, architecture, programmability, and hardware. Model accuracy is the fundamental measure of deep learning quality. The compute engine precision in our AI core is carefully calibrated to realize significant reduction in area and power while not compromising numerical accuracy. Our research at the DL algorithms/applications-level [2] shows that it is possible to carefully tune the precision of both weights and activations to as low as 2-bits for inference and was used to guide the choices of compute precision supported in the architecture and hardware for both training and inference. Similarly, distributed DL training's scalability is impacted by the communication overhead to exchange gradients and weights after each mini-batch. Our research on gradient compression [1] shows by selectively sending gradients larger than a threshold, and by further choosing the threshold based on the importance of the gradient we achieve achieve compression ratio of 40X for convolutional layers, and up to 200X for fully-connected layers of the network without losing model accuracy. These results guide the choice of interconnection network topology exploration for a system of accelerators built using the AI core. Overall, our work shows how the benefits from exploiting approximation using algorithm/application's robustness to tolerate reduced precision, and compressed data communication can be combined effectively with the architecture and hardware of the accelerator designed to support these reduced-precision computation and compressed data communication. Our results demonstate improved end-to-end efficiency of the DL accelerator across different metrics such as high sustained TOPs, high TOPs/watt and TOPs/mm2 catering to different operating environments for both training and inference.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3218603.3241339","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The combination of growth in compute capabilities and availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become state-of-the-art in a variety of machine learning tasks spanning domains across vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the expense of 100s of ExaOps of computation; posing significant challenges to efficient large-scale deployment in both resource-constrained environments and data centers. One of the key enablers to improve operational efficiency of DNNs is the observation that when extracting deep insight from vast quantities of structured and unstructured data the exactness imposed by traditional computing is not required. Relaxing the "exactness" constraint enables exploiting opportunities for approximate computing across all layers of the system stack. In this talk we present a multi-TOPS AI core [3] for acceleration of deep learning training and inference in systems from edge devices to data centers. We demonstrate that to derive high sustained utilization and energy efficiency from the AI core requires ground-up re-thinking to exploit approximate computing across the stack including algorithms, architecture, programmability, and hardware. Model accuracy is the fundamental measure of deep learning quality. The compute engine precision in our AI core is carefully calibrated to realize significant reduction in area and power while not compromising numerical accuracy. Our research at the DL algorithms/applications-level [2] shows that it is possible to carefully tune the precision of both weights and activations to as low as 2-bits for inference and was used to guide the choices of compute precision supported in the architecture and hardware for both training and inference. Similarly, distributed DL training's scalability is impacted by the communication overhead to exchange gradients and weights after each mini-batch. Our research on gradient compression [1] shows by selectively sending gradients larger than a threshold, and by further choosing the threshold based on the importance of the gradient we achieve achieve compression ratio of 40X for convolutional layers, and up to 200X for fully-connected layers of the network without losing model accuracy. These results guide the choice of interconnection network topology exploration for a system of accelerators built using the AI core. Overall, our work shows how the benefits from exploiting approximation using algorithm/application's robustness to tolerate reduced precision, and compressed data communication can be combined effectively with the architecture and hardware of the accelerator designed to support these reduced-precision computation and compressed data communication. Our results demonstate improved end-to-end efficiency of the DL accelerator across different metrics such as high sustained TOPs, high TOPs/watt and TOPs/mm2 catering to different operating environments for both training and inference.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
计算能力的增长和大数据集的可用性的结合导致了深度学习的重生。深度神经网络(dnn)已经成为跨越视觉、语音和机器翻译领域的各种机器学习任务的最先进技术。深度学习(DL)在这些任务中实现了很高的准确性,但代价是要花费100亿次的计算;对资源受限环境和数据中心的高效大规模部署提出了重大挑战。提高深度神经网络运行效率的关键因素之一是,当从大量结构化和非结构化数据中提取深度洞察时,不需要传统计算所施加的准确性。放松“精确性”约束可以利用跨系统堆栈的所有层进行近似计算的机会。在这次演讲中,我们提出了一个多tops AI核心[3],用于加速从边缘设备到数据中心的系统中的深度学习训练和推理。我们证明,要从人工智能核心获得高持续利用率和能源效率,需要从头开始重新思考,以利用跨堆栈的近似计算,包括算法、架构、可编程性和硬件。模型精度是衡量深度学习质量的基本标准。我们AI核心的计算引擎精度经过仔细校准,以实现面积和功率的显着减少,同时不影响数值精度。我们在深度学习算法/应用程序级别的研究[2]表明,可以仔细地将权重和激活的精度调至低至2位的推理,并用于指导架构和硬件中支持的计算精度的选择,以用于训练和推理。类似地,分布式深度学习训练的可扩展性受到每个小批后交换梯度和权重的通信开销的影响。我们对梯度压缩的研究[1]表明,通过选择性地发送大于阈值的梯度,并根据梯度的重要性进一步选择阈值,我们可以在不损失模型精度的情况下,实现卷积层的压缩比为40X,网络的全连接层的压缩比高达200X。这些结果指导了使用AI核心构建的加速器系统互连网络拓扑探索的选择。总的来说,我们的工作显示了利用算法/应用程序的鲁棒性来容忍降低的精度和压缩数据通信的近似的好处,可以有效地与加速器的架构和硬件相结合,以支持这些降低精度的计算和压缩数据通信。我们的研究结果表明,DL加速器在不同指标(如高持续TOPs、高TOPs/watt和TOPs/mm2)上的端到端效率得到了提高,可用于训练和推理的不同操作环境。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Adiabatic and Clock-Powered Circuits Power Macro-Models for High-Level Power Estimation Stand-By Power Reduction for SRAM Memories Leakage in CMOS Nanometric Technologies Evolution of Deep Submicron Bulk and SOI Technologies
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1