7.7 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16

Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, H. Yoo
Published in: 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019-02-01
DOI: 10.1109/ISSCC.2019.8662302 (https://doi.org/10.1109/ISSCC.2019.8662302)
Citations: 96

Abstract

Recently, deep neural network (DNN) hardware accelerators have been reported for energy-efficient deep learning (DL) acceleration [1–6]. Most prior DNN inference accelerators rely on models trained in the cloud using public datasets; the parameters are then downloaded to the device to implement AI [1–5]. However, local DNN learning with domain-specific and private data is required to meet various user preferences on edge or mobile devices. Since edge and mobile devices have only limited computation capability and run on battery power, an energy-efficient DNN learning processor is necessary. Only [6] supported on-chip DNN learning, but it was not energy-efficient, as it did not exploit sparsity, which accounts for 37%-61% of the inputs for various CNNs, such as VGG16, AlexNet and ResNet-18, as shown in Fig. 7.7.1. Although [3–5] exploited sparsity, they considered only the inference phase with inter-channel accumulation in Fig. 7.7.1, and did not support intra-channel accumulation for the weight-gradient generation (WG) step of the learning phase. Also, [6] adopted FP16, but it was not energy-optimal because FP8 is sufficient for many input operands and consumes 4× less energy than FP16.
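To make the sparsity argument concrete, the following is a minimal sketch of the arithmetic only, not of the LNPU hardware or its dataflow. It assumes a single input channel, a single output channel, a stride-1 convolution without padding, and ReLU-produced zeros in the input; all function names and sizes are hypothetical. The weight gradient for each kernel position is accumulated over the spatial positions of one channel, which is how intra-channel accumulation is read here, and zero inputs are skipped because they contribute nothing to the sum.

```python
import random


def relu(x):
    """ReLU activation; this is what introduces exact zeros into the input."""
    return x if x > 0.0 else 0.0


def weight_gradient(inp, grad_out, k, skip_zeros):
    """Weight gradient of a stride-1, unpadded, single-channel convolution.

    dW[kh][kw] = sum over (y, x) of inp[y + kh][x + kw] * grad_out[y][x]

    The sum runs over spatial positions within ONE channel (intra-channel
    accumulation). Returns (dW, number of multiply-accumulates performed).
    """
    out_h, out_w = len(grad_out), len(grad_out[0])
    dw = [[0.0] * k for _ in range(k)]
    macs = 0
    for kh in range(k):
        for kw in range(k):
            acc = 0.0
            for y in range(out_h):
                for x in range(out_w):
                    a = inp[y + kh][x + kw]
                    if skip_zeros and a == 0.0:
                        continue  # a zero input adds nothing to the sum
                    acc += a * grad_out[y][x]
                    macs += 1
            dw[kh][kw] = acc
    return dw, macs


def make_example(size=8, k=3, seed=0):
    """Hypothetical test data: a ReLU-sparsified input and a dense output gradient."""
    rng = random.Random(seed)
    inp = [[relu(rng.uniform(-1.0, 1.0)) for _ in range(size)] for _ in range(size)]
    out = size - k + 1
    grad_out = [[rng.uniform(-1.0, 1.0) for _ in range(out)] for _ in range(out)]
    return inp, grad_out


inp, grad_out = make_example()
dense_dw, dense_macs = weight_gradient(inp, grad_out, 3, skip_zeros=False)
sparse_dw, sparse_macs = weight_gradient(inp, grad_out, 3, skip_zeros=True)

zeros = sum(v == 0.0 for row in inp for v in row)
print(f"input sparsity: {zeros / 64:.0%}")
print(f"MACs dense: {dense_macs}, MACs with zero skipping: {sparse_macs}")
```

Both calls return the same gradient; only the number of multiply-accumulates differs, and the saving grows with the fraction of zero inputs.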