7.7 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16

2019 IEEE International Solid- State Circuits Conference - (ISSCC) Pub Date : 2019-02-01 DOI:10.1109/ISSCC.2019.8662302

Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, H. Yoo

{"title":"7.7 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16","authors":"Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, H. Yoo","doi":"10.1109/ISSCC.2019.8662302","DOIUrl":null,"url":null,"abstract":"Recently, deep neural network (DNN) hardware accelerators have been reported for energy-efficient deep learning (DL) acceleration [1–6]. Most prior DNN inference accelerators are trained in the cloud using public datasets; parameters are then downloaded to implement AI [1–5]. However, local DNN learning with domain-specific and private data is required meet various user preferences on edge or mobile devices. Since edge and mobile devices contain only limited computation capability with battery power, an energy-efficient DNN learning processor is necessary. Only [6] supported on-chip DNN learning, but it was not energy-efficient, as it did not utilize sparsity which represents 37%-61% of the inputs for various CNNs, such as VGG16, AlexNet and ResNet-18, as shown in Fig. 7.7.1. Although [3–5] utilized the sparsity, they only considered the inference phase with inter-channel accumulation in Fig. 7.7.1, and did not support intra-channel accumulation for the weight-gradient generation (WG) step of the learning phase. Also, [6] adopted FP16, but it was not energy optimal because FP8 is enough for many input operands with 4× less energy than FP16.","PeriodicalId":265551,"journal":{"name":"2019 IEEE International Solid- State Circuits Conference - (ISSCC)","volume":"298 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"96","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Solid- State Circuits Conference - (ISSCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSCC.2019.8662302","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 96

Abstract

Recently, deep neural network (DNN) hardware accelerators have been reported for energy-efficient deep learning (DL) acceleration [1–6]. Most prior DNN inference accelerators are trained in the cloud using public datasets; parameters are then downloaded to implement AI [1–5]. However, local DNN learning with domain-specific and private data is required meet various user preferences on edge or mobile devices. Since edge and mobile devices contain only limited computation capability with battery power, an energy-efficient DNN learning processor is necessary. Only [6] supported on-chip DNN learning, but it was not energy-efficient, as it did not utilize sparsity which represents 37%-61% of the inputs for various CNNs, such as VGG16, AlexNet and ResNet-18, as shown in Fig. 7.7.1. Although [3–5] utilized the sparsity, they only considered the inference phase with inter-channel accumulation in Fig. 7.7.1, and did not support intra-channel accumulation for the weight-gradient generation (WG) step of the learning phase. Also, [6] adopted FP16, but it was not energy optimal because FP8 is enough for many input operands with 4× less energy than FP16.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

7.7 LNPU: 25.3TFLOPS/W的稀疏深度神经网络学习处理器，具有FP8-FP16的细粒度混合精度

最近，深度神经网络(DNN)硬件加速器被报道用于节能深度学习(DL)加速[1-6]。大多数先前的DNN推理加速器是使用公共数据集在云中训练的;然后下载参数来实现AI[1-5]。然而，需要使用特定领域和私有数据进行局部深度神经网络学习，以满足边缘或移动设备上的各种用户偏好。由于边缘设备和移动设备只有有限的电池计算能力，因此需要一个节能的深度神经网络学习处理器。只有[6]支持片上DNN学习，但它并不节能，因为它没有利用稀疏性，稀疏性占各种cnn(如VGG16, AlexNet和ResNet-18)输入的37%-61%，如图7.7.1所示。虽然[3-5]利用了稀疏性，但他们只考虑了图7.7.1中具有通道间积累的推理阶段，而不支持学习阶段的权重梯度生成(WG)步骤的通道内积累。[6]也采用了FP16，但它不是能量最优的，因为FP8可以满足许多输入操作数，而能量比FP16少4倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2019 IEEE International Solid- State Circuits Conference - (ISSCC)

自引率

0.00%

发文量