LBFP: Logarithmic Block Floating Point Arithmetic for Deep Neural Networks

Chao Ni, Jinming Lu, Jun Lin, Zhongfeng Wang
{"title":"深度神经网络的对数块浮点算法","authors":"Chao Ni, Jinming Lu, Jun Lin, Zhongfeng Wang","doi":"10.1109/APCCAS50809.2020.9301687","DOIUrl":null,"url":null,"abstract":"Fixed-point quantization techniques have attracted considerable attention in deep neural network (DNN) inference acceleration. Nevertheless, they often require time-consuming fine-tuning or retraining to keep the accuracy of a quantized model. Besides, DNNs involve massive multiplication operations, which are of much higher computational complexities compared with addition operations. To deal with the two problems, we propose an improved numerical format named logarithmic block floating point (LBFP) for post-training quantization. Firstly, logarithmic arithmetic is employed to convert multiplication operations to addition and shift operations. Then, Kullback-Leibler divergence is used to determine the shared exponent before inference. Thus, LBFP can significantly reduce the hard-ware complexity with negligible performance loss. Moreover, an efficient hardware architecture is designed to support the computation of LBFP. Hardware synthesis results show that our 8-bit LBFP multiplier can reduce power and area by 53% and 45%, respectively, compared with the 8-bit traditional fixed-point multiplier. Finally, a software library is developed with the CUDA-C language to evaluate the inference accuracy of LBFP. Without retraining, the accuracy of the selected DNN models with the 8-bit LBFP representation is comparable to that of the corresponding 32-bit floating-point baselines, showing the great potential in efficient DNN inference acceleration.","PeriodicalId":127075,"journal":{"name":"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2020-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"LBFP: Logarithmic Block Floating Point Arithmetic for Deep Neural Networks\",\"authors\":\"Chao Ni, Jinming Lu, Jun Lin, Zhongfeng Wang\",\"doi\":\"10.1109/APCCAS50809.2020.9301687\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fixed-point quantization techniques have attracted considerable attention in deep neural network (DNN) inference acceleration. Nevertheless, they often require time-consuming fine-tuning or retraining to keep the accuracy of a quantized model. Besides, DNNs involve massive multiplication operations, which are of much higher computational complexities compared with addition operations. To deal with the two problems, we propose an improved numerical format named logarithmic block floating point (LBFP) for post-training quantization. Firstly, logarithmic arithmetic is employed to convert multiplication operations to addition and shift operations. Then, Kullback-Leibler divergence is used to determine the shared exponent before inference. Thus, LBFP can significantly reduce the hard-ware complexity with negligible performance loss. Moreover, an efficient hardware architecture is designed to support the computation of LBFP. Hardware synthesis results show that our 8-bit LBFP multiplier can reduce power and area by 53% and 45%, respectively, compared with the 8-bit traditional fixed-point multiplier. Finally, a software library is developed with the CUDA-C language to evaluate the inference accuracy of LBFP. 
Without retraining, the accuracy of the selected DNN models with the 8-bit LBFP representation is comparable to that of the corresponding 32-bit floating-point baselines, showing the great potential in efficient DNN inference acceleration.\",\"PeriodicalId\":127075,\"journal\":{\"name\":\"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/APCCAS50809.2020.9301687\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APCCAS50809.2020.9301687","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Fixed-point quantization techniques have attracted considerable attention in deep neural network (DNN) inference acceleration. Nevertheless, they often require time-consuming fine-tuning or retraining to preserve the accuracy of the quantized model. In addition, DNNs involve massive numbers of multiplication operations, which have much higher computational complexity than additions. To address these two problems, we propose an improved numerical format named logarithmic block floating point (LBFP) for post-training quantization. First, logarithmic arithmetic is employed to convert multiplication operations into addition and shift operations. Then, Kullback-Leibler divergence is used to determine the shared exponent before inference. As a result, LBFP significantly reduces hardware complexity with negligible performance loss. Moreover, an efficient hardware architecture is designed to support LBFP computation. Hardware synthesis results show that our 8-bit LBFP multiplier reduces power and area by 53% and 45%, respectively, compared with a traditional 8-bit fixed-point multiplier. Finally, a software library is developed in CUDA-C to evaluate the inference accuracy of LBFP. Without retraining, the accuracy of the selected DNN models with the 8-bit LBFP representation is comparable to that of the corresponding 32-bit floating-point baselines, showing great potential for efficient DNN inference acceleration.
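
To make the abstract's two key ideas concrete, below is a minimal NumPy sketch, not the paper's CUDA-C library: values in a block share one exponent while each mantissa is rounded to a power of two, so a multiply reduces to a sign flip, an exponent addition, and a shift; and the shared exponent is selected before inference by minimizing the KL divergence between the original and quantized magnitude distributions. The function names (lbfp_quantize, choose_shared_exponent, log_domain_mul), the mantissa code range, and the histogram-based calibration loop are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def lbfp_quantize(block, shared_exp, n_bits=8):
    """Quantize a block so each element becomes sign * 2**(code + shared_exp),
    with one exponent shared by the whole block and `code` a small negative
    integer (the log-domain mantissa). The code range here is an assumption."""
    block = np.asarray(block, dtype=np.float64)
    min_code = -(2 ** (n_bits - 1) - 1)             # most negative mantissa code
    sign = np.sign(block)
    mag = np.abs(block) * 2.0 ** (-shared_exp)      # factor out the shared exponent
    code = np.full(block.shape, float(min_code))    # zeros map to the smallest code
    nz = mag > 0
    code[nz] = np.clip(np.round(np.log2(mag[nz])), min_code, 0)
    return sign * 2.0 ** (code + shared_exp)


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two non-negative histograms."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    mask = p > 0
    return float(np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps))))


def choose_shared_exponent(block, candidate_exps, n_bits=8, bins=1024):
    """Pick the candidate shared exponent whose quantization grid best
    preserves the magnitude distribution of the block (lowest KL)."""
    block = np.asarray(block, dtype=np.float64)
    ref_hist, edges = np.histogram(np.abs(block), bins=bins)
    best_exp, best_kl = None, np.inf
    for e in candidate_exps:
        q = lbfp_quantize(block, shared_exp=e, n_bits=n_bits)
        q_hist, _ = np.histogram(np.abs(q), bins=edges)
        kl = kl_divergence(ref_hist.astype(float), q_hist.astype(float))
        if kl < best_kl:
            best_exp, best_kl = e, kl
    return best_exp


def log_domain_mul(sign_a, code_a, sign_b, code_b):
    """Product of two log-domain operands: signs multiply, codes add.
    In hardware the code addition plus the final 2**(...) maps to an
    adder and a barrel shifter instead of a multiplier."""
    return sign_a * sign_b, code_a + code_b


if __name__ == "__main__":
    acts = np.random.randn(64, 64)                  # stand-in calibration tensor
    e = choose_shared_exponent(acts, candidate_exps=range(-4, 5))
    q = lbfp_quantize(acts, shared_exp=e)
    print("shared exponent:", e, "mean abs error:", np.mean(np.abs(acts - q)))
```

The calibration step only mirrors the abstract's use of KL divergence as a distribution-matching criterion; the paper's actual search space, histogram construction, and rounding rules may differ.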