Performance Trade-offs in Weight Quantization for Memory-Efficient Inference

Pablo M. Tostado, B. Pedroni, G. Cauwenberghs
{"title":"Performance Trade-offs in Weight Quantization for Memory-Efficient Inference","authors":"Pablo M. Tostado, B. Pedroni, G. Cauwenberghs","doi":"10.1109/AICAS.2019.8771473","DOIUrl":null,"url":null,"abstract":"Over the past decade, Deep Neural Networks (DNNs) trained using Deep Learning (DL) frameworks have become the workhorse to solve a wide variety of computational tasks in big data environments. To date, DL DNNs have relied on large amounts of computational power to reach peak performance, typically relying on the high computational bandwidth of GPUs, while straining available memory bandwidth and capacity. With ever increasing data complexity and more stringent energy constraints in Internet-of-Things (IoT) application environments, there has been a growing interest in the development of more efficient DNN inference methods that economize on random-access memory usage in weight access. Herein, we present a systematic analysis of the performance trade-offs of quantized weight representations at variable bit length for memory-efficient inference in pre-trained DNN models. In this work, we vary the mantissa and exponent bit lengths in the representation of the network parameters and examine the effect of DropOut regularization during pre-training and the impact of two different weight truncation mechanisms: stochastic and deterministic rounding. We show drastic reduction in the memory need, down to 4 bits per weight, while maintaining near-optimal test performance of low-complexity DNNs pre-trained on the MNIST and CIFAR-10 datasets. These results offer a simple methodology to achieve high memory and computation efficiency of inference in DNN dedicated low-power hardware for IoT, directly from pre-trained, high-resolution DNNs using standard DL algorithms.","PeriodicalId":273095,"journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AICAS.2019.8771473","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Over the past decade, Deep Neural Networks (DNNs) trained using Deep Learning (DL) frameworks have become the workhorse for solving a wide variety of computational tasks in big data environments. To date, DL DNNs have relied on large amounts of computational power to reach peak performance, typically exploiting the high computational bandwidth of GPUs while straining available memory bandwidth and capacity. With ever-increasing data complexity and more stringent energy constraints in Internet-of-Things (IoT) application environments, there has been growing interest in more efficient DNN inference methods that economize on random-access memory usage in weight access. Herein, we present a systematic analysis of the performance trade-offs of quantized weight representations at variable bit length for memory-efficient inference in pre-trained DNN models. In this work, we vary the mantissa and exponent bit lengths in the representation of the network parameters and examine the effect of DropOut regularization during pre-training as well as the impact of two different weight truncation mechanisms: stochastic and deterministic rounding. We demonstrate a drastic reduction in memory requirements, down to 4 bits per weight, while maintaining near-optimal test performance of low-complexity DNNs pre-trained on the MNIST and CIFAR-10 datasets. These results offer a simple methodology for achieving high memory and computational efficiency of inference in DNN-dedicated low-power hardware for IoT, directly from pre-trained, high-resolution DNNs obtained with standard DL algorithms.
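The sketch below illustrates (but does not reproduce the authors' code for) the two weight-truncation mechanisms named in the abstract: deterministic round-to-nearest and stochastic rounding of pre-trained weights onto a reduced-precision floating-point grid with configurable exponent and mantissa bit lengths. All function and parameter names (e.g. `quantize_weights`, `exp_bits`, `man_bits`) are illustrative assumptions, and the 4-bit split used in the usage example is only one possible allocation of sign, exponent, and mantissa bits.

```python
# Minimal sketch of reduced-precision weight quantization with deterministic
# vs. stochastic rounding. Names and the bit allocation are assumptions, not
# the paper's implementation.
import numpy as np

def quantize_weights(w, exp_bits=3, man_bits=4, stochastic=False, rng=None):
    """Quantize an array of weights to a low-precision float-like grid.

    exp_bits   -- exponent bits (sets the representable dynamic range)
    man_bits   -- mantissa bits (sets the step size within each binade)
    stochastic -- if True, round up with probability equal to the fractional
                  remainder; otherwise round to the nearest grid point.
    """
    rng = rng or np.random.default_rng()
    w = np.asarray(w, dtype=np.float64)

    # Representable exponent range for an exp_bits-wide exponent field.
    max_exp = 2 ** (exp_bits - 1) - 1
    min_exp = -(2 ** (exp_bits - 1))

    sign = np.sign(w)
    mag = np.abs(w)

    # Binade of each weight, clipped to the representable range.
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    exp = np.clip(exp, min_exp, max_exp)

    # Quantization step within each binade: 2^exp / 2^man_bits.
    step = np.exp2(exp - man_bits)
    scaled = mag / step

    if stochastic:
        # Stochastic rounding: round up with probability (scaled - floor).
        floor_val = np.floor(scaled)
        q = floor_val + (rng.random(scaled.shape) < (scaled - floor_val))
    else:
        # Deterministic rounding: round to nearest.
        q = np.rint(scaled)

    # Clip to the largest representable magnitude and restore sign.
    max_val = (2.0 - np.exp2(-float(man_bits))) * np.exp2(max_exp)
    return sign * np.minimum(q * step, max_val)

# Usage: quantize Gaussian weights under a 4-bit-per-weight budget
# (here assumed as 1 sign + 2 exponent + 1 mantissa bits) with both modes.
w = np.random.randn(5)
print(quantize_weights(w, exp_bits=2, man_bits=1, stochastic=False))
print(quantize_weights(w, exp_bits=2, man_bits=1, stochastic=True))
```

In this sketch, deterministic rounding introduces a systematic bias toward the nearest grid point, while stochastic rounding is unbiased in expectation; the paper compares the two as post-training truncation mechanisms applied to pre-trained weights.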