Performance Trade-offs in Weight Quantization for Memory-Efficient Inference

Pablo M. Tostado, B. Pedroni, G. Cauwenberghs
{"title":"Performance Trade-offs in Weight Quantization for Memory-Efficient Inference","authors":"Pablo M. Tostado, B. Pedroni, G. Cauwenberghs","doi":"10.1109/AICAS.2019.8771473","DOIUrl":null,"url":null,"abstract":"Over the past decade, Deep Neural Networks (DNNs) trained using Deep Learning (DL) frameworks have become the workhorse to solve a wide variety of computational tasks in big data environments. To date, DL DNNs have relied on large amounts of computational power to reach peak performance, typically relying on the high computational bandwidth of GPUs, while straining available memory bandwidth and capacity. With ever increasing data complexity and more stringent energy constraints in Internet-of-Things (IoT) application environments, there has been a growing interest in the development of more efficient DNN inference methods that economize on random-access memory usage in weight access. Herein, we present a systematic analysis of the performance trade-offs of quantized weight representations at variable bit length for memory-efficient inference in pre-trained DNN models. In this work, we vary the mantissa and exponent bit lengths in the representation of the network parameters and examine the effect of DropOut regularization during pre-training and the impact of two different weight truncation mechanisms: stochastic and deterministic rounding. We show drastic reduction in the memory need, down to 4 bits per weight, while maintaining near-optimal test performance of low-complexity DNNs pre-trained on the MNIST and CIFAR-10 datasets. These results offer a simple methodology to achieve high memory and computation efficiency of inference in DNN dedicated low-power hardware for IoT, directly from pre-trained, high-resolution DNNs using standard DL algorithms.","PeriodicalId":273095,"journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AICAS.2019.8771473","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Over the past decade, Deep Neural Networks (DNNs) trained using Deep Learning (DL) frameworks have become the workhorse for solving a wide variety of computational tasks in big data environments. To date, DL DNNs have relied on large amounts of computational power to reach peak performance, typically exploiting the high computational bandwidth of GPUs while straining available memory bandwidth and capacity. With ever-increasing data complexity and more stringent energy constraints in Internet-of-Things (IoT) application environments, there has been growing interest in more efficient DNN inference methods that economize on random-access memory usage in weight access. Herein, we present a systematic analysis of the performance trade-offs of quantized weight representations at variable bit length for memory-efficient inference in pre-trained DNN models. In this work, we vary the mantissa and exponent bit lengths in the representation of the network parameters and examine the effect of DropOut regularization during pre-training as well as the impact of two different weight truncation mechanisms: stochastic and deterministic rounding. We demonstrate a drastic reduction in memory requirements, down to 4 bits per weight, while maintaining near-optimal test performance of low-complexity DNNs pre-trained on the MNIST and CIFAR-10 datasets. These results offer a simple methodology for achieving high memory and computational efficiency of inference in DNN-dedicated low-power hardware for IoT, directly from pre-trained, high-resolution DNNs obtained with standard DL algorithms.
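The sketch below illustrates (but does not reproduce the authors' code for) the two weight-truncation mechanisms named in the abstract: deterministic round-to-nearest and stochastic rounding of pre-trained weights onto a reduced-precision floating-point grid with configurable exponent and mantissa bit lengths. All function and parameter names (e.g. `quantize_weights`, `exp_bits`, `man_bits`) are illustrative assumptions, and the 4-bit split used in the usage example is only one possible allocation of sign, exponent, and mantissa bits.

```python
# Minimal sketch of reduced-precision weight quantization with deterministic
# vs. stochastic rounding. Names and the bit allocation are assumptions, not
# the paper's implementation.
import numpy as np

def quantize_weights(w, exp_bits=3, man_bits=4, stochastic=False, rng=None):
    """Quantize an array of weights to a low-precision float-like grid.

    exp_bits   -- exponent bits (sets the representable dynamic range)
    man_bits   -- mantissa bits (sets the step size within each binade)
    stochastic -- if True, round up with probability equal to the fractional
                  remainder; otherwise round to the nearest grid point.
    """
    rng = rng or np.random.default_rng()
    w = np.asarray(w, dtype=np.float64)

    # Representable exponent range for an exp_bits-wide exponent field.
    max_exp = 2 ** (exp_bits - 1) - 1
    min_exp = -(2 ** (exp_bits - 1))

    sign = np.sign(w)
    mag = np.abs(w)

    # Binade of each weight, clipped to the representable range.
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    exp = np.clip(exp, min_exp, max_exp)

    # Quantization step within each binade: 2^exp / 2^man_bits.
    step = np.exp2(exp - man_bits)
    scaled = mag / step

    if stochastic:
        # Stochastic rounding: round up with probability (scaled - floor).
        floor_val = np.floor(scaled)
        q = floor_val + (rng.random(scaled.shape) < (scaled - floor_val))
    else:
        # Deterministic rounding: round to nearest.
        q = np.rint(scaled)

    # Clip to the largest representable magnitude and restore sign.
    max_val = (2.0 - np.exp2(-float(man_bits))) * np.exp2(max_exp)
    return sign * np.minimum(q * step, max_val)

# Usage: quantize Gaussian weights under a 4-bit-per-weight budget
# (here assumed as 1 sign + 2 exponent + 1 mantissa bits) with both modes.
w = np.random.randn(5)
print(quantize_weights(w, exp_bits=2, man_bits=1, stochastic=False))
print(quantize_weights(w, exp_bits=2, man_bits=1, stochastic=True))
```

In this sketch, deterministic rounding introduces a systematic bias toward the nearest grid point, while stochastic rounding is unbiased in expectation; the paper compares the two as post-training truncation mechanisms applied to pre-trained weights.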