Vertical Layering of Quantized Neural Networks for Heterogeneous Inference

IEEE Transactions on Pattern Analysis and Machine Intelligence | IF 18.6 | Zone 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2022-12-10 | DOI: 10.48550/arXiv.2212.05326

Hai Wu, Ruifei He, Hao Hao Tan, Xiaojuan Qi, Kaibin Huang



Abstract

Although considerable progress has been made in neural network quantization for efficient inference, existing methods do not scale to heterogeneous devices, because a dedicated model must be trained, transmitted, and stored for each specific hardware setting, which incurs considerable training and maintenance costs. In this paper, we study a new vertical-layered representation of neural network weights that encapsulates all quantized models in a single one. It represents weights as a group of bits (vertical layers) organized from the most significant bit (also called the basic layer) to the less significant bits (enhance layers). Hence, a neural network with an arbitrary quantization precision can be obtained by adding the corresponding enhance layers to the basic layer. However, we empirically find that models obtained with existing quantization methods suffer severe performance degradation when adapted to the vertical-layered weight representation. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design combines a cascade downsampling mechanism with multi-objective optimization to train the shared source model weights, so that they are updated simultaneously while accounting for the performance of all networks. After the model is trained, a vertical-layered network is constructed as follows: the lowest bit-width quantized weights become the basic layer, and every bit dropped along the downsampling process acts as an enhance layer. Our design is extensively evaluated on the CIFAR-100 and ImageNet datasets. Experiments show that the proposed vertical-layered representation and the once QAT scheme effectively embody multiple quantized networks in a single one, allow one-time training, and deliver performance comparable to that of quantized models tailored to any specific bit-width.
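The bit-level layout described in the abstract can be illustrated with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: weights are assumed to be already quantized to unsigned 8-bit integer codes, the basic layer is assumed to hold the top 2 bits, and lower precisions are obtained by plain truncation (a right shift), which may differ from the paper's cascade downsampling. The function names `split_vertical_layers` and `assemble` are hypothetical.

```python
import numpy as np


def split_vertical_layers(w_q, total_bits, base_bits):
    """Split unsigned integer weight codes into a basic layer and 1-bit enhance layers.

    Assumption: plain bit truncation; the paper's downsampling rule may differ.
    """
    w_q = np.asarray(w_q, dtype=np.uint8)
    # Basic layer: the `base_bits` most significant bits of every code.
    base = w_q >> (total_bits - base_bits)
    # Enhance layers: the remaining bits, ordered from more to less significant.
    enhance = [
        (w_q >> shift) & 1
        for shift in range(total_bits - base_bits - 1, -1, -1)
    ]
    return base, enhance


def assemble(base, enhance, num_enhance):
    """Rebuild codes at (base_bits + num_enhance) bits by appending enhance layers."""
    w = base.copy()
    for bit in enhance[:num_enhance]:
        # Shift the current code left and append the next less significant bit.
        w = (w << 1) | bit
    return w


# Hypothetical 8-bit weight codes, used only for illustration.
codes = np.array([0b10110110, 0b00001111, 0b11111111, 0b01000001], dtype=np.uint8)
base, enhance = split_vertical_layers(codes, total_bits=8, base_bits=2)

print(base.tolist())                         # 2-bit model: [2, 0, 3, 1]
print(assemble(base, enhance, 2).tolist())   # 4-bit model: [11, 0, 15, 4]
print(assemble(base, enhance, 6).tolist())   # 8-bit model: [182, 15, 255, 65]
```

In this sketch a device that needs a 4-bit model only has to receive the basic layer plus the first two enhance layers, while the full set of layers reproduces the original 8-bit codes exactly.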




Source journal: IEEE Transactions on Pattern Analysis and Machine Intelligence (Engineering & Technology: Electrical and Electronic Engineering)

CiteScore: 28.40
Self-citation rate: 3.00%
Articles published: 885
Review time: 8.5 months

About the journal: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.