Latency-aware automatic CNN channel pruning with GPU runtime analysis

Jiaqiang Liu, Jingwei Sun, Zhongtian Xu, Guangzhong Sun
{"title":"Latency-aware automatic CNN channel pruning with GPU runtime analysis","authors":"Jiaqiang Liu,&nbsp;Jingwei Sun,&nbsp;Zhongtian Xu,&nbsp;Guangzhong Sun","doi":"10.1016/j.tbench.2021.100009","DOIUrl":null,"url":null,"abstract":"<div><p>The huge storage and computation cost of convolutional neural networks (CNN) make them challenging to meet the real-time inference requirement in many applications. Existing channel pruning methods mainly focus on removing unimportant channels in a CNN model based on rule-of-thumb designs, using reduced floating-point operations (FLOPs) and parameter numbers to measure the pruning quality. The inference latency of pruned models is often overlooked. In this paper, we propose a latency-aware automatic CNN channel pruning method (LACP), which aims to search low latency and accurate pruned network structure automatically. We evaluate the inaccuracy of measuring pruning quality by FLOPs and the number of parameters, and use the model inference latency as the direct optimization metric. To bridge model pruning and inference acceleration, we analyze the inference latency of convolutional layers on GPU. Results show that the inference latency of convolutional layers exhibits a staircase pattern along with channel number due to the GPU tail effect. Based on that observation, we greatly shrink the search space of network structures. Then we apply an evolutionary procedure to search a computationally efficient pruned network structure, which reduces the inference latency and maintains the model accuracy. Experiments and comparisons with state-of-the-art methods on three image classification datasets show that our method can achieve better inference acceleration with less accuracy loss.</p></div>","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"1 1","pages":"Article 100009"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2772485921000090/pdfft?md5=e3e618453811eda67a8549ff2a96e500&pid=1-s2.0-S2772485921000090-main.pdf","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772485921000090","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

The huge storage and computation costs of convolutional neural networks (CNNs) make it challenging for them to meet the real-time inference requirements of many applications. Existing channel pruning methods mainly focus on removing unimportant channels from a CNN model based on rule-of-thumb designs, using the reduction in floating-point operations (FLOPs) and parameter count to measure pruning quality. The inference latency of the pruned models is often overlooked. In this paper, we propose a latency-aware automatic CNN channel pruning method (LACP), which automatically searches for accurate pruned network structures with low latency. We show that FLOPs and parameter count are inaccurate proxies for pruning quality, and instead use model inference latency as the direct optimization metric. To bridge model pruning and inference acceleration, we analyze the inference latency of convolutional layers on a GPU. The results show that, owing to the GPU tail effect, the inference latency of a convolutional layer follows a staircase pattern as the channel count varies. Based on this observation, we greatly shrink the search space of network structures. We then apply an evolutionary procedure to search for a computationally efficient pruned network structure that reduces inference latency while maintaining model accuracy. Experiments and comparisons with state-of-the-art methods on three image classification datasets show that our method achieves better inference acceleration with less accuracy loss.
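
To make the staircase observation concrete, below is a minimal profiling sketch, not the authors' implementation: the layer shape, channel sweep range, and timing parameters are illustrative assumptions. It times a single convolutional layer on a GPU with PyTorch CUDA events while sweeping the output channel count.

```python
# Minimal sketch (assumed setup, not the paper's code): measure how the GPU
# latency of one conv layer changes as its output channel count is swept.
import torch

def conv_latency_ms(out_channels, in_channels=256, spatial=56,
                    kernel_size=3, warmup=10, repeats=50):
    """Median GPU latency (ms) of one conv layer for a given channel count."""
    device = torch.device("cuda")
    conv = torch.nn.Conv2d(in_channels, out_channels, kernel_size,
                           padding=kernel_size // 2).to(device).eval()
    x = torch.randn(1, in_channels, spatial, spatial, device=device)

    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels / cuDNN autotuning
            conv(x)
        torch.cuda.synchronize()

        times = []
        for _ in range(repeats):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            conv(x)
            end.record()
            torch.cuda.synchronize()     # wait so the event timestamps are valid
            times.append(start.elapsed_time(end))  # milliseconds
    times.sort()
    return times[len(times) // 2]

if __name__ == "__main__":
    # Sweep channel counts; nearby counts often land on the same latency
    # "step", so a pruning search only needs to consider step boundaries.
    for c in range(32, 513, 16):
        print(f"{c:4d} channels -> {conv_latency_ms(c):.3f} ms")
```

Plotting the printed latencies against the channel count typically yields flat plateaus separated by jumps: many adjacent channel counts share one latency step because the GPU schedules work in whole waves of thread blocks (the tail effect). This is what allows the pruning search to consider only the channel counts at step boundaries, greatly shrinking the search space without losing latency resolution.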
