Latency-aware automatic CNN channel pruning with GPU runtime analysis

Jiaqiang Liu, Jingwei Sun, Zhongtian Xu, Guangzhong Sun
{"title":"Latency-aware automatic CNN channel pruning with GPU runtime analysis","authors":"Jiaqiang Liu,&nbsp;Jingwei Sun,&nbsp;Zhongtian Xu,&nbsp;Guangzhong Sun","doi":"10.1016/j.tbench.2021.100009","DOIUrl":null,"url":null,"abstract":"<div><p>The huge storage and computation cost of convolutional neural networks (CNN) make them challenging to meet the real-time inference requirement in many applications. Existing channel pruning methods mainly focus on removing unimportant channels in a CNN model based on rule-of-thumb designs, using reduced floating-point operations (FLOPs) and parameter numbers to measure the pruning quality. The inference latency of pruned models is often overlooked. In this paper, we propose a latency-aware automatic CNN channel pruning method (LACP), which aims to search low latency and accurate pruned network structure automatically. We evaluate the inaccuracy of measuring pruning quality by FLOPs and the number of parameters, and use the model inference latency as the direct optimization metric. To bridge model pruning and inference acceleration, we analyze the inference latency of convolutional layers on GPU. Results show that the inference latency of convolutional layers exhibits a staircase pattern along with channel number due to the GPU tail effect. Based on that observation, we greatly shrink the search space of network structures. Then we apply an evolutionary procedure to search a computationally efficient pruned network structure, which reduces the inference latency and maintains the model accuracy. Experiments and comparisons with state-of-the-art methods on three image classification datasets show that our method can achieve better inference acceleration with less accuracy loss.</p></div>","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"1 1","pages":"Article 100009"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2772485921000090/pdfft?md5=e3e618453811eda67a8549ff2a96e500&pid=1-s2.0-S2772485921000090-main.pdf","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772485921000090","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

The huge storage and computation costs of convolutional neural networks (CNNs) make it challenging for them to meet the real-time inference requirements of many applications. Existing channel pruning methods mainly focus on removing unimportant channels from a CNN model based on rule-of-thumb designs, using the reduction in floating-point operations (FLOPs) and parameter count to measure pruning quality. The inference latency of the pruned models is often overlooked. In this paper, we propose a latency-aware automatic CNN channel pruning method (LACP), which automatically searches for accurate pruned network structures with low latency. We show that FLOPs and parameter count are inaccurate proxies for pruning quality, and instead use model inference latency as the direct optimization metric. To bridge model pruning and inference acceleration, we analyze the inference latency of convolutional layers on a GPU. The results show that, owing to the GPU tail effect, the inference latency of a convolutional layer follows a staircase pattern as the channel count varies. Based on this observation, we greatly shrink the search space of network structures. We then apply an evolutionary procedure to search for a computationally efficient pruned network structure that reduces inference latency while maintaining model accuracy. Experiments and comparisons with state-of-the-art methods on three image classification datasets show that our method achieves better inference acceleration with less accuracy loss.
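
To make the staircase observation concrete, below is a minimal profiling sketch, not the authors' implementation: the layer shape, channel sweep range, and timing parameters are illustrative assumptions. It times a single convolutional layer on a GPU with PyTorch CUDA events while sweeping the output channel count.

```python
# Minimal sketch (assumed setup, not the paper's code): measure how the GPU
# latency of one conv layer changes as its output channel count is swept.
import torch

def conv_latency_ms(out_channels, in_channels=256, spatial=56,
                    kernel_size=3, warmup=10, repeats=50):
    """Median GPU latency (ms) of one conv layer for a given channel count."""
    device = torch.device("cuda")
    conv = torch.nn.Conv2d(in_channels, out_channels, kernel_size,
                           padding=kernel_size // 2).to(device).eval()
    x = torch.randn(1, in_channels, spatial, spatial, device=device)

    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels / cuDNN autotuning
            conv(x)
        torch.cuda.synchronize()

        times = []
        for _ in range(repeats):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            conv(x)
            end.record()
            torch.cuda.synchronize()     # wait so the event timestamps are valid
            times.append(start.elapsed_time(end))  # milliseconds
    times.sort()
    return times[len(times) // 2]

if __name__ == "__main__":
    # Sweep channel counts; nearby counts often land on the same latency
    # "step", so a pruning search only needs to consider step boundaries.
    for c in range(32, 513, 16):
        print(f"{c:4d} channels -> {conv_latency_ms(c):.3f} ms")
```

Plotting the printed latencies against the channel count typically yields flat plateaus separated by jumps: many adjacent channel counts share one latency step because the GPU schedules work in whole waves of thread blocks (the tail effect). This is what allows the pruning search to consider only the channel counts at step boundaries, greatly shrinking the search space without losing latency resolution.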
