UCC: A unified cascade compression framework for vision transformer models

Neurocomputing (IF 5.5; Zone 2, Computer Science; JCR Q1, Computer Science, Artificial Intelligence) Pub Date: 2024-10-18 DOI: 10.1016/j.neucom.2024.128747 URL: https://www.sciencedirect.com/science/article/pii/S0925231224015182
Dingfu Chen, Kangwei Lin, Qingxu Deng
Citations: 0

Abstract

In recent years, the Vision Transformer (ViT) and its variants have dominated many computer vision tasks. However, ViT's high computational cost and training-data requirements make it difficult to deploy directly on resource-constrained devices and environments. Model compression is an effective way to accelerate deep learning networks, but existing methods for compressing ViT models are limited in scope and struggle to balance performance against computational cost. In this paper, we propose a novel Unified Cascaded Compression Framework (UCC) to compress ViT more precisely and efficiently. Specifically, we first analyze the frequency information within tokens and prune them based on a joint score of both their spatial and spectral characteristics. We then propose a similarity-based token aggregation scheme that merges the rich contextual information contained in all pruned tokens into the host tokens according to their weights. Additionally, we introduce a novel cumulative cascaded pruning strategy that prunes tokens bottom-up based on cumulative scores, avoiding the information loss caused by the idiosyncrasies of individual blocks. Finally, we design a novel two-level distillation strategy, incorporating imitation and exploration, to ensure diversity of knowledge and better performance recovery. Extensive experiments demonstrate that UCC outperforms most existing state-of-the-art approaches: it reduces the floating-point operations of the ViT-Base and DeiT-Base models by 22% and 54.1% and improves their recognition accuracy by 3.74% and 1.89%, respectively. The framework thus significantly reduces computational cost while enhancing performance, enabling efficient end-to-end training of compact ViT models.
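The abstract does not give the formulas behind the joint score or the aggregation weights, so the following is only a minimal sketch of the first two steps (score-based token pruning, then similarity-weighted aggregation into host tokens), not the authors' implementation. Everything specific here is an assumption made for illustration: the function names `joint_scores` and `prune_and_aggregate`, the use of an attention-style importance vector as the spatial score, the non-DC FFT energy share as the spectral score, the mixing weight `alpha`, and cosine similarity as the aggregation weight. Cumulative cascaded pruning and two-level distillation are not covered.

```python
import numpy as np


def joint_scores(tokens, attn_to_cls, alpha=0.5):
    """Blend a spatial and a spectral score per token (illustrative only).

    tokens: (N, D) patch embeddings.
    attn_to_cls: (N,) non-negative importances, assumed to come from attention.
    alpha: hypothetical mixing weight between the two scores.
    """
    # Spatial score: normalized attention-style importance.
    spatial = attn_to_cls / (attn_to_cls.sum() + 1e-12)
    # Spectral score: share of each token's FFT energy outside the DC bin.
    power = np.abs(np.fft.rfft(tokens, axis=1)) ** 2
    spectral = power[:, 1:].sum(axis=1) / (power.sum(axis=1) + 1e-12)
    spectral = spectral / (spectral.sum() + 1e-12)
    return alpha * spatial + (1.0 - alpha) * spectral


def prune_and_aggregate(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and fold every pruned token
    into its most cosine-similar kept (host) token, weighted by similarity."""
    order = np.argsort(-scores, kind="stable")
    kept_idx = np.sort(order[:keep])
    pruned_idx = np.sort(order[keep:])
    hosts = tokens[kept_idx].astype(float).copy()
    if pruned_idx.size == 0:
        return hosts, kept_idx
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    sim = unit[pruned_idx] @ unit[kept_idx].T      # (num_pruned, keep)
    nearest = sim.argmax(axis=1)                   # host index per pruned token
    weight = np.clip(sim.max(axis=1), 0.0, None)   # ignore negative similarity
    total = np.ones(keep)                          # each host counts itself once
    for p, h, w in zip(pruned_idx, nearest, weight):
        hosts[h] += w * tokens[p]
        total[h] += w
    return hosts / total[:, None], kept_idx        # weighted average per host


rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))   # 8 toy tokens with 16 dimensions
attn = rng.random(8)                # stand-in for attention importances
scores = joint_scores(tokens, attn)
merged, kept_idx = prune_and_aggregate(tokens, scores, keep=4)
print(merged.shape, kept_idx)
```

Merging pruned tokens as a weighted average, rather than discarding them, is what lets the kept tokens retain some of the pruned context; the specific weighting used in the paper may differ from the cosine-similarity choice assumed here.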
Source journal
Neurocomputing (Engineering & Technology; Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days
Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.