UCC: A unified cascade compression framework for vision transformer models

Neurocomputing (IF 6.5; CAS Zone 2, Computer Science; JCR Q1, Computer Science, Artificial Intelligence) | Pub Date: 2024-10-18 | DOI: 10.1016/j.neucom.2024.128747

Dingfu Chen, Kangwei Lin, Qingxu Deng

Neurocomputing, Volume 612, Article 128747 (Journal Article). https://www.sciencedirect.com/science/article/pii/S0925231224015182

Citations: 0

Abstract

In recent years, Vision Transformer (ViT) and its variants have dominated many computer vision tasks. However, ViT's high computational cost and large training-data requirements make it difficult to deploy directly on resource-constrained devices and environments. Model compression is an effective way to accelerate deep learning networks, but existing methods for compressing ViT models are limited in scope and struggle to balance performance against computational cost. In this paper, we propose a novel Unified Cascaded Compression Framework (UCC) to compress ViT more precisely and efficiently. Specifically, we first analyze the frequency information within tokens and prune them based on a joint score of their spatial and spectral characteristics. We then propose a similarity-based token aggregation scheme that merges the abundant contextual information contained in all pruned tokens into the host tokens according to their weights. Additionally, we introduce a novel cumulative cascaded pruning strategy that prunes tokens bottom-up based on cumulative scores, avoiding the information loss caused by the idiosyncrasies of individual blocks. Finally, we design a novel two-level distillation strategy, incorporating imitation and exploration, to ensure diversity of knowledge and better performance recovery. Extensive experiments demonstrate that UCC outperforms most existing state-of-the-art approaches: it reduces the floating-point operations of the ViT-Base and DeiT-Base models by 22% and 54.1% and improves their recognition accuracy by 3.74% and 1.89%, respectively. The framework thus significantly reduces computational cost while enhancing performance, enabling efficient end-to-end training of compact ViT models.
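The first two steps of the abstract (joint-score token pruning and similarity-based aggregation) can be illustrated with a minimal NumPy sketch. The abstract does not give the actual formulas, so everything concrete here is an assumption made for illustration: the spectral score is taken as the share of a token's energy outside the zero-frequency bin of an FFT along the channel axis, the spatial score is supplied by the caller (for example, attention received from the class token), `alpha` is a hypothetical mixing weight, and the function names `joint_scores` and `prune_and_aggregate` are invented. It is not the authors' implementation.

```python
import numpy as np

def joint_scores(tokens, spatial, alpha=0.5):
    """Combine a spatial and a spectral importance score per token.

    tokens:  (N, D) token embeddings.
    spatial: (N,) caller-supplied importance (assumed, e.g. CLS attention).
    alpha:   hypothetical mixing weight between the two scores.
    """
    # Assumed spectral score: fraction of energy in non-DC frequency bins.
    power = np.abs(np.fft.rfft(tokens, axis=1)) ** 2
    spectral = power[:, 1:].sum(axis=1) / (power.sum(axis=1) + 1e-12)

    def minmax(x):
        # Rescale to [0, 1] so the two scores are comparable.
        return (x - x.min()) / (x.max() - x.min() + 1e-12)

    return alpha * minmax(spatial) + (1.0 - alpha) * minmax(spectral)

def prune_and_aggregate(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and fold each pruned token
    into its most similar kept (host) token as a score-weighted average."""
    order = np.argsort(-scores)
    kept_idx = np.sort(order[:keep])
    pruned_idx = np.sort(order[keep:])

    # Small offset keeps every weight strictly positive.
    w = scores + 1e-6
    weight_sum = w[kept_idx].copy()
    merged = tokens[kept_idx] * weight_sum[:, None]

    if pruned_idx.size > 0:
        # Cosine similarity between pruned tokens and kept tokens.
        unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
        sim = unit[pruned_idx] @ unit[kept_idx].T          # (P, K)
        host = sim.argmax(axis=1)                          # host per pruned token
        # Accumulate weighted pruned tokens into their hosts.
        np.add.at(merged, host, tokens[pruned_idx] * w[pruned_idx][:, None])
        np.add.at(weight_sum, host, w[pruned_idx])

    return merged / weight_sum[:, None], kept_idx

# Usage: 8 random tokens of width 16, keep the top 3.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
s = joint_scores(x, spatial=rng.random(8))
compact, kept = prune_and_aggregate(x, s, keep=3)
```

Averaging pruned tokens into hosts, rather than discarding them, is one simple way to retain their contextual information. The cumulative cascaded strategy described in the abstract would additionally accumulate scores across blocks before deciding what to prune, which this sketch does not attempt.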


Source journal: Neurocomputing (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.10

Self-citation rate: 10.00%

Articles published: 1382

Review time: 70 days

Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.
