{"title":"UCC: A unified cascade compression framework for vision transformer models","authors":"Dingfu Chen , Kangwei Lin , Qingxu Deng","doi":"10.1016/j.neucom.2024.128747","DOIUrl":null,"url":null,"abstract":"<div><div>In recent years, Vision Transformer (ViT) and its variants have dominated many computer vision tasks. However, the high computational consumption and training data requirements of ViT make it challenging to be deployed directly on resource-constrained devices and environments. Model compression is an effective approach to accelerate deep learning networks, but existing methods for compressing ViT models are limited in their scopes and struggle to strike a balance between performance and computational cost. In this paper, we propose a novel Unified Cascaded Compression Framework (UCC) to compress ViT in a more precise and efficient manner. Specifically, we first analyze the frequency information within tokens and prune them based on a joint score of their both spatial and spectral characteristics. Subsequently, we propose a similarity-based token aggregation scheme that combines the abundant contextual information contained in all pruned tokens with the host tokens according to their weights. Additionally, we introduce a novel cumulative cascaded pruning strategy that performs bottom-up cascaded pruning of tokens based on cumulative scores, avoiding information loss caused by individual idiosyncrasies of blocks. Finally, we design a novel two-level distillation strategy, incorporating imitation and exploration, to ensure the diversity of knowledge and better performance recovery. Extensive experiments demonstrate that our unified cascaded compression framework outperforms most existing state-of-the-art approaches, compresses the floating-point operations of ViT-Base as well as DeiT-Base models by 22 % and 54.1 %, and improves the recognition accuracy of the models by 3.74 % and 1.89 %, respectively, significantly reducing model computational consumption while enhancing performance, which enables efficient end-to-end training of compact ViT models.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":null,"pages":null},"PeriodicalIF":5.5000,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224015182","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
In recent years, Vision Transformer (ViT) and its variants have dominated many computer vision tasks. However, the high computational cost and training data requirements of ViT make it challenging to deploy directly on resource-constrained devices and environments. Model compression is an effective approach to accelerating deep learning networks, but existing methods for compressing ViT models are limited in scope and struggle to strike a balance between performance and computational cost. In this paper, we propose a novel Unified Cascaded Compression Framework (UCC) to compress ViT in a more precise and efficient manner. Specifically, we first analyze the frequency information within tokens and prune them based on a joint score of both their spatial and spectral characteristics. Subsequently, we propose a similarity-based token aggregation scheme that combines the abundant contextual information contained in all pruned tokens with the host tokens according to their weights. Additionally, we introduce a novel cumulative cascaded pruning strategy that performs bottom-up cascaded pruning of tokens based on cumulative scores, avoiding the information loss caused by the individual idiosyncrasies of blocks. Finally, we design a novel two-level distillation strategy, incorporating imitation and exploration, to ensure diversity of knowledge and better performance recovery. Extensive experiments demonstrate that our unified cascaded compression framework outperforms most existing state-of-the-art approaches: it compresses the floating-point operations of the ViT-Base and DeiT-Base models by 22% and 54.1% while improving their recognition accuracy by 3.74% and 1.89%, respectively, significantly reducing computational consumption while enhancing performance and enabling efficient end-to-end training of compact ViT models.
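The abstract describes the pruning and aggregation steps only at a high level. Below is a minimal PyTorch sketch of one plausible reading: a spatial score taken from CLS attention, a spectral score taken from each token's high-frequency FFT energy, a convex joint score, top-k pruning, and cosine-similarity-weighted absorption of pruned tokens into the kept "host" tokens. The two scoring proxies, the mixing weight `alpha`, and the function name `prune_and_aggregate` are illustrative assumptions, not the paper's definitions.

```python
# Hedged sketch of joint spatial-spectral token scoring with
# similarity-based aggregation of pruned tokens. The scoring functions
# and hyperparameters below are assumptions, not UCC's exact formulas.
import torch
import torch.nn.functional as F

def prune_and_aggregate(tokens, cls_attn, keep_ratio=0.7, alpha=0.5):
    """tokens: (B, N, D) patch tokens (CLS excluded).
    cls_attn: (B, N) attention each patch token receives from CLS."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))

    # Spatial score: attention mass from the CLS token (assumed proxy).
    spatial = cls_attn

    # Spectral score: high-frequency energy of each token's features
    # under a 1-D FFT along the channel dimension (assumed proxy).
    spec = torch.fft.rfft(tokens, dim=-1).abs()          # (B, N, D//2+1)
    spectral = spec[..., spec.shape[-1] // 2:].sum(-1)   # (B, N)

    # Joint score: convex combination of min-max-normalized scores.
    def norm(x):
        lo, hi = x.amin(1, keepdim=True), x.amax(1, keepdim=True)
        return (x - lo) / (hi - lo + 1e-6)
    score = alpha * norm(spatial) + (1 - alpha) * norm(spectral)

    # Keep the top-scoring tokens as hosts; the rest are pruned.
    idx = score.argsort(dim=1, descending=True)
    keep_idx, drop_idx = idx[:, :n_keep], idx[:, n_keep:]
    gather = lambda t, i: t.gather(1, i.unsqueeze(-1).expand(-1, -1, D))
    kept, dropped = gather(tokens, keep_idx), gather(tokens, drop_idx)
    if drop_idx.shape[1] == 0:
        return kept

    # Similarity-based aggregation: each pruned token is absorbed into
    # the host tokens, weighted by softmaxed cosine similarity, so the
    # contextual information of pruned tokens is not simply discarded.
    sim = F.cosine_similarity(dropped.unsqueeze(2), kept.unsqueeze(1), dim=-1)
    w = sim.softmax(dim=-1)                              # (B, N-n_keep, n_keep)
    kept = kept + torch.einsum('bdk,bdc->bkc', w, dropped) / drop_idx.shape[1]
    return kept

if __name__ == "__main__":
    x = torch.randn(2, 196, 384)                # e.g. DeiT-Small patch tokens
    attn = torch.rand(2, 196).softmax(dim=-1)   # CLS-to-patch attention weights
    print(prune_and_aggregate(x, attn).shape)   # torch.Size([2, 137, 384])
```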
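Likewise, the abstract names "imitation" and "exploration" as the two levels of distillation but does not define them. The sketch below renders imitation as temperature-scaled KL divergence to the teacher's logits and exploration as cross-entropy on ground-truth labels, a common two-term recipe (cf. DeiT-style distillation); the loss weight `lam` and temperature `tau` are assumptions, not values from the paper.

```python
# Hedged sketch of a two-level distillation loss: "imitation" follows the
# teacher's softened predictions, while "exploration" learns directly from
# hard labels, letting the student deviate from the teacher where the data
# demands. This is one plausible reading, not UCC's exact objective.
import torch
import torch.nn.functional as F

def two_level_distill_loss(student_logits, teacher_logits, labels,
                           tau=2.0, lam=0.5):
    # Level 1 -- imitation: match the teacher's softened distribution.
    imitation = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction='batchmean') * tau * tau
    # Level 2 -- exploration: standard supervision from ground truth.
    exploration = F.cross_entropy(student_logits, labels)
    return lam * imitation + (1 - lam) * exploration
```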
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.