PAdapter: Adapter combined with prompt for image and video classification

Image and Vision Computing | IF 4.2 | Zone 3, Computer Science | JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-01 | Epub Date: 2024-12-18 | DOI: 10.1016/j.imavis.2024.105395

Youwei Li, Junyong Ye, Xubin Wen, Guangyi Xu, Jingjing Wang, Xinyuan Liu

Image and Vision Computing, vol. 154, Article 105395 (Journal Article). Source: https://www.sciencedirect.com/science/article/pii/S0262885624005006

Citations: 0

Abstract

In computer vision, parameter-efficient transfer learning is now widely used. The Adapter is one of its most common building blocks, and its simplicity and efficiency are well established. With the network backbone frozen, fine-tuning only the added adapters often matches or even surpasses full fine-tuning at lower computational cost. However, the Adapter's bottleneck structure causes non-negligible information loss, which limits its performance. To alleviate this problem, this work proposes PAdapter, a lightweight plug-and-play module that combines a Prompt with an Adapter to achieve parameter-efficient transfer learning on image classification and video action recognition. PAdapter builds on the Adapter and introduces a Prompt at the bottleneck to supplement information that may be lost. Specifically, within the Adapter's bottleneck, we concatenate a learnable Prompt with the bottleneck features at dimension D, which supplements information and can even enhance the visual expressiveness of the bottleneck features. Extensive experiments on image classification and video action recognition show that PAdapter matches or exceeds the accuracy of fully fine-tuned models while updating less than 2% extra parameters. For example, on the SSv2 and HMDB-51 datasets, PAdapter improves accuracy by 5.49% and 16.68%, respectively, compared to full fine-tuning. In almost all experiments, PAdapter also achieves higher accuracy than Adapter with a similar number of tunable parameters. Code is available at https://github.com/owlholy/PAdapter.
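The bottleneck mechanism described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation (see their repository for that). It assumes that "at dimension D" means concatenation along the feature axis, that one prompt vector is shared by all tokens, and that the module uses a ReLU, a residual connection, and a zero-initialised up-projection, as is common in adapter designs. The names `PAdapterSketch`, `bottleneck`, and `prompt_len` are hypothetical.

```python
import random


def linear(x, weight, bias):
    """Return W x + b for one feature vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def random_matrix(rows, cols, rng, scale=0.02):
    """Small Gaussian initialisation (an assumed, conventional choice)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class PAdapterSketch:
    """Hypothetical bottleneck adapter with a learnable prompt joined at the bottleneck."""

    def __init__(self, dim, bottleneck, prompt_len, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> bottleneck (the step that may lose information).
        self.down_w = random_matrix(bottleneck, dim, rng)
        self.down_b = [0.0] * bottleneck
        # Learnable prompt, shared by every token (assumption).
        self.prompt = [rng.gauss(0.0, 0.02) for _ in range(prompt_len)]
        # Up-projection: (bottleneck + prompt_len) -> dim.
        # Zero initialisation makes the module start as an identity map (assumption).
        self.up_w = [[0.0] * (bottleneck + prompt_len) for _ in range(dim)]
        self.up_b = [0.0] * dim

    def forward(self, tokens):
        """tokens: list of feature vectors, each of length dim."""
        outputs = []
        for x in tokens:
            # Compress the token and apply ReLU.
            hidden = [max(0.0, v) for v in linear(x, self.down_w, self.down_b)]
            # Concatenate the prompt with the bottleneck features (feature axis).
            hidden = hidden + self.prompt
            # Expand back to dim and add the residual connection.
            delta = linear(hidden, self.up_w, self.up_b)
            outputs.append([a + d for a, d in zip(x, delta)])
        return outputs

    def num_parameters(self):
        """Count tunable values: both projections, their biases, and the prompt."""
        weights = sum(len(row) for row in self.down_w + self.up_w)
        biases = len(self.down_b) + len(self.up_b)
        return weights + biases + len(self.prompt)


if __name__ == "__main__":
    adapter = PAdapterSketch(dim=8, bottleneck=2, prompt_len=3)
    tokens = [[0.1 * (i + j) for j in range(8)] for i in range(4)]
    print(len(adapter.forward(tokens)), adapter.num_parameters())
```

Under this reading, the prompt adds only `prompt_len` values plus `dim * prompt_len` extra up-projection weights, so the tunable-parameter count stays close to that of a plain Adapter with the same bottleneck width.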



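Source journal: Image and Vision Computing (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 8.50
Self-citation rate: 8.50%
Articles published: 143
Review time: 7.8 months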

Journal description: Image and Vision Computing aims primarily to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.