Progressive Visual Prompt Learning with Contrastive Feature Re-formation

IF 11.6 · Zone 2 (Computer Science) · Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · International Journal of Computer Vision · Pub Date: 2024-08-06 · DOI: 10.1007/s11263-024-02172-x
Chen Xu, Yuhan Zhu, Haocheng Shen, Boheng Chen, Yixuan Liao, Xiaoxin Chen, Limin Wang
Citations: 0

Abstract

Prompt learning has recently emerged as a compelling alternative to the traditional fine-tuning paradigm for adapting pre-trained Vision-Language (V-L) models to downstream tasks. Drawing inspiration from the success of prompt learning in Natural Language Processing, pioneering research efforts have concentrated predominantly on text-based prompting strategies. By contrast, visual prompting within V-L models remains underexploited. Directly transposing existing visual prompt methods, which were tailored for Vision Transformers (ViT), into V-L models often leads to suboptimal performance or training instability. To mitigate these challenges, we propose a novel structure called Progressive Visual Prompt (ProVP). This design strengthens the interaction among prompts from adjacent layers, thereby propagating image embeddings to deeper layers more effectively and in an instance-specific manner. Additionally, to address the common problem that generalization deteriorates while learnable prompts are being trained, we introduce a contrastive feature re-formation technique for visual prompt learning. This method prevents the prompted visual features from deviating significantly from the fixed CLIP visual feature distribution, which helps preserve generalization capability. Combining ProVP with contrastive feature re-formation, our proposed method, ProVP-Ref, significantly stabilizes the training process and enhances both the adaptation and generalization capabilities of visual prompt learning in V-L models. To demonstrate the efficacy of our approach, we evaluate ProVP-Ref on 11 image datasets, achieving state-of-the-art results on 7 of them in both few-shot learning and base-to-new generalization settings. To the best of our knowledge, this is the first study to show that visual prompts in V-L models can outperform previous text prompting methods in this area.
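
The abstract names two mechanisms without giving their formulas, so the following is only a rough sketch of how they might look, not the authors' implementation. It assumes (hypothetically) that a progressive prompt is formed by adding a scaled copy of the previous layer's prompt output to the current layer's learnable prompt, and that feature re-formation can be approximated by an InfoNCE-style loss in which each prompted feature is pulled toward the frozen feature of the same image and pushed away from the frozen features of other images. The names `progressive_prompt`, `reformation_loss`, `alpha`, and `temperature` are illustrative and do not come from the paper.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def progressive_prompt(learnable, prev_output, alpha=0.1):
    """Assumed mixing rule: the prompt fed to layer i is that layer's
    learnable prompt plus a scaled copy of layer i-1's prompt output."""
    return [p + alpha * o for p, o in zip(learnable, prev_output)]


def reformation_loss(prompted, frozen, temperature=0.07):
    """Assumed InfoNCE-style loss. For image i, frozen[i] is the positive
    and every frozen[j], j != i, is a negative."""
    prompted = [l2_normalize(v) for v in prompted]
    frozen = [l2_normalize(v) for v in frozen]
    total = 0.0
    for i, p in enumerate(prompted):
        logits = [dot(p, f) / temperature for f in frozen]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(prompted)


if __name__ == "__main__":
    frozen = [[1.0, 0.0], [0.0, 1.0]]
    aligned = reformation_loss(frozen, frozen)
    drifted = reformation_loss([[0.0, 1.0], [1.0, 0.0]], frozen)
    print(aligned < drifted)
```

Under these assumptions the loss is small when prompted features stay close to their frozen counterparts and grows as they drift toward other images' features, which is the behaviour the abstract attributes to feature re-formation.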

Source journal: International Journal of Computer Vision
Category: Engineering and Technology, Computer Science: Artificial Intelligence
CiteScore: 29.80
Self-citation rate: 2.10%
Articles published: 163
Review time: 6 months
Journal description: The International Journal of Computer Vision (IJCV) is a venue for new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. Regular articles, of up to 25 journal pages, report significant technical advances of broad interest to the field. Short articles, limited to 10 pages, offer a faster publication path for novel research results. Survey articles, of up to 30 pages, give critical evaluations of the state of the art or tutorial presentations of relevant topics. In addition to technical articles, the journal publishes book reviews, position papers, and editorials by prominent scientific figures. Authors are encouraged to provide supplementary material online, such as images, video sequences, data sets, and software, to improve the understanding and reproducibility of the published research.