Mixture of Prompt Learning for Vision Language Models

Yu Du, Tong Niu, Rong Zhao
{"title":"Mixture of Prompt Learning for Vision Language Models","authors":"Yu Du, Tong Niu, Rong Zhao","doi":"arxiv-2409.12011","DOIUrl":null,"url":null,"abstract":"As powerful pre-trained vision-language models (VLMs) like CLIP gain\nprominence, numerous studies have attempted to combine VLMs for downstream\ntasks. Among these, prompt learning has been validated as an effective method\nfor adapting to new tasks, which only requiring a small number of parameters.\nHowever, current prompt learning methods face two challenges: first, a single\nsoft prompt struggles to capture the diverse styles and patterns within a\ndataset; second, fine-tuning soft prompts is prone to overfitting. To address\nthese challenges, we propose a mixture of soft prompt learning method\nincorporating a routing module. This module is able to capture a dataset's\nvaried styles and dynamically selects the most suitable prompts for each\ninstance. Additionally, we introduce a novel gating mechanism to ensure the\nrouter selects prompts based on their similarity to hard prompt templates,\nwhich both retaining knowledge from hard prompts and improving selection\naccuracy. We also implement semantically grouped text-level supervision,\ninitializing each soft prompt with the token embeddings of manually designed\ntemplates from its group and applied a contrastive loss between the resulted\ntext feature and hard prompt encoded text feature. This supervision ensures\nthat the text features derived from soft prompts remain close to those from\ntheir corresponding hard prompts, preserving initial knowledge and mitigating\noverfitting. Our method has been validated on 11 datasets, demonstrating\nevident improvements in few-shot learning, domain generalization, and\nbase-to-new generalization scenarios compared to existing baselines. The code\nwill be available at \\url{https://anonymous.4open.science/r/mocoop-6387}","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12011","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to adapt them to downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks while requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture-of-soft-prompts learning method incorporating a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism that makes the router select prompts based on their similarity to hard prompt templates, which both retains knowledge from hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision: each soft prompt is initialized with the token embeddings of manually designed templates from its group, and a contrastive loss is applied between the resulting text features and the hard-prompt-encoded text features. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating clear improvements over existing baselines in few-shot learning, domain generalization, and base-to-new generalization. The code will be available at \url{https://anonymous.4open.science/r/mocoop-6387}
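The sketch below illustrates the two ideas the abstract describes: a router that scores each image against keys derived from hard prompt templates and gates a mixture of soft prompts per instance, and a contrastive loss that pulls each soft-prompt text feature toward its corresponding hard-prompt text feature. This is a minimal PyTorch-style sketch based only on the abstract, not the authors' released code; the class and parameter names (`MixtureOfSoftPrompts`, `hard_prompt_embeds`, `top_k`, the temperature value) are hypothetical, and the random initialization of the soft prompts stands in for the paper's group-wise initialization from hard-prompt token embeddings.

```python
# Illustrative sketch (assumed interfaces, not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfSoftPrompts(nn.Module):
    def __init__(self, num_prompts, prompt_len, embed_dim, hard_prompt_embeds, top_k=2):
        """
        hard_prompt_embeds: (num_prompts, embed_dim) text features of the manually
        designed hard prompt templates, assumed precomputed with the frozen CLIP
        text encoder, one per semantic group.
        """
        super().__init__()
        # Learnable soft prompts; the paper initializes each from its group's
        # hard-prompt token embeddings, random init is used here for brevity.
        self.soft_prompts = nn.Parameter(torch.randn(num_prompts, prompt_len, embed_dim) * 0.02)
        # Frozen keys derived from hard prompts: routing scores are similarities
        # to these keys, so selection reflects closeness to hard prompt templates.
        self.register_buffer("hard_keys", F.normalize(hard_prompt_embeds, dim=-1))
        self.top_k = top_k

    def route(self, image_features):
        # image_features: (B, embed_dim) from the frozen CLIP image encoder.
        sim = F.normalize(image_features, dim=-1) @ self.hard_keys.t()   # (B, num_prompts)
        topk_sim, topk_idx = sim.topk(self.top_k, dim=-1)
        gates = topk_sim.softmax(dim=-1)                                  # (B, top_k)
        return gates, topk_idx

    def forward(self, image_features):
        gates, idx = self.route(image_features)                           # (B, k), (B, k)
        selected = self.soft_prompts[idx]                                 # (B, k, L, D)
        # Per-instance weighted mixture of the selected soft prompts.
        mixed = (gates[..., None, None] * selected).sum(dim=1)            # (B, L, D)
        return mixed, gates, idx


def text_supervision_loss(soft_text_feats, hard_text_feats, temperature=0.07):
    """Contrastive loss pulling each soft-prompt text feature toward the text
    feature of its corresponding hard prompt (matching index = positive pair)."""
    soft = F.normalize(soft_text_feats, dim=-1)
    hard = F.normalize(hard_text_feats, dim=-1)
    logits = soft @ hard.t() / temperature                                # (N, N)
    targets = torch.arange(soft.size(0), device=soft.device)
    return F.cross_entropy(logits, targets)
```

Under these assumptions, the mixed prompt returned by `forward` would be prepended to the class-name token embeddings before the CLIP text encoder, and `text_supervision_loss` would be added to the usual image-text classification objective to keep soft-prompt features anchored to their hard-prompt counterparts.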