Mixture of Prompt Learning for Vision Language Models

Yu Du, Tong Niu, Rong Zhao
{"title":"Mixture of Prompt Learning for Vision Language Models","authors":"Yu Du, Tong Niu, Rong Zhao","doi":"arxiv-2409.12011","DOIUrl":null,"url":null,"abstract":"As powerful pre-trained vision-language models (VLMs) like CLIP gain\nprominence, numerous studies have attempted to combine VLMs for downstream\ntasks. Among these, prompt learning has been validated as an effective method\nfor adapting to new tasks, which only requiring a small number of parameters.\nHowever, current prompt learning methods face two challenges: first, a single\nsoft prompt struggles to capture the diverse styles and patterns within a\ndataset; second, fine-tuning soft prompts is prone to overfitting. To address\nthese challenges, we propose a mixture of soft prompt learning method\nincorporating a routing module. This module is able to capture a dataset's\nvaried styles and dynamically selects the most suitable prompts for each\ninstance. Additionally, we introduce a novel gating mechanism to ensure the\nrouter selects prompts based on their similarity to hard prompt templates,\nwhich both retaining knowledge from hard prompts and improving selection\naccuracy. We also implement semantically grouped text-level supervision,\ninitializing each soft prompt with the token embeddings of manually designed\ntemplates from its group and applied a contrastive loss between the resulted\ntext feature and hard prompt encoded text feature. This supervision ensures\nthat the text features derived from soft prompts remain close to those from\ntheir corresponding hard prompts, preserving initial knowledge and mitigating\noverfitting. Our method has been validated on 11 datasets, demonstrating\nevident improvements in few-shot learning, domain generalization, and\nbase-to-new generalization scenarios compared to existing baselines. The code\nwill be available at \\url{https://anonymous.4open.science/r/mocoop-6387}","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12011","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to adapt them to downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks while requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture-of-soft-prompts learning method incorporating a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism that makes the router select prompts based on their similarity to hard prompt templates, which both retains knowledge from hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision: each soft prompt is initialized with the token embeddings of manually designed templates from its group, and a contrastive loss is applied between the resulting text features and the hard-prompt-encoded text features. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating clear improvements over existing baselines in few-shot learning, domain generalization, and base-to-new generalization. The code will be available at \url{https://anonymous.4open.science/r/mocoop-6387}
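The sketch below illustrates the two ideas the abstract describes: a router that scores each image against keys derived from hard prompt templates and gates a mixture of soft prompts per instance, and a contrastive loss that pulls each soft-prompt text feature toward its corresponding hard-prompt text feature. This is a minimal PyTorch-style sketch based only on the abstract, not the authors' released code; the class and parameter names (`MixtureOfSoftPrompts`, `hard_prompt_embeds`, `top_k`, the temperature value) are hypothetical, and the random initialization of the soft prompts stands in for the paper's group-wise initialization from hard-prompt token embeddings.

```python
# Illustrative sketch (assumed interfaces, not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfSoftPrompts(nn.Module):
    def __init__(self, num_prompts, prompt_len, embed_dim, hard_prompt_embeds, top_k=2):
        """
        hard_prompt_embeds: (num_prompts, embed_dim) text features of the manually
        designed hard prompt templates, assumed precomputed with the frozen CLIP
        text encoder, one per semantic group.
        """
        super().__init__()
        # Learnable soft prompts; the paper initializes each from its group's
        # hard-prompt token embeddings, random init is used here for brevity.
        self.soft_prompts = nn.Parameter(torch.randn(num_prompts, prompt_len, embed_dim) * 0.02)
        # Frozen keys derived from hard prompts: routing scores are similarities
        # to these keys, so selection reflects closeness to hard prompt templates.
        self.register_buffer("hard_keys", F.normalize(hard_prompt_embeds, dim=-1))
        self.top_k = top_k

    def route(self, image_features):
        # image_features: (B, embed_dim) from the frozen CLIP image encoder.
        sim = F.normalize(image_features, dim=-1) @ self.hard_keys.t()   # (B, num_prompts)
        topk_sim, topk_idx = sim.topk(self.top_k, dim=-1)
        gates = topk_sim.softmax(dim=-1)                                  # (B, top_k)
        return gates, topk_idx

    def forward(self, image_features):
        gates, idx = self.route(image_features)                           # (B, k), (B, k)
        selected = self.soft_prompts[idx]                                 # (B, k, L, D)
        # Per-instance weighted mixture of the selected soft prompts.
        mixed = (gates[..., None, None] * selected).sum(dim=1)            # (B, L, D)
        return mixed, gates, idx


def text_supervision_loss(soft_text_feats, hard_text_feats, temperature=0.07):
    """Contrastive loss pulling each soft-prompt text feature toward the text
    feature of its corresponding hard prompt (matching index = positive pair)."""
    soft = F.normalize(soft_text_feats, dim=-1)
    hard = F.normalize(hard_text_feats, dim=-1)
    logits = soft @ hard.t() / temperature                                # (N, N)
    targets = torch.arange(soft.size(0), device=soft.device)
    return F.cross_entropy(logits, targets)
```

Under these assumptions, the mixed prompt returned by `forward` would be prepended to the class-name token embeddings before the CLIP text encoder, and `text_supervision_loss` would be added to the usual image-text classification objective to keep soft-prompt features anchored to their hard-prompt counterparts.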