Ruiting Dai, Yuqiao Tan, Lisi Mo, Tao He, Ke Qin, Shuang Liang
{"title":"MuAP:缺失模态视觉语言模型的多步自适应提示学习","authors":"Ruiting Dai, Yuqiao Tan, Lisi Mo, Tao He, Ke Qin, Shuang Liang","doi":"arxiv-2409.04693","DOIUrl":null,"url":null,"abstract":"Recently, prompt learning has garnered considerable attention for its success\nin various Vision-Language (VL) tasks. However, existing prompt-based models\nare primarily focused on studying prompt generation and prompt strategies with\ncomplete modality settings, which does not accurately reflect real-world\nscenarios where partial modality information may be missing. In this paper, we\npresent the first comprehensive investigation into prompt learning behavior\nwhen modalities are incomplete, revealing the high sensitivity of prompt-based\nmodels to missing modalities. To this end, we propose a novel Multi-step\nAdaptive Prompt Learning (MuAP) framework, aiming to generate multimodal\nprompts and perform multi-step prompt tuning, which adaptively learns knowledge\nby iteratively aligning modalities. Specifically, we generate multimodal\nprompts for each modality and devise prompt strategies to integrate them into\nthe Transformer model. Subsequently, we sequentially perform prompt tuning from\nsingle-stage and alignment-stage, allowing each modality-prompt to be\nautonomously and adaptively learned, thereby mitigating the imbalance issue\ncaused by only textual prompts that are learnable in previous works. Extensive\nexperiments demonstrate the effectiveness of our MuAP and this model achieves\nsignificant improvements compared to the state-of-the-art on all benchmark\ndatasets","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"37 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality\",\"authors\":\"Ruiting Dai, Yuqiao Tan, Lisi Mo, Tao He, Ke Qin, Shuang Liang\",\"doi\":\"arxiv-2409.04693\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, prompt learning has garnered considerable attention for its success\\nin various Vision-Language (VL) tasks. However, existing prompt-based models\\nare primarily focused on studying prompt generation and prompt strategies with\\ncomplete modality settings, which does not accurately reflect real-world\\nscenarios where partial modality information may be missing. In this paper, we\\npresent the first comprehensive investigation into prompt learning behavior\\nwhen modalities are incomplete, revealing the high sensitivity of prompt-based\\nmodels to missing modalities. To this end, we propose a novel Multi-step\\nAdaptive Prompt Learning (MuAP) framework, aiming to generate multimodal\\nprompts and perform multi-step prompt tuning, which adaptively learns knowledge\\nby iteratively aligning modalities. Specifically, we generate multimodal\\nprompts for each modality and devise prompt strategies to integrate them into\\nthe Transformer model. Subsequently, we sequentially perform prompt tuning from\\nsingle-stage and alignment-stage, allowing each modality-prompt to be\\nautonomously and adaptively learned, thereby mitigating the imbalance issue\\ncaused by only textual prompts that are learnable in previous works. 
Extensive\\nexperiments demonstrate the effectiveness of our MuAP and this model achieves\\nsignificant improvements compared to the state-of-the-art on all benchmark\\ndatasets\",\"PeriodicalId\":501479,\"journal\":{\"name\":\"arXiv - CS - Artificial Intelligence\",\"volume\":\"37 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.04693\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.04693","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality
Recently, prompt learning has garnered considerable attention for its success in various Vision-Language (VL) tasks. However, existing prompt-based models focus primarily on prompt generation and prompting strategies under complete-modality settings, which does not accurately reflect real-world scenarios where partial modality information may be missing. In this paper, we present the first comprehensive investigation into prompt learning behavior when modalities are incomplete, revealing the high sensitivity of prompt-based models to missing modalities. To this end, we propose a novel Multi-step Adaptive Prompt Learning (MuAP) framework that generates multimodal prompts and performs multi-step prompt tuning, adaptively learning knowledge by iteratively aligning modalities. Specifically, we generate prompts for each modality and devise prompting strategies to integrate them into the Transformer model. Subsequently, we perform prompt tuning sequentially, first in a single-modality stage and then in an alignment stage, allowing each modality prompt to be learned autonomously and adaptively; this mitigates the imbalance caused by earlier works in which only textual prompts are learnable. Extensive experiments demonstrate the effectiveness of MuAP, which achieves significant improvements over the state-of-the-art on all benchmark datasets.
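To make the single-stage/alignment-stage schedule concrete, below is a minimal PyTorch sketch of per-modality learnable prompts prepended to a frozen VL Transformer's token sequence, tuned one modality at a time and then jointly. Everything here (the module and function names, the prompt length, the optimizer settings, and the frozen fused-token backbone) is an illustrative assumption, not the authors' released implementation.

```python
# Minimal sketch: per-modality prompt tokens plus a multi-step tuning
# schedule (single stage per modality, then an alignment stage).
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalPrompts(nn.Module):
    """Learnable prompt tokens for each modality, prepended to the fused
    token sequence of a frozen Vision-Language Transformer. When a modality
    is missing, its prompt can still stand in for that modality's signal."""

    def __init__(self, dim: int = 768, prompt_len: int = 16):
        super().__init__()
        self.text_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.image_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, fused_tokens: torch.Tensor) -> torch.Tensor:
        # fused_tokens: (batch, seq_len, dim) from the text/image embedders.
        b = fused_tokens.size(0)
        prompts = torch.cat([self.text_prompt, self.image_prompt], dim=0)
        prompts = prompts.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prompts, fused_tokens], dim=1)

def multi_step_tuning(prompts, backbone, head, loader, steps_per_stage=1000):
    """Stage 1: tune each modality's prompt in isolation so it is learned
    autonomously. Stage 2: tune both prompts jointly to align modalities.
    The Transformer backbone stays frozen throughout."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    stages = [
        [prompts.text_prompt],                        # single stage: text
        [prompts.image_prompt],                       # single stage: image
        [prompts.text_prompt, prompts.image_prompt],  # alignment stage
    ]
    for stage_params in stages:
        opt = torch.optim.AdamW(stage_params + list(head.parameters()), lr=1e-3)
        for _, (tokens, labels) in zip(range(steps_per_stage), loader):
            logits = head(backbone(prompts(tokens)))
            loss = nn.functional.cross_entropy(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Note that in each single stage only one modality's prompt is passed to the optimizer, so the other prompt is present in the forward pass but does not update; this is one plausible reading of how per-modality prompts could avoid the text-only learnability imbalance the abstract describes.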