{"title":"MCPL: Multi-modal Collaborative Prompt Learning for Medical Vision-Language Model.","authors":"Pengyu Wang, Huaqi Zhang, Yixuan Yuan","doi":"10.1109/TMI.2024.3418408","DOIUrl":null,"url":null,"abstract":"<p><p>Multi-modal prompt learning is a high-performance and cost-effective learning paradigm, which learns text as well as image prompts to tune pre-trained vision-language (V-L) models like CLIP for adapting multiple downstream tasks. However, recent methods typically treat text and image prompts as independent components without considering the dependency between prompts. Moreover, extending multi-modal prompt learning into the medical field poses challenges due to a significant gap between general- and medical-domain data. To this end, we propose a Multi-modal Collaborative Prompt Learning (MCPL) pipeline to tune a frozen V-L model for aligning medical text-image representations, thereby achieving medical downstream tasks. We first construct the anatomy-pathology (AP) prompt for multi-modal prompting jointly with text and image prompts. The AP prompt introduces instance-level anatomy and pathology information, thereby making a V-L model better comprehend medical reports and images. Next, we propose graph-guided prompt collaboration module (GPCM), which explicitly establishes multi-way couplings between the AP, text, and image prompts, enabling collaborative multi-modal prompt producing and updating for more effective prompting. Finally, we develop a novel prompt configuration scheme, which attaches the AP prompt to the query and key, and the text/image prompt to the value in self-attention layers for improving the interpretability of multi-modal prompts. Extensive experiments on numerous medical classification and object detection datasets show that the proposed pipeline achieves excellent effectiveness and generalization. Compared with state-of-the-art prompt learning methods, MCPL provides a more reliable multi-modal prompt paradigm for reducing tuning costs of V-L models on medical downstream tasks. Our code: https://github.com/CUHK-AIM-Group/MCPL.</p>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"PP ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on medical imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TMI.2024.3418408","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Multi-modal prompt learning is a high-performance and cost-effective learning paradigm, which learns text as well as image prompts to tune pre-trained vision-language (V-L) models like CLIP for adapting multiple downstream tasks. However, recent methods typically treat text and image prompts as independent components without considering the dependency between prompts. Moreover, extending multi-modal prompt learning into the medical field poses challenges due to a significant gap between general- and medical-domain data. To this end, we propose a Multi-modal Collaborative Prompt Learning (MCPL) pipeline to tune a frozen V-L model for aligning medical text-image representations, thereby achieving medical downstream tasks. We first construct the anatomy-pathology (AP) prompt for multi-modal prompting jointly with text and image prompts. The AP prompt introduces instance-level anatomy and pathology information, thereby making a V-L model better comprehend medical reports and images. Next, we propose graph-guided prompt collaboration module (GPCM), which explicitly establishes multi-way couplings between the AP, text, and image prompts, enabling collaborative multi-modal prompt producing and updating for more effective prompting. Finally, we develop a novel prompt configuration scheme, which attaches the AP prompt to the query and key, and the text/image prompt to the value in self-attention layers for improving the interpretability of multi-modal prompts. Extensive experiments on numerous medical classification and object detection datasets show that the proposed pipeline achieves excellent effectiveness and generalization. Compared with state-of-the-art prompt learning methods, MCPL provides a more reliable multi-modal prompt paradigm for reducing tuning costs of V-L models on medical downstream tasks. Our code: https://github.com/CUHK-AIM-Group/MCPL.