MCPL: Multi-Modal Collaborative Prompt Learning for Medical Vision-Language Model

IEEE Transactions on Medical Imaging | Pub Date: 2024-06-24 | DOI: 10.1109/TMI.2024.3418408

Pengyu Wang;Huaqi Zhang;Yixuan Yuan

Vol. 43, no. 12, pp. 4224-4235. IEEE Xplore: https://ieeexplore.ieee.org/document/10570257/

Citations: 0

Abstract


Multi-modal prompt learning is a high-performance and cost-effective learning paradigm that learns text and image prompts to tune pre-trained vision-language (V-L) models such as CLIP for multiple downstream tasks. However, recent methods typically treat text and image prompts as independent components, without considering the dependency between them. Moreover, extending multi-modal prompt learning to the medical field is challenging because of the significant gap between general-domain and medical-domain data. To this end, we propose a Multi-modal Collaborative Prompt Learning (MCPL) pipeline that tunes a frozen V-L model to align medical text-image representations for medical downstream tasks. We first construct the anatomy-pathology (AP) prompt, which is used for multi-modal prompting jointly with the text and image prompts. The AP prompt introduces instance-level anatomy and pathology information, helping a V-L model better comprehend medical reports and images. Next, we propose a graph-guided prompt collaboration module (GPCM), which explicitly establishes multi-way couplings between the AP, text, and image prompts, so that the multi-modal prompts are produced and updated collaboratively for more effective prompting. Finally, we develop a novel prompt configuration scheme, which attaches the AP prompt to the query and key, and the text/image prompt to the value, in self-attention layers to improve the interpretability of multi-modal prompts. Extensive experiments on numerous medical classification and object detection datasets show that the proposed pipeline achieves excellent effectiveness and generalization. Compared with state-of-the-art prompt learning methods, MCPL provides a more reliable multi-modal prompt paradigm for reducing the tuning costs of V-L models on medical downstream tasks. Our code: https://github.com/CUHK-AIM-Group/MCPL
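The prompt configuration scheme mentioned in the abstract (AP prompt on the query and key, text/image prompt on the value) can be illustrated with a toy single-head attention layer. This is only a sketch under assumptions and is not taken from the MCPL repository: the abstract says the prompts are "attached" without saying how, so the code assumes concatenation along the token axis, and the names `prompted_self_attention`, `ap_prompt`, and `modality_prompt` are hypothetical. Under that assumption both prompts need the same number of tokens, since the key and value sequences must have equal length.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(x, w):
    # x: n x d_in token rows, w: d_in x d_out weight matrix
    cols = list(zip(*w))
    return [[dot(row, col) for col in cols] for row in x]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def prompted_self_attention(x, w_q, w_k, w_v, ap_prompt, modality_prompt):
    """Single-head self-attention with prompts appended as extra tokens.

    Assumption (not from the paper): the AP prompt is concatenated to the
    query and key sequences, the text/image prompt to the value sequence.
    """
    if len(ap_prompt) != len(modality_prompt):
        raise ValueError("prompts must contain the same number of tokens")
    q = project(x, w_q) + ap_prompt        # list concatenation: extra rows
    k = project(x, w_k) + ap_prompt
    v = project(x, w_v) + modality_prompt
    scale = math.sqrt(len(q[0]))
    # Attention weights depend only on Q and K, i.e. on the AP prompt.
    weights = [softmax([dot(qi, kj) / scale for kj in k]) for qi in q]
    # The mixed content comes from V, i.e. from the text/image prompt.
    value_cols = list(zip(*v))
    output = [[dot(w_row, col) for col in value_cols] for w_row in weights]
    return output, weights


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, dim, n_prompt = 3, 4, 2          # toy sizes, chosen arbitrarily
x = random_matrix(n_tokens, dim, rng)
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))
ap_prompt = random_matrix(n_prompt, dim, rng)
text_prompt = random_matrix(n_prompt, dim, rng)
output, weights = prompted_self_attention(x, w_q, w_k, w_v, ap_prompt, text_prompt)
print(len(output), len(output[0]))         # 5 4
```

In this toy setup the attention map is shaped by the AP prompt while the text or image prompt only contributes to the content being mixed, which is one way to read the interpretability argument in the abstract. The repository linked above is the place to check how the authors actually implement it.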

