MedCLIP: Contrastive Learning from Unpaired Medical Images and Text.

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun
{"title":"MedCLIP: Contrastive Learning from Unpaired Medical Images and Text.","authors":"Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun","doi":"10.18653/v1/2022.emnlp-main.256","DOIUrl":null,"url":null,"abstract":"<p><p>Existing vision-text contrastive learning like CLIP (Radford et al., 2021) aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude below the general images and captions from the internet. Moreover, previous methods encounter many false negatives, i.e., images and reports from separate patients probably carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning thus scaling the usable training data in a combinatorial magnitude with low cost. We also propose to replace the InfoNCE loss with semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We prove that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP wins over the state-of-the-art method (using ≈200K data).</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2022 ","pages":"3876-3887"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11323634/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2022.emnlp-main.256","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Existing vision-text contrastive learning methods such as CLIP (Radford et al., 2021) aim to match paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general image-caption data available on the internet. Moreover, previous methods encounter many false negatives: images and reports from different patients may carry the same semantics yet are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, scaling the usable training data combinatorially at low cost. We also propose to replace the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We show that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training samples, MedCLIP outperforms the state-of-the-art method trained on ≈200K samples.
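The key change the abstract describes is swapping InfoNCE's hard one-to-one pairing targets for soft targets derived from medical knowledge, so an image and a report from different patients that share the same findings are no longer pushed apart as negatives. Below is a minimal PyTorch-style sketch of such a semantic matching loss, assuming each image and each report comes with a multi-hot vector of clinical-finding labels; the function name, the label-cosine soft-target construction, and the softmax normalization are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def semantic_matching_loss(img_emb, txt_emb, img_labels, txt_labels, temperature=0.07):
    """Contrastive loss with soft targets derived from clinical-finding labels.

    img_emb:    (N, d) image embeddings
    txt_emb:    (M, d) text embeddings; images and texts need not be paired (N != M is fine)
    img_labels: (N, k) multi-hot clinical findings per image
    txt_labels: (M, k) multi-hot clinical findings per report
    """
    # Cosine similarity between every image and every text.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (N, M)

    # Soft targets from label similarity: two studies sharing findings get a
    # high target weight instead of being treated as hard negatives.
    label_sim = F.cosine_similarity(
        img_labels.float().unsqueeze(1), txt_labels.float().unsqueeze(0), dim=-1
    )                                                          # (N, M)
    targets_i2t = F.softmax(label_sim, dim=-1)
    targets_t2i = F.softmax(label_sim.t(), dim=-1)

    # Soft cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = -(targets_i2t * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_t2i = -(targets_t2i * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Toy usage: 8 images and 12 unpaired reports sharing a space of 14 finding labels.
if __name__ == "__main__":
    img_emb = torch.randn(8, 512)
    txt_emb = torch.randn(12, 512)
    img_labels = torch.randint(0, 2, (8, 14))
    txt_labels = torch.randint(0, 2, (12, 14))
    print(semantic_matching_loss(img_emb, txt_emb, img_labels, txt_labels))
```

Because the targets come from label overlap rather than from pairing, images and texts can be sampled independently, which is what allows the usable training combinations to grow combinatorially as the abstract states.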
