Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models.

IEEE Journal of Biomedical and Health Informatics (IF 6.7; Zone 2, Medicine; JCR Q1, Computer Science, Information Systems) Pub Date: 2024-11-08 DOI: 10.1109/JBHI.2024.3494246

Konstantinos Vilouras, Pedro Sanchez, Alison Q O'Neil, Sotirios A Tsaftaris


Citations: 0

Abstract

Localizing the exact pathological regions in a given medical scan is an important imaging problem that traditionally requires a large amount of bounding box ground truth annotations to be accurately solved. However, there exist alternative, potentially weaker, forms of supervision, such as accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to perform this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains cross-attention mechanisms that implicitly align visual and textual features, thus leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any training on the target task, meaning that the model's weights remain frozen. To this end, we devise strategies to select features and also refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art approaches which explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology, and even outperforms them on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance at https://github.com/vios-s.
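The abstract does not say how the cross-attention features are selected, combined, or post-processed, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes the cross-attention maps have already been extracted from a frozen model as arrays of shape `(heads, h, w, tokens)`. It averages the maps for the tokens of the query phrase, upsamples them to a common resolution, min-max normalizes the result, thresholds it into a mask, and scores that mask against a bounding-box mask with IoU. The function names, the array layout, the nearest-neighbour upsampling, and the 0.5 threshold are all illustrative assumptions.

```python
import numpy as np


def upsample_nearest(a, size):
    """Nearest-neighbour upsampling of a square 2-D map to (size, size).

    Assumes `size` is an integer multiple of the map's side length.
    """
    factor = size // a.shape[0]
    return np.repeat(np.repeat(a, factor, axis=0), factor, axis=1)


def grounding_heatmap(attn_maps, token_ids, size):
    """Combine cross-attention maps into one normalized heatmap.

    attn_maps: list of arrays shaped (heads, h, w, tokens); a hypothetical
               layout, one array per selected layer or timestep.
    token_ids: indices of the text tokens that make up the query phrase.
    size:      side length of the output heatmap.
    """
    acc = np.zeros((size, size))
    for m in attn_maps:
        # Average over heads and over the selected tokens -> (h, w).
        per_phrase = m[..., token_ids].mean(axis=(0, 3))
        acc += upsample_nearest(per_phrase, size)
    acc /= len(attn_maps)
    # Min-max normalization to [0, 1]; no learnable parameters involved.
    lo, hi = acc.min(), acc.max()
    return (acc - lo) / (hi - lo + 1e-8)


def iou(pred_mask, target_mask):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred_mask, target_mask).sum()
    union = np.logical_or(pred_mask, target_mask).sum()
    return inter / union if union else 0.0


# Synthetic example: token 1 attends to the top-left quadrant at two resolutions.
m8 = np.zeros((2, 8, 8, 3))
m8[:, :4, :4, 1] = 1.0
m16 = np.zeros((2, 16, 16, 3))
m16[:, :8, :8, 1] = 1.0

heat = grounding_heatmap([m8, m16], token_ids=[1], size=16)
pred = heat > 0.5  # illustrative fixed threshold

box = np.zeros((16, 16), dtype=bool)
box[:8, :8] = True  # ground-truth bounding box as a mask
print(iou(pred, box))
```

Mean IoU, as reported in the abstract, would be the average of such per-sample scores over a benchmark. AUC-ROC is computed on the continuous heatmap rather than on a thresholded mask.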




Source journal

IEEE Journal of Biomedical and Health Informatics (Computer Science, Information Systems; Computer Science, Interdisciplinary Applications)

CiteScore: 13.60

Self-citation rate: 6.50%

Articles published: 1151

About the journal: IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.