MammoVLM: A generative large vision–language model for mammography-related diagnostic assistance

Information Fusion · IF 15.5 · CAS Tier 1, Computer Science · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-10 · DOI: 10.1016/j.inffus.2025.102998
Zhenjie Cao, Zhuo Deng, Jie Ma, Jintao Hu, Lan Ma
Information Fusion, Volume 118 (2025), Article 102998.
Cited by: 0

Abstract

Inspired by the recent success of large language models (LLMs) in the general domain, many large multimodal models, such as vision–language models (VLMs), have been developed to tackle problems that span modalities.
Breast cancer is now the most deadly cancer worldwide, and mammography serves as the primary screening approach for its early detection. Patients have a practical need for a diagnostic assistant that can answer follow-up questions about their mammography screening. We believe large vision–language models have great potential to address this need; however, applying off-the-shelf large models directly to medical scenarios usually yields unsatisfactory results.
In this work, we present MammoVLM, a large vision–language model that assists patients with questions related to their mammograms. MammoVLM has a sparse visual mixture-of-experts (MoE) module that routes the input image to different encoders based on its breast density. In addition, we build a novel projection module, UMiCon, that leverages unimodal and multimodal contrastive learning strategies to improve the alignment between visual and textual features. GLM-4 9B, an open-source LLM, is attached after these multimodal modules and generates answers following supervised fine-tuning. We build our own dataset of 33,630 mammogram studies with diagnostic reports from 30,495 patients. MammoVLM shows strong potential in multi-round interactive dialogue: our experimental results show that it not only outperforms other leading VLMs but also demonstrates a professional capability comparable to that of a junior radiologist.
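The abstract describes the visual MoE only at a high level. As a rough illustrative sketch (all names, dimensions, and the gating rule here are hypothetical, not the paper's implementation), density-conditioned sparse routing over several candidate encoders could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for density-specialised visual encoders: each
# "expert" is a fixed random projection from a flattened image to a feature.
N_EXPERTS, FEAT_DIM, IMG_DIM = 4, 16, 64
experts = [rng.normal(size=(IMG_DIM, FEAT_DIM)) for _ in range(N_EXPERTS)]
gate_w = rng.normal(size=(IMG_DIM, N_EXPERTS))  # gate weights; learned in practice

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sparse_moe_encode(image_vec, top_k=1):
    """Route an image to the top-k experts scored by the gate (sparse MoE)."""
    scores = softmax(image_vec @ gate_w)        # density-dependent gate scores
    chosen = np.argsort(scores)[::-1][:top_k]   # sparse: keep only top-k experts
    weights = scores[chosen] / scores[chosen].sum()
    feats = [w * (image_vec @ experts[i]) for i, w in zip(chosen, weights)]
    return np.sum(feats, axis=0), chosen

img = rng.normal(size=IMG_DIM)
feat, used = sparse_moe_encode(img, top_k=1)
```

With `top_k=1` only one encoder runs per image, which is the usual motivation for sparse routing: per-density specialisation without paying the compute cost of all experts.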
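UMiCon's exact objective is not given in the abstract. A common way to combine unimodal and multimodal contrastive terms is a symmetric InfoNCE loss over matched pairs; the sketch below uses random stand-in features, and the function names, dimensions, and temperature are assumptions rather than the paper's settings:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched pairs (a_i, b_i) are positives."""
    a, b = l2norm(a), l2norm(b)
    logits = a @ b.T / temperature
    n = len(a)

    def xent(lg):
        # cross-entropy with the targets on the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(logp[np.arange(n), np.arange(n)])

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(1)
v = rng.normal(size=(8, 32))                   # projected visual features
t = rng.normal(size=(8, 32))                   # report-text features
v_aug = v + 0.01 * rng.normal(size=v.shape)    # augmented view of same images

uni = info_nce(v, v_aug)   # unimodal term: image vs. augmented image
multi = info_nce(v, t)     # multimodal term: image vs. report text
total = uni + multi
```

The unimodal term pulls two views of the same image together, while the multimodal term aligns each image with its own report; here the unimodal loss is near zero because the augmented views are almost identical, whereas the random image–text pairs give a loss near log(batch size).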
Source journal

Information Fusion (Engineering & Technology — Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Articles per year: 161
Review time: 7.9 months
Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems are welcome.
Latest articles in this journal

FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detection
Decentralized Federated Learning with Multimodal Prototypes for Heterogeneous Data
Generalization of Knowledge Graph Grounded Models: A Multi-Perspective Survey
Security and Privacy in LLMs: A Comprehensive Survey of Threats and Mitigation Strategies
Resilient Distributed Kalman Filtering for Cyber-Physical Systems via Mean Subsequence Reduction