MammoVLM: A generative large vision–language model for mammography-related diagnostic assistance
Zhenjie Cao, Zhuo Deng, Jie Ma, Jintao Hu, Lan Ma
Information Fusion, Volume 118, Article 102998 (published 2025-02-10). DOI: 10.1016/j.inffus.2025.102998
Abstract
Inspired by the recent success of large language models (LLMs) in the general domain, many large multimodal models, such as vision–language models, have been developed to tackle problems across modalities.
Breast cancer is now the deadliest cancer worldwide, and mammography serves as the primary screening approach for its early detection. Patients have a practical need for a diagnostic assistant that can answer follow-up questions about their mammography screening. We believe large vision–language models have great potential to address this need; however, applying off-the-shelf large models directly in medical scenarios typically yields unsatisfactory results.
In this work, we present MammoVLM, a large vision–language model that assists patients with questions related to their mammograms. MammoVLM has a sparse visual mixture-of-experts (MoE) module that routes to different visual encoders based on the density of the input image. In addition, we build a novel projection module, UMiCon, which leverages unimodal and multimodal contrastive learning strategies to improve the alignment between visual and textual features. GLM-4 9B, an open-source LLM, is attached after these multimodal modules and generates answers after supervised fine-tuning. We build our own dataset of 33,630 mammogram studies with diagnostic reports from 30,495 patients. MammoVLM shows extraordinary potential in multi-round interactive dialogue. Our experimental results show that it not only outperforms other leading VLMs but also demonstrates professional capability comparable to that of a junior radiologist.
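To make the described architecture more concrete, the sketch below illustrates two of the ideas from the abstract: a density-gated sparse mixture of visual encoders and a projection head trained with a contrastive alignment loss. This is a minimal illustration, not the authors' implementation: the class names, dimensions, two-expert setup, threshold-based gating rule, and the symmetric InfoNCE loss are all assumptions, and the GLM-4 9B language-model stage is omitted.

```python
# Illustrative sketch only (assumed names and shapes), not the MammoVLM code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DensityGatedVisualMoE(nn.Module):
    """Routes each image to one of several visual encoders based on a
    scalar breast-density score (hypothetical top-1 gating rule)."""

    def __init__(self, dim=256, num_experts=2):
        super().__init__()
        # Stand-ins for full mammogram encoders (e.g. ViT backbones).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.GELU())
            for _ in range(num_experts)
        )
        self.num_experts = num_experts

    def forward(self, images, density_score):
        # Low-density images go to expert 0, dense ones to expert 1.
        # For clarity all experts run here; a truly sparse implementation
        # would dispatch per-expert sub-batches instead.
        expert_idx = (density_score >= 0.5).long().clamp(max=self.num_experts - 1)
        out = torch.stack([e(images) for e in self.experts], dim=1)  # (B, E, D)
        return out[torch.arange(images.size(0)), expert_idx]         # (B, D)


class ContrastiveProjector(nn.Module):
    """Projects visual and text features into a shared space and computes a
    symmetric InfoNCE loss, one plausible form of multimodal contrastive
    alignment before the features are passed to the language model."""

    def __init__(self, vis_dim=256, txt_dim=256, shared_dim=128, temperature=0.07):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        self.temperature = temperature

    def forward(self, vis_feat, txt_feat):
        v = F.normalize(self.vis_proj(vis_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        logits = v @ t.T / self.temperature
        targets = torch.arange(v.size(0), device=v.device)
        loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))
        return v, loss


if __name__ == "__main__":
    images = torch.randn(4, 1, 64, 64)    # toy mammogram crops
    density = torch.rand(4)                # toy per-image density scores
    txt_feat = torch.randn(4, 256)         # toy report-text embeddings

    moe = DensityGatedVisualMoE()
    projector = ContrastiveProjector()

    vis_feat = moe(images, density)
    vis_tokens, align_loss = projector(vis_feat, txt_feat)
    print(vis_tokens.shape, align_loss.item())  # (4, 128) and a scalar loss
```

In a full pipeline, the projected visual tokens would be interleaved with the tokenized question and fed to the LLM for supervised fine-tuning; the contrastive loss here only illustrates how visual and textual features could be pulled into a shared embedding space beforehand.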
Journal Introduction
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.