Vision-language foundation model for generalizable nasal disease diagnosis using unlabeled endoscopic records

Impact Factor: 7.6 | Zone 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Journal: Pattern Recognition | Pub Date: 2025-09-01 | Epub Date: 2025-04-04 | DOI: 10.1016/j.patcog.2025.111646
Xueli Liu, Wentao Gong, Xiao Chen, Zhen Li, Yinlong Liu, Li Wang, Quan Liu, Xicai Sun, Xiaofeng Liu, Xinrong Chen, Yuxuan Shi, Hongmeng Yu
Pattern Recognition, Volume 165, Article 111646. Journal article (Epub); not open access. https://www.sciencedirect.com/science/article/pii/S0031320325003061
Citations: 0

Abstract

Medical artificial intelligence (AI) holds significant potential for identifying signs of health conditions in nasal endoscopic images, thereby accelerating the diagnosis of diseases and systemic disorders. However, the performance of AI models relies heavily on expert annotations, and these models are usually task-specific, with limited generalization performance across various clinical applications. In this paper, we introduce NasVLM, a Nasal Vision-Language foundation Model designed to extract universal representations from unlabeled nasal endoscopic data. Additionally, we construct a large-scale nasal endoscopic pre-training dataset and three downstream validation datasets from routine diagnostic records. The core strength of NasVLM lies in its ability to learn cross-modal semantic representations and perform multi-granular report-image alignment without depending on expert annotations. Furthermore, to the best of our knowledge, it is the first medical foundation model that effectively aligns a medical report with multiple images of different anatomic regions, facilitated by a well-designed hierarchical report-supervised learning framework. The experimental results demonstrate that NasVLM has superior generalization performance across diverse diagnostic tasks and surpasses state-of-the-art self- and report-supervised methods in disease classification and lesion localization, especially in scenarios requiring label-efficient fine-tuning. For instance, NasVLM can distinguish normal nasopharynx (NOR) from abnormalities (benign hyperplasia, BH, and nasopharyngeal carcinoma, NPC) with an accuracy of 91.38% (95% CI, 90.59 to 92.17) and differentiate NPC from BH and NOR with an accuracy of 81.45% (95% CI, 80.21 to 82.67) on the multi-center NPC-Screen dataset using only 1% labeled data, on par with the performance of traditional supervised methods using 100% labeled data.
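
As a rough illustration of what report-image alignment can look like in code, the sketch below implements a generic CLIP-style symmetric contrastive loss in NumPy, where each exam contributes one report and several images. It is not the NasVLM framework: the abstract does not give implementation details, so the function names, the mean pooling over an exam's images, and the temperature value are all assumptions made for this example, and only a single (exam-level) granularity is shown.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def report_image_contrastive_loss(image_emb, report_emb, temperature=0.07):
    """Symmetric InfoNCE loss between exams and their reports.

    image_emb:  (B, K, D) array, K image embeddings per exam (hypothetical encoder output).
    report_emb: (B, D) array, one report embedding per exam (hypothetical encoder output).
    """
    # Assumption: an exam-level visual embedding is the mean over its K images.
    exam_emb = l2_normalize(image_emb.mean(axis=1))
    report_emb = l2_normalize(report_emb)

    # Cosine similarity between every exam and every report, scaled by temperature.
    logits = exam_emb @ report_emb.T / temperature  # shape (B, B)

    # Matching exam/report pairs sit on the diagonal.
    idx = np.arange(logits.shape[0])
    loss_image_to_report = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_report_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_report + loss_report_to_image)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reports = rng.normal(size=(4, 16))                              # 4 exams, 16-dim embeddings
    images = reports[:, None, :] + 0.01 * rng.normal(size=(4, 3, 16))  # 3 images per exam
    print("loss:", report_image_contrastive_loss(images, reports))
```

In this toy setup the image embeddings are built to lie close to their own report, so the loss is small; pairing exams with the wrong reports drives it up, which is the signal such a loss uses in place of expert labels.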
Source journal
Pattern Recognition
Category: Engineering & Technology - Engineering: Electrical & Electronic
CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months
About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.
Latest articles from the journal
Multiple similarity and multiple kernel fusion based on graph inference network for circRNA-disease association prediction
FMaMIL: Synergistic spatial-frequency Mamba multi-instance learning for weakly supervised pathology lesion segmentation
Flexible multi-view feature selection with semi-supervised label semantic alignment
Unsupervised feature selection based on dual-graph clustering learning and adaptive weighting
Model-based clustering of music pieces