Vision-language foundation model for generalizable nasal disease diagnosis using unlabeled endoscopic records

IF 7.6 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pattern Recognition Pub Date : 2025-09-01 Epub Date: 2025-04-04 DOI:10.1016/j.patcog.2025.111646

Xueli Liu , Wentao Gong , Xiao Chen , Zhen Li , Yinlong Liu , Li Wang , Quan Liu , Xicai Sun , Xiaofeng Liu , Xinrong Chen , Yuxuan Shi , Hongmeng Yu

{"title":"Vision-language foundation model for generalizable nasal disease diagnosis using unlabeled endoscopic records","authors":"Xueli Liu , Wentao Gong , Xiao Chen , Zhen Li , Yinlong Liu , Li Wang , Quan Liu , Xicai Sun , Xiaofeng Liu , Xinrong Chen , Yuxuan Shi , Hongmeng Yu","doi":"10.1016/j.patcog.2025.111646","DOIUrl":null,"url":null,"abstract":"<div><div>Medical artificial intelligence (AI) holds significant potential in identifying signs of health conditions in nasal endoscopic images, thereby accelerating the diagnosis of diseases and systemic disorders. However, the performance of AI models heavily relies on expert annotations, and these models are usually task-specific with limited generalization performance across various clinical applications. In this paper, we introduce NasVLM, a Nasal Vision-Language foundation Model designed to extract universal representations from unlabeled nasal endoscopic data. Additionally, we construct a large-scale nasal endoscopic pre-training dataset and three downstream validation datasets from routine diagnostic records. The core strength of NasVLM lies in its ability to learn cross-modal semantic representations and perform multi-granular report-image alignment without depending on expert annotations. Furthermore, to the best of our knowledge, it is the first medical foundation model that effectively aligns medical report with multiple images of different anatomic regions, facilitated by a well-designed hierarchical report-supervised learning framework. The experimental results demonstrate that NasVLM has superior generalization performance across diverse diagnostic tasks and surpasses state-of-the-art self- and report-supervised methods in disease classification and lesion localization, especially in scenarios requiring label-efficient fine-tuning. For instance, NasVLM can distinguish normal nasopharynx (NOR) from abnormalities (benign hyperplasia, BH, and nasopharyngeal carcinoma, NPC) with an accuracy of 91.38% (95% CI, 90.59 to 92.17) and differentiate NPC from BH and NOR with an accuracy of 81.45% (95% CI, 80.21 to 82.67) on the multi-center NPC-Screen dataset using only 1% labeled data, on par with the performance of traditional supervised methods using 100% labeled data.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111646"},"PeriodicalIF":7.6000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325003061","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/4 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Medical artificial intelligence (AI) holds significant potential in identifying signs of health conditions in nasal endoscopic images, thereby accelerating the diagnosis of diseases and systemic disorders. However, the performance of AI models heavily relies on expert annotations, and these models are usually task-specific with limited generalization performance across various clinical applications. In this paper, we introduce NasVLM, a Nasal Vision-Language foundation Model designed to extract universal representations from unlabeled nasal endoscopic data. Additionally, we construct a large-scale nasal endoscopic pre-training dataset and three downstream validation datasets from routine diagnostic records. The core strength of NasVLM lies in its ability to learn cross-modal semantic representations and perform multi-granular report-image alignment without depending on expert annotations. Furthermore, to the best of our knowledge, it is the first medical foundation model that effectively aligns medical report with multiple images of different anatomic regions, facilitated by a well-designed hierarchical report-supervised learning framework. The experimental results demonstrate that NasVLM has superior generalization performance across diverse diagnostic tasks and surpasses state-of-the-art self- and report-supervised methods in disease classification and lesion localization, especially in scenarios requiring label-efficient fine-tuning. For instance, NasVLM can distinguish normal nasopharynx (NOR) from abnormalities (benign hyperplasia, BH, and nasopharyngeal carcinoma, NPC) with an accuracy of 91.38% (95% CI, 90.59 to 92.17) and differentiate NPC from BH and NOR with an accuracy of 81.45% (95% CI, 80.21 to 82.67) on the multi-center NPC-Screen dataset using only 1% labeled data, on par with the performance of traditional supervised methods using 100% labeled data.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

使用未标记的内窥镜记录进行鼻部疾病诊断的视觉语言基础模型

医学人工智能（AI）在识别鼻内窥镜图像中的健康状况迹象方面具有巨大潜力，从而加速疾病和全身性疾病的诊断。然而，人工智能模型的性能严重依赖于专家注释，这些模型通常是特定于任务的，在各种临床应用中的泛化性能有限。在本文中，我们介绍了NasVLM，一个鼻视觉语言基础模型，旨在从未标记的鼻内窥镜数据中提取通用表示。此外，我们构建了一个大规模的鼻内窥镜预训练数据集和三个来自常规诊断记录的下游验证数据集。NasVLM的核心优势在于它能够学习跨模态语义表示和执行多粒度报告-图像对齐，而不依赖于专家注释。此外，据我们所知，它是第一个医学基础模型，可以有效地将医学报告与不同解剖区域的多个图像对齐，并通过精心设计的分层报告监督学习框架提供便利。实验结果表明，NasVLM在各种诊断任务中具有优越的泛化性能，并且在疾病分类和病灶定位方面优于最先进的自我和报告监督方法，特别是在需要标签高效微调的场景中。例如，NasVLM可以区分正常鼻咽（NOR）与异常（良性增生，BH和鼻咽癌，NPC），准确率为91.38% (95% CI， 90.59至92.17)，并且在多中心NPC- screen数据集上，仅使用1%的标记数据就可以区分NPC与BH和NOR，准确率为81.45% (95% CI， 80.21至82.67)，与使用100%标记数据的传统监督方法的性能相当。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Pattern Recognition 工程技术-工程：电子与电气

CiteScore

14.40

自引率

16.20%

发文量

683

审稿时长

5.6 months

期刊介绍： The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.