{"title":"HiCur-NPC: Hierarchical Feature Fusion Curriculum Learning for Multi-Modal Foundation Model in Nasopharyngeal Carcinoma","authors":"Zipei Wang;Mengjie Fang;Linglong Tang;Jie Tian;Di Dong","doi":"10.1109/TMI.2025.3558775","DOIUrl":null,"url":null,"abstract":"Providing precise and comprehensive diagnostic information to clinicians is crucial for improving the treatment and prognosis of nasopharyngeal carcinoma. Multi-modal foundation models, which can integrate data from various sources, have the potential to significantly enhance clinical assistance. However, several challenges remain: (1) the lack of large-scale visual-language datasets for nasopharyngeal carcinoma; (2) the inability of existing pre-training and fine-tuning methods to capture the hierarchical features required for complex clinical tasks; (3) current foundation models having limited visual perception due to inadequate integration of multi-modal information. While curriculum learning can improve a model’s ability to handle multiple tasks through systematic knowledge accumulation, it still lacks consideration for hierarchical features and their dependencies, affecting knowledge gains. To address these issues, we propose the Hierarchical Feature Fusion Curriculum Learning method, which consists of three stages: visual knowledge learning, coarse-grained alignment, and fine-grained fusion. First, we introduce the Hybrid Contrastive Masked Autoencoder to pre-train visual encoders on 755K multi-modal images of nasopharyngeal carcinoma CT, MRI, and endoscopy to fully extract deep visual information. Then, we construct a 65K visual instruction fine-tuning dataset based on open-source data and clinician diagnostic reports, achieving coarse-grained alignment with visual information in a large language model. Finally, we design a Mixture of Experts Cross Attention structure for deep fine-grained fusion of global multi-modal information. Our model outperforms previously developed specialized models in all key clinical tasks for nasopharyngeal carcinoma, including diagnosis, report generation, tumor segmentation, and prognosis.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 10","pages":"3997-4009"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on medical imaging","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10959026/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Providing precise and comprehensive diagnostic information to clinicians is crucial for improving the treatment and prognosis of nasopharyngeal carcinoma. Multi-modal foundation models, which can integrate data from various sources, have the potential to significantly enhance clinical assistance. However, several challenges remain: (1) the lack of large-scale visual-language datasets for nasopharyngeal carcinoma; (2) the inability of existing pre-training and fine-tuning methods to capture the hierarchical features required for complex clinical tasks; (3) the limited visual perception of current foundation models due to inadequate integration of multi-modal information. While curriculum learning can improve a model's ability to handle multiple tasks through systematic knowledge accumulation, it does not account for hierarchical features and their dependencies, which limits knowledge gains. To address these issues, we propose the Hierarchical Feature Fusion Curriculum Learning method, which consists of three stages: visual knowledge learning, coarse-grained alignment, and fine-grained fusion. First, we introduce the Hybrid Contrastive Masked Autoencoder to pre-train visual encoders on 755K multi-modal nasopharyngeal carcinoma images spanning CT, MRI, and endoscopy, fully extracting deep visual information. Then, we construct a 65K-sample visual instruction fine-tuning dataset from open-source data and clinician diagnostic reports, achieving coarse-grained alignment of visual information with a large language model. Finally, we design a Mixture of Experts Cross Attention structure for deep fine-grained fusion of global multi-modal information. Our model outperforms previously developed specialized models on all key clinical tasks for nasopharyngeal carcinoma, including diagnosis, report generation, tumor segmentation, and prognosis.
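The abstract does not specify how the Hybrid Contrastive Masked Autoencoder combines its two objectives. The sketch below illustrates one plausible formulation: an MAE-style reconstruction loss on masked patches plus an InfoNCE contrastive term on two views. The `encoder`/`decoder`/`proj_head` callables, the masking scheme, the toy second view, and the loss weighting are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a hybrid contrastive + masked-autoencoder objective (PyTorch).
# All module interfaces here are hypothetical stand-ins for the paper's HCMAE.
import torch
import torch.nn.functional as F

def hybrid_cmae_loss(encoder, decoder, proj_head, x,
                     mask_ratio=0.75, temp=0.07, lam=1.0):
    """x: (batch, num_patches, patch_dim), already patchified."""
    b, n, d = x.shape

    # --- Masked-autoencoder branch: hide most patches, reconstruct them. ---
    num_keep = int(n * (1 - mask_ratio))
    perm = torch.rand(b, n, device=x.device).argsort(dim=1)
    keep_idx = perm[:, :num_keep]                      # indices of visible patches
    visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    latent = encoder(visible)                          # encode visible patches only
    recon = decoder(latent, keep_idx, n)               # hypothetical decoder signature
    mask = torch.ones(b, n, device=x.device)
    mask.scatter_(1, keep_idx, 0.0)                    # 1 = masked position
    rec_loss = ((recon - x) ** 2).mean(-1)
    rec_loss = (rec_loss * mask).sum() / mask.sum()    # loss only on masked patches

    # --- Contrastive branch: pull two views of the same image together (InfoNCE). ---
    z1 = F.normalize(proj_head(encoder(x).mean(1)), dim=-1)
    z2 = F.normalize(proj_head(encoder(x.flip(1)).mean(1)), dim=-1)  # toy second view
    logits = z1 @ z2.t() / temp
    targets = torch.arange(b, device=x.device)         # positives on the diagonal
    con_loss = F.cross_entropy(logits, targets)

    return rec_loss + lam * con_loss
```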
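Likewise, the Mixture of Experts Cross Attention structure is only named, not specified. The sketch below shows one way such a block could look: language tokens attend to the concatenated visual tokens, and a learned gate softly mixes several expert FFNs applied to the attended features. Expert count, gating scheme, and dimensions are assumptions for illustration, not the authors' design.

```python
# Minimal sketch of a Mixture-of-Experts cross-attention block (PyTorch).
import torch
import torch.nn as nn

class MoECrossAttention(nn.Module):
    def __init__(self, dim=768, num_heads=8, num_experts=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Each expert is a small FFN applied to the cross-attended features.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(dim, num_experts)   # token-wise soft routing
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # Cross attention: text tokens as queries, visual tokens as keys/values.
        q = self.norm_q(text_tokens)
        kv = self.norm_kv(visual_tokens)
        attended, _ = self.attn(q, kv, kv)
        # Soft mixture over experts, weighted per token by the gate.
        weights = self.gate(attended).softmax(dim=-1)                          # (B, T, E)
        expert_out = torch.stack([e(attended) for e in self.experts], dim=-1)  # (B, T, D, E)
        fused = (expert_out * weights.unsqueeze(2)).sum(-1)                    # (B, T, D)
        return text_tokens + fused   # residual connection

# Toy usage: fuse 16 text tokens with 256 visual tokens
# (e.g., features pooled from CT, MRI, and endoscopy encoders).
block = MoECrossAttention()
out = block(torch.randn(2, 16, 768), torch.randn(2, 256, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```

A soft, token-wise gate is used here for simplicity; a production design might instead use sparse top-k routing to keep the per-token compute cost constant as experts are added.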