Feature Decomposition via Shared Low-rank Matrix Recovery for CT Report Generation
Yuanhe Tian, Yan Song
Pub Date : 2025-11-03 | DOI: 10.1109/tmi.2025.3628159
Generating reports for medical images is an important task in medical automation that not only provides valuable objective diagnostic evidence but also alleviates the workload of radiologists. Many existing studies focus on chest X-rays, which typically consist of one or a few images, while less attention is paid to other medical image types, such as computed tomography (CT), which contains a large number of continuous slices. Many studies on CT report generation (CTRG) rely on convolutional networks or standard Transformers to model CT slice representations and combine them to obtain CT features, yet relatively little research has focused on subtle lesion features and volumetric continuity. In this paper, we propose shared low-rank matrix recovery (S-LMR) to decompose CT slices into shared anatomical patterns and lesion-focused features, together with continuous slice encoding (CSE) to explicitly model inter-slice continuity and capture progressive changes across adjacent slices, which are subsequently integrated with a large language model (LLM) for report generation. Specifically, S-LMR separates the common patterns from the sparse lesion-focused features to highlight clinically significant information. Based on the outputs of S-LMR, CSE captures inter-slice relationships within a dedicated Transformer encoder and aligns the resulting visual features with textual information, thereby instructing the LLM to produce a CT report. Experimental results on benchmark datasets for CTRG show that our approach outperforms strong baselines and existing models, demonstrating state-of-the-art performance. Analyses further confirm that S-LMR and CSE effectively capture key evidence, leading to more accurate CTRG.
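The low-rank-plus-sparse split that S-LMR performs on CT slices can be illustrated with generic robust PCA (principal component pursuit): the matrix of vectorized slices is decomposed into a low-rank part (shared anatomical patterns) and a sparse part (slice-specific, lesion-focused deviations). The NumPy sketch below is a minimal illustration of that idea with standard default weights; it is not the authors' S-LMR implementation, and the variable names and the reshaping of a CT volume into a slice matrix are assumptions for illustration.

    import numpy as np

    def soft_threshold(X, tau):
        # Elementwise soft thresholding (proximal operator of the l1 norm).
        return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

    def svt(X, tau):
        # Singular value thresholding (proximal operator of the nuclear norm).
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return (U * soft_threshold(s, tau)) @ Vt

    def robust_pca(M, lam=None, mu=None, n_iter=200, tol=1e-7):
        # Decompose M into low-rank L (shared patterns) + sparse S (deviations)
        # by minimizing ||L||_* + lam * ||S||_1 subject to L + S = M.
        m, n = M.shape
        lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
        mu = mu if mu is not None else 0.25 * m * n / (np.abs(M).sum() + 1e-12)
        L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
        for _ in range(n_iter):
            L = svt(M - S + Y / mu, 1.0 / mu)             # low-rank update
            S = soft_threshold(M - L + Y / mu, lam / mu)  # sparse update
            residual = M - L - S
            Y += mu * residual                            # multiplier update
            if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
                break
        return L, S

    # Hypothetical usage: rows are vectorized slices of one CT volume.
    # slices = volume.reshape(volume.shape[0], -1)   # (num_slices, H*W)
    # shared_patterns, lesion_features = robust_pca(slices)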
{"title":"Feature Decomposition via Shared Low-rank Matrix Recovery for CT Report Generation.","authors":"Yuanhe Tian,Yan Song","doi":"10.1109/tmi.2025.3628159","DOIUrl":"https://doi.org/10.1109/tmi.2025.3628159","url":null,"abstract":"Generating reports for medical images is an important task in medical automation that not only provides valuable objective diagnostic evidence but also alleviates the workload of radiologists. Many existing studies focus on chest X-rays that typically consist of one or a few images, where less attention is paid to other medical image types, such as computed tomography (CT) that contain a large number of continuous images. Many studies on CT report generation (CTRG) rely on convolutional networks or standard Transformers to model CT slice representation and combine them to obtain CT features, yet relatively little research has focused on subtle lesion features and volumetric continuity. In this paper, we propose shared low-rank matrix recovery (S-LMR) to decompose CT slices into shared anatomical patterns and lesion-focused features, together with continuous slice encoding (CSE) to explicitly model inter-slice continuity and capture progressive changes across adjacent slices, which are subsequently integrated with a large language model (LLM) for report generation. Specifically, the S-LMR separates the common patterns from the sparse lesion-focused features to highlight clinically significant information. Based on the outputs of S-LMR, CSE captures inter-slice relationships within a dedicated Transformer encoder and aligns the resulting visual features with textual information, thereby instructing the LLM to produce a CT report. Experiment results on benchmark datasets for CTRG show that our approach outperforms strong baselines and existing models, demonstrating state-of-the-art performance. Analyses further confirm that S-LMR and CSE effectively capture key evidence, leading to more accurate CTRG.","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"1 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145433843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MCS-Stain: Boosting FFPE-to-HE Virtual Staining with Multiple Cell Semantics
Yihuang Hu, Zhicheng Du, Weiping Lin, Shurong Yang, Lequan Yu, Guojun Zhang, Liansheng Wang
Pub Date : 2025-11-03 | DOI: 10.1109/tmi.2025.3628174
The diagnosis of cancer primarily relies on pathological slides stained with hematoxylin and eosin (HE). These slides are typically prepared from tissue samples that have been fixed in formalin and embedded in paraffin (FFPE). However, the traditional process of staining FFPE samples with HE is time-consuming and resource-intensive. Recent advances in virtual staining technologies, driven by digital pathology and generative models, offer a promising alternative. However, the blurred structures in FFPE images pose unique challenges to achieving high-quality FFPE-to-HE virtual staining. In this context, we develop a novel Multiple Cell Semantics-guided supervised generative adversarial model, MCS-Stain. Specifically, the guidance consists of three components: (1) pretrained cell semantic guidance, which aligns the intermediate features of real and virtual images extracted by a pretrained cell segmentation model (PCSM); (2) cell mask guidance, which introduces explicit cell information as part of the discriminator input through channel concatenation; and (3) dynamic cell semantic guidance, which aligns the dynamic intermediate features embedded in the generator during training. Comparative results on FFPE-to-HE datasets demonstrate that MCS-Stain outperforms existing state-of-the-art (SOTA) methods with substantial qualitative and quantitative improvements. Results across various PCSMs and data sources further confirm its effectiveness and robustness. Notably, the dynamic cell semantic guidance exhibits strong potential beyond FFPE-to-HE virtual staining, as further demonstrated by virtual staining from HE images to immunohistochemical (IHC) images. In general, MCS-Stain presents a promising avenue for advancing virtual staining techniques. Code is available at https://github.com/huyihuang/MCS-Stain.
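The "pretrained cell semantic guidance" above is essentially a perceptual-style alignment loss computed on intermediate features of a frozen cell segmentation network. Below is a minimal PyTorch sketch of such a loss, assuming a roughly sequential PCSM whose child modules can be iterated; the chosen layers, the L1 distance, and the class name are illustrative assumptions rather than the exact MCS-Stain design. The cell mask guidance can be sketched in one line as well: the discriminator sees torch.cat([image, cell_mask], dim=1) instead of the image alone.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CellSemanticAlignLoss(nn.Module):
        # Aligns intermediate features of real and virtual HE images inside a
        # frozen pretrained cell segmentation model (PCSM). Illustrative sketch.
        def __init__(self, pcsm: nn.Module, feature_layers=(2, 4)):
            super().__init__()
            self.pcsm = pcsm.eval()
            for p in self.pcsm.parameters():
                p.requires_grad_(False)      # the PCSM stays frozen
            self.feature_layers = set(feature_layers)

        def _features(self, x):
            feats = []
            for i, layer in enumerate(self.pcsm.children()):  # assumes sequential children
                x = layer(x)
                if i in self.feature_layers:
                    feats.append(x)
            return feats

        def forward(self, virtual_he, real_he):
            loss = 0.0
            for fv, fr in zip(self._features(virtual_he), self._features(real_he)):
                loss = loss + F.l1_loss(fv, fr)
            return loss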
{"title":"MCS-Stain: Boosting FFPE-to-HE Virtual Staining with Multiple Cell Semantics.","authors":"Yihuang Hu,Zhicheng Du,Weiping Lin,Shurong Yang,Lequan Yu,Guojun Zhang,Liansheng Wang","doi":"10.1109/tmi.2025.3628174","DOIUrl":"https://doi.org/10.1109/tmi.2025.3628174","url":null,"abstract":"The diagnosis of cancer primarily relies on pathological slides stained with hematoxylin and eosin (HE). These slides are typically prepared from tissue samples that have been fixed in formalin and embedded in paraffin (FFPE). However, the traditional process of staining FFPE samples with HE is time-consuming and resource-intensive. Recent advances in virtual staining technologies, driven by digital pathology and generative models, offer a promising alternative. However, the blurred structures in FFPE images pose unique challenges to achieving high-quality FFPE-to-HE virtual staining. In this context, we developed a novel Multiple Cell Semantics-guided supervised generative adversarial model, MCS-Stain. Specifically, the guidance consists of three components: (1) pretrained cell semantic guidance, aligning the powerful intermediate features of real and virtual images, embedded in the pretrained cell segmentation model (PCSM); (2) cell mask guidance, introducing comprehensible cell information which serves as part of the input to the discriminator through channel concatenation; (3) dynamic cell semantic guidance, aligning the dynamic intermediate features embedded in the generator during training. The comparative results on FFPE-to-HE datasets demonstrated that MCS-Stain outperforms existing state-of-the-art (SOTA) methods with substantial qualitative and quantitative improvements. Results across various PCSMs and data sources further confirmed its effectiveness and robustness. Notably, the dynamic cell semantic exhibits strong potential beyond FFPE-to-HE virtual staining, further demonstrated by virtual staining from HE images to immunohistochemical (IHC) images. In general, MCS-Stain presents a promising avenue to advance virtual staining techniques. Code is available at https://github.com/huyihuang/MCS-Stain.","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"69 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145433842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2
Yuwen Chen, Zafer Yildiz, Qihang Li, Yaqian Chen, Haoyu Dong, Hanxue Gu, Nicholas Konz, Maciej A. Mazurowski
Pub Date : 2025-11-03 | DOI: 10.1109/tmi.2025.3627954
{"title":"Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2","authors":"Yuwen Chen, Zafer Yildiz, Qihang Li, Yaqian Chen, Haoyu Dong, Hanxue Gu, Nicholas Konz, Maciej A. Mazurowski","doi":"10.1109/tmi.2025.3627954","DOIUrl":"https://doi.org/10.1109/tmi.2025.3627954","url":null,"abstract":"","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"33 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145434217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sub-0.5 mm Resolution PET Scanner with 3-Layer DOI Detectors for Rodent Neuroimaging
Han Gyu Kang, Hideaki Tashima, Hidekatsu Wakizaka, Makoto Higuchi, Taiga Yamaya
Pub Date : 2025-10-31 | DOI: 10.1109/tmi.2025.3627815
Spatial resolution is the most important parameter for preclinical positron emission tomography (PET) to visualize mouse brain function with high quantification accuracy. However, the spatial resolution of PET has been limited to over 0.5 mm, which causes a substantial partial volume effect, especially for small mouse brain structures. In this study, we present the initial results of a mouse-brain-dedicated PET scanner that can achieve sub-0.5 mm resolution. The ring diameter and axial coverage of the PET scanner are 48 mm and 23.4 mm, respectively. To encode depth-of-interaction (DOI) information, three layers of lutetium yttrium oxyorthosilicate crystals were stacked in a staggered configuration and coupled to a 5×5 array of silicon photomultipliers with a pixel pitch of 2.4 mm. The crystal pitch and total thickness are 0.8 mm and 11 mm, respectively. The PET performance was characterized according to the National Electrical Manufacturers Association NU4-2008 standard. In vivo mouse brain imaging was carried out with 18F-FITM and 18F-FDG tracers. The average radial resolution from the center to a 10 mm offset was 0.67±0.06 mm with filtered back projection. The 0.45 mm diameter rods were identified clearly with an iterative reconstruction algorithm. To the best of our knowledge, this is the first separate identification of the hypothalamus, amygdala, and cerebellar nuclei of the mouse brain. The developed PET scanner achieved sub-0.5 mm resolution, thereby visualizing small mouse brain structures with high quantification accuracy.
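As a rough sanity check on these numbers (a textbook approximation, not an analysis from the paper): at the center of the ring, the coincidence response of a crystal pair of width d has a FWHM of about d/2, and photon acollinearity contributes roughly 0.0022·D for ring diameter D, so for this geometry

\[
\mathrm{FWHM}_{\mathrm{geom}} \approx \frac{d}{2} = \frac{0.8\,\mathrm{mm}}{2} = 0.4\,\mathrm{mm},
\qquad
\mathrm{FWHM}_{\mathrm{acol}} \approx 0.0022 \times 48\,\mathrm{mm} \approx 0.11\,\mathrm{mm},
\]

which makes the reported 0.45 mm rod visibility and 0.67 mm average radial FBP resolution (which additionally folds in positron range, DOI decoding, and reconstruction effects) physically plausible.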
{"title":"Sub-0.5 mm Resolution PET Scanner with 3-Layer DOI Detectors for Rodent Neuroimaging.","authors":"Han Gyu Kang,Hideaki Tashima,Hidekatsu Wakizaka,Makoto Higuchi,Taiga Yamaya","doi":"10.1109/tmi.2025.3627815","DOIUrl":"https://doi.org/10.1109/tmi.2025.3627815","url":null,"abstract":"Spatial resolution is the most important parameter for preclinical positron emission tomography (PET) to visualize mouse brain function with high quantification accuracy. However, the spatial resolution of PET has been limited to over 0.5 mm, which causes a substantial partial volume effect especially for small mouse brain structures. In this study, we present the initial results of a mouse brain dedicated PET scanner that can achieve sub-0.5 mm resolution. The ring diameter and axial coverage of the PET scanner are 48 mm and 23.4 mm. To encode depth-of-interaction (DOI) information, 3-layers of lutetium yttrium oxyorthosilicate crystals were stacked in a staggered configuration and coupled to a 5×5 array of silicon photomultipliers having a pixel pitch of 2.4 mm. The crystal pitch and total thickness are 0.8 mm and 11 mm. The PET performance was characterized according to the National Electrical Manufacturers Association NU4-2008 standard. In vivo mouse brain imaging was carried out with 18F-FITM and 18F-FDG tracers. The average radial resolution from the center to 10 mm offset was 0.67±0.06 mm with filtered back projection. The 0.45 mm diameter rods were identified clearly with an iterative reconstruction algorithm. To the best of our knowledge, this is the first separate identification of the hypothalamus, amygdala, and cerebellar nuclei of mouse brain. The developed PET scanner achieved sub-0.5 mm resolution thereby visualizing small mouse brain structures with high quantification accuracy.","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"76 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145411488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Residual Compensation Model for Unsupervised PET Partial Volume Correction
Jianan Cui, Jiankai Wu, Zhongxue Wu, Jianzhong He, Qingrun Zeng, Zan Chen, Yuanjing Feng
Pub Date : 2025-10-31 | DOI: 10.1109/tmi.2025.3627516
Partial volume effect (PVE) arises from the limited spatial resolution of positron emission tomography (PET) scanners, causing significant quantitative biases that hinder accurate metabolic activity assessment. To address this problem, we propose an unsupervised deep residual compensation model (U-DRCM) for PET partial volume correction (PVC). U-DRCM first predicts an initial blur kernel for the PVE-affected PET image with a conditional blind deconvolution module (CBD module). Then, a conditional residual compensation module (CRC module) is introduced to compensate for the error caused by inaccurate blur kernel prediction. The whole model is unsupervised: it requires only a single patient's PET image as the training label and the corresponding MR image as the network input. The performance of U-DRCM was evaluated against several established PVC approaches, including Richardson-Lucy (RL), reblurred Van Cittert (RVC), iterative Yang (IY), neural blind deconvolution (NBD), and a deep convolutional neural network (DeepPVC), using both the simulated BrainWeb phantom and real clinical datasets. In the simulation study, U-DRCM consistently outperformed competing methods across multiple quantitative metrics, achieving a higher peak signal-to-noise ratio (PSNR), an improved structural similarity index (SSIM), and a lower root mean square error (RMSE). In the real clinical study, U-DRCM delivered substantial improvements in standardized uptake value (SUV) and standardized uptake value ratio (SUVR) across various brain volumes of interest (VOIs). Experimental results show that U-DRCM effectively mitigates the impact of PVE, resulting in high-quality PVC PET images with enhanced brain visualization.
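Among the classical PVC baselines listed above, Richardson-Lucy (RL) deconvolution is the simplest to write down: the estimate is repeatedly rescaled by the back-projected ratio between the observed image and the reblurred estimate. The NumPy/SciPy sketch below illustrates that baseline for a 3-D volume with a known, shift-invariant PSF; it shows the classical method only, not U-DRCM, and the iteration count and PSF handling are illustrative.

    import numpy as np
    from scipy.signal import fftconvolve

    def richardson_lucy(observed, psf, n_iter=30, eps=1e-8):
        # Classical Richardson-Lucy deconvolution for a PVE-affected PET volume.
        # observed: measured 3-D PET image; psf: normalized 3-D point spread function.
        estimate = np.full(observed.shape, observed.mean(), dtype=np.float64)
        psf_mirror = psf[::-1, ::-1, ::-1]
        for _ in range(n_iter):
            reblurred = fftconvolve(estimate, psf, mode="same")
            ratio = observed / (reblurred + eps)                      # data-fidelity ratio
            estimate *= fftconvolve(ratio, psf_mirror, mode="same")   # multiplicative update
        return estimate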
{"title":"Deep Residual Compensation Model for Unsupervised PET Partial Volume Correction.","authors":"Jianan Cui,Jiankai Wu,Zhongxue Wu,Jianzhong He,Qingrun Zeng,Zan Chen,Yuanjing Feng","doi":"10.1109/tmi.2025.3627516","DOIUrl":"https://doi.org/10.1109/tmi.2025.3627516","url":null,"abstract":"Partial volume effect (PVE) arises from the limited spatial resolution of positron emission tomography (PET) scanners, causing significant quantitative biases that hinder accurate metabolic activity assessment. To address these problems, we proposed an unsupervised deep residual compensation model (U-DRCM) for PET partial volume correction (PVC). U-DRCM first predicted an initial blur kernel for the PVE-affected PET image based on a conditional blind deconvolution module (CBD module). Then, a conditional residual compensation module (CRC module) was introduced to compensate for the error caused by inaccurate blur kernel prediction. The whole model is unsupervised which only needs a single patient's PET image as the training label and the corresponding MR image as the network input. The performance of U-DRCM was evaluated against several established PVC approaches, including Richardson-Lucy (RL), reblurred Van-Cittert (RVC), iterative Yang (IY), neural blind deconvolution (NBD), and deep convolutional neural network (DeepPVC) using both simulated BrainWeb phantom and real clinical datasets. In the simulation study, U-DRCM consistently outperformed competing methods across multiple quantitative metrics, achieved a higher peak signal-to-noise ratio (PSNR), an improved structural similarity index (SSIM), and a lower root mean square error (RMSE). For the real clinical study, U-DRCM delivered substantial improvements in standardized uptake value (SUV) and standardized uptake value ratio (SUVR) across various brain volumes of interest (VOIs). Experimental results show that U-DRCM effectively mitigates the impact of PVE, resulting in high-quality PVC PET images with enhanced brain visualization.","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"68 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145411484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prompting Lipschitz-constrained network for multiple-in-one sparse-view CT reconstruction
Baoshun Shi, Ke Jiang, Qiusheng Lian, Xinran Yu, Huazhu Fu
Pub Date : 2025-10-30 | DOI: 10.1109/tmi.2025.3627305
Despite significant advancements in deep learning-based sparse-view computed tomography (SVCT) reconstruction algorithms, these methods still encounter two primary limitations: (i) it is challenging to explicitly prove that the prior networks of deep unfolding algorithms satisfy Lipschitz constraints, due to their empirically designed nature; and (ii) the substantial storage cost of training a separate model for each sparse-view setting hinders practical clinical application. To address these issues, we design an explicitly provable Lipschitz-constrained network, dubbed LipNet, and integrate an explicit prompt module to provide discriminative knowledge of different sparse sampling settings, enabling multiple sparse-view configurations to be handled within a single model. Furthermore, we develop a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction, termed PromptCT, which embeds LipNet as its prior network to ensure the convergence of the corresponding iterative algorithm. In simulated and real data experiments, PromptCT outperforms benchmark reconstruction algorithms in multiple-in-one SVCT reconstruction, achieving higher-quality reconstructions with lower storage costs. On the theoretical side, we explicitly demonstrate that LipNet satisfies the boundary property, further proving its Lipschitz continuity and subsequently analyzing the convergence of the proposed iterative algorithms. The data and code are publicly available at https://github.com/shibaoshun/PromptCT.
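For readers unfamiliar with Lipschitz-constrained networks: one standard, provable way to bound a layer's Lipschitz constant is spectral normalization, which divides each weight by its largest singular value. The PyTorch sketch below shows that generic construction; it is exact for linear layers and a widely used approximation for convolutions, and it is not the paper's LipNet design, whose specific constraint and prompt module are defined in the paper.

    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    def lipschitz_conv_block(in_ch, out_ch):
        # Each linear map is spectrally normalized (largest singular value of the
        # reshaped weight <= 1); LeakyReLU is 1-Lipschitz, so the block's Lipschitz
        # constant is (approximately, for convolutions) bounded by 1.
        return nn.Sequential(
            spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)),
            nn.LeakyReLU(0.2, inplace=True),
            spectral_norm(nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)),
            nn.LeakyReLU(0.2, inplace=True),
        )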
{"title":"Prompting Lipschitz-constrained network for multiple-in-one sparse-view CT reconstruction.","authors":"Baoshun Shi,Ke Jiang,Qiusheng Lian,Xinran Yu,Huazhu Fu","doi":"10.1109/tmi.2025.3627305","DOIUrl":"https://doi.org/10.1109/tmi.2025.3627305","url":null,"abstract":"Despite significant advancements in deep learning-based sparse-view computed tomography (SVCT) reconstruction algorithms, these methods still encounter two primary limitations: (i) It is challenging to explicitly prove that the prior networks of deep unfolding algorithms satisfy Lipschitz constraints due to their empirically designed nature. (ii) The substantial storage costs of training a separate model for each setting in the case of multiple views hinder practical clinical applications. To address these issues, we elaborate an explicitly provable Lipschitz-constrained network, dubbed LipNet, and integrate an explicit prompt module to provide discriminative knowledge of different sparse sampling settings, enabling the treatment of multiple sparse view configurations within a single model. Furthermore, we develop a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction, termed PromptCT, which embeds LipNet as its prior network to ensure the convergence of its corresponding iterative algorithm. In simulated and real data experiments, PromptCT outperforms benchmark reconstruction algorithms in multiple-in-one SVCT reconstruction, achieving higher-quality reconstructions with lower storage costs. On the theoretical side, we explicitly demonstrate that LipNet satisfies boundary property, further proving its Lipschitz continuity and subsequently analyzing the convergence of the proposed iterative algorithms. The data and code are publicly available at https://github.com/shibaoshun/PromptCT.","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"6 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145403771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identification of Genetic Risk Factors Based on Disease Progression Derived From Modeling Longitudinal Phenotype Latent Pattern Representation
Meiling Wang, Wei Shao, Daoqiang Zhang, Qingshan Liu
Pub Date : 2025-10-30 | DOI: 10.1109/tmi.2025.3627406
Neurodegenerative disorders are characterized by the progressive impairment of memory and other cognitive functions. However, existing imaging genetics methods typically use longitudinal imaging phenotypes directly, ignoring the latent patterns of the longitudinal data during the progression process. Phenotypes across multiple time-points may exhibit latent patterns that can facilitate understanding of the progression process. Accordingly, in this paper, we explore complementary information from multiple time-points and simultaneously seek an underlying latent representation. Owing to the complementarity of multiple time-points, the latent representation depicts the data more comprehensively than any individual time-point, thereby mining an effective longitudinal phenotype latent pattern representation. Specifically, we first propose two latent pattern representation (LPR) models for longitudinal imaging phenotypes: linear LPR (lLPR), based on linear relationships between the latent representation and each time-point, and nonlinear LPR (nonlLPR), which uses neural networks to handle nonlinear relationships. Then, we calculate the imaging genetic association based on the latent pattern representation. Finally, we conduct experiments on both synthetic and real longitudinal imaging genetic data. The experimental results validate that our proposed approach outperforms several competing algorithms, establishes strong associations, and discovers consistent longitudinal imaging genetic biomarkers, thereby guiding disease interpretation.
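The linear variant (lLPR) can be read as a shared latent factorization across time-points: each time-point's phenotype matrix X_t is approximated as a time-point-specific loading W_t applied to one common latent representation Z. The NumPy sketch below fits that generic model by alternating least squares; the objective, the small ridge term, and all names are illustrative assumptions, not the authors' exact formulation.

    import numpy as np

    def linear_lpr(X_list, k, n_iter=100, ridge=1e-8, seed=0):
        # X_list: list of (p_t x n) phenotype matrices, one per time-point, over the
        # same n subjects. Fits W_t (p_t x k) and a shared Z (k x n) by minimizing
        # sum_t ||X_t - W_t @ Z||_F^2 with alternating least squares.
        rng = np.random.default_rng(seed)
        n = X_list[0].shape[1]
        Z = rng.standard_normal((k, n))
        I = ridge * np.eye(k)
        for _ in range(n_iter):
            # W_t update with Z fixed: W_t = X_t Z^T (Z Z^T)^-1
            ZZt = Z @ Z.T + I
            W_list = [np.linalg.solve(ZZt, Z @ X.T).T for X in X_list]
            # Z update with all W_t fixed: (sum_t W_t^T W_t) Z = sum_t W_t^T X_t
            A = sum(W.T @ W for W in W_list) + I
            B = sum(W.T @ X for W, X in zip(W_list, X_list))
            Z = np.linalg.solve(A, B)
        return W_list, Z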
{"title":"Identification of Genetic Risk Factors Based on Disease Progression Derived From Modeling Longitudinal Phenotype Latent Pattern Representation.","authors":"Meiling Wang,Wei Shao,Daoqiang Zhang,Qingshan Liu","doi":"10.1109/tmi.2025.3627406","DOIUrl":"https://doi.org/10.1109/tmi.2025.3627406","url":null,"abstract":"The characteristic of neurodegenerative disorders is the progressive impairment of memory and other cognitive functions. However, these existing imaging genetic methods only use longitudinal imaging phenotypes straightforwardly, ignoring the latent pattern of the longitudinal data in the progression process. The phenotypes across multiple time-points may exhibit the latent pattern that can be used to facilitate the understanding of the progression process. Accordingly, in this paper, we explore underlying complementary information from multiple time-points and simultaneously seek the underlying latent representation. With the complementarity of multiple time-points, the latent representation depicts data more comprehensively than each individual time-point, therefore mining effective longitudinal phenotype latent pattern representation. Specifically, we first propose two latent pattern representation (LPR) for longitudinal imaging phenotypes: linear LPR (lLPR), based on linear relationships between latent representation and each time-point, and nonlinear LPR (nonlLPR), based on neural networks to deal with nonlinear relationships. Then, we calculate the imaging genetic association based on the latent pattern representation. Finally, we conduct the experiments on both synthetic and real longitudinal imaging genetic data. Related experimental results validate that our proposed approach outperforms several competing algorithms, establishes strong associations, and discovers consistent longitudinal imaging genetic biomarkers, thereby guiding disease interpretation.","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"154 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145403774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Alignment and Imputation Network (AINet) for Breast Cancer Diagnosis with Multimodal Multi-view Ultrasound Images
Haoyuan Chen, Yonghao Li, Jiadong Zhang, Long Yang, Yiqun Sun, Yaling Chen, Shichong Zhou, Zhenhui Li, Xuejun Qian, Qi Xu, Dinggang Shen
Pub Date : 2025-10-24 | DOI: 10.1109/tmi.2025.3625254
Recently, numerous deep learning models have been proposed for breast cancer diagnosis using multimodal multi-view ultrasound images. However, their performance can be substantially degraded when interactions between different modalities and views are overlooked. Moreover, existing methods struggle to handle cases where certain modalities or views are missing, which limits their clinical applications. To address these issues, we propose a novel Alignment and Imputation Network (AINet) that integrates 1) alignment and imputation pre-training and 2) hierarchical fusion fine-tuning. Specifically, in the pre-training stage, cross-modal contrastive learning is employed to align features across different modalities, effectively capturing inter-modal interactions. To simulate missing-modality (view) scenarios, we randomly mask out features and then impute them by leveraging inter-modal and inter-view relationships. Following the clinical diagnosis procedure, the subsequent fine-tuning stage further incorporates modality-level and view-level fusion in a hierarchical manner. The proposed AINet is developed and evaluated on three datasets comprising 15,223 subjects in total. Experimental results demonstrate that AINet significantly outperforms state-of-the-art methods, particularly in handling missing modalities (views). This highlights its robustness and potential for real-world clinical applications.
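The cross-modal contrastive pre-training described above is commonly implemented as a symmetric InfoNCE objective over matched pairs of modality embeddings. The PyTorch snippet below is a generic CLIP-style sketch of such a loss, assuming one embedding per subject from each of two ultrasound modalities; the temperature value and the function name are illustrative, not AINet's exact configuration.

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(feat_a, feat_b, temperature=0.07):
        # feat_a, feat_b: (batch, dim) embeddings from two modalities; row i of each
        # tensor comes from the same subject, so the diagonal holds positive pairs.
        a = F.normalize(feat_a, dim=-1)
        b = F.normalize(feat_b, dim=-1)
        logits = a @ b.t() / temperature                 # (batch, batch) cosine similarities
        targets = torch.arange(a.size(0), device=a.device)
        loss_ab = F.cross_entropy(logits, targets)       # modality A -> modality B
        loss_ba = F.cross_entropy(logits.t(), targets)   # modality B -> modality A
        return 0.5 * (loss_ab + loss_ba)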
{"title":"An Alignment and Imputation Network (AINet) for Breast Cancer Diagnosis with Multimodal Multi-view Ultrasound Images.","authors":"Haoyuan Chen,Yonghao Li,Jiadong Zhang,Long Yang,Yiqun Sun,Yaling Chen,Shichong Zhou,Zhenhui Li,Xuejun Qian,Qi Xu,Dinggang Shen","doi":"10.1109/tmi.2025.3625254","DOIUrl":"https://doi.org/10.1109/tmi.2025.3625254","url":null,"abstract":"Recently, numerous deep learning models have been proposed for breast cancer diagnosis using multimodal multi-view ultrasound images. However, their performance could be highly affected by overlooking interactions between different modalities and views. Moreover, existing methods struggle to handle cases where certain modalities or views are missing, which limits their clinical applications. To address these issues, we propose a novel Alignment and Imputation Network (AINet) by integrating 1) alignment and imputation pre-training, and 2) hierarchical fusion fine-tuning. Specifically, in the pre-training stage, cross-modal contrastive learning is employed to align features across different modalities, for effectively capturing inter-modal interactions. To simulate missing modality (view) scenarios, we randomly mask out features and then impute them by leveraging inter-modal and inter-view relationships. Following the clinical diagnosis procedure, the subsequent fine-tuning stage further incorporates modality-level and view-level fusion in a hierarchical manner. The proposed AINet is developed and evaluated on three datasets, comprising 15,223 subjects in total. Experimental results demonstrate that AINet significantly outperforms state-of-the-art methods, particularly in handling missing modalities (views). This highlights its robustness and potential for real-world clinical applications.","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"1 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145357891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EviVLM: When Evidential Learning Meets Vision Language Model for Medical Image Segmentation
Qingtao Pan, Zhengrong Li, Guang Yang, Qing Yang, Bing Ji
Pub Date : 2025-10-16 | DOI: 10.1109/tmi.2025.3622492
The disparity between image and text representations, often referred to as the modality gap, remains a significant obstacle for Vision Language Models (VLMs) in medical image segmentation. This gap complicates multi-modal fusion, thereby restricting segmentation performance. To address this challenge, we propose the Evidence-driven Vision Language Model (EviVLM), a novel paradigm that integrates Evidential Learning (EL) into VLMs to systematically measure and mitigate the modality gap for enhanced multi-modal fusion. To drive this paradigm, an Evidence Affinity Map Generator (EAMG) is proposed to collect complementary cross-modal evidence by learning a global cross-modal affinity map, thus refining the modality-specific evidence embeddings. An Evidence Differential Similarity Learning (EDSL) module is further proposed to collect consistent cross-modal evidence by performing Bias-Variance Decomposition on the differential matrix derived from the bidirectional similarity matrices between image and text evidence embeddings. Finally, subjective logic is used to map the collected evidence to opinions, and a Dempster-Shafer theory-based combination rule is introduced for opinion aggregation, thereby quantifying the modality gap and facilitating effective multi-modal integration. Experimental results on three public medical image segmentation datasets validate that the proposed EviVLM achieves state-of-the-art performance. Code is available at: https://github.com/QingtaoPan/EviVLM.
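Opinion aggregation with a Dempster-Shafer combination rule, as referenced above, can be stated compactly for two basic mass assignments over a shared frame of discernment: masses of compatible focal sets are multiplied, conflicting mass K is discarded, and the remainder is renormalized by 1 - K. The function below is a textbook implementation of Dempster's rule, not the paper's full subjective-logic pipeline; the lesion/background example at the end is hypothetical.

    from itertools import product

    def dempster_combine(m1, m2):
        # m1, m2: dicts mapping frozenset focal elements to masses that sum to 1.
        combined, conflict = {}, 0.0
        for (a, wa), (b, wb) in product(m1.items(), m2.items()):
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb              # mass falling on the empty set
        if conflict >= 1.0:
            raise ValueError("Total conflict: the two sources cannot be combined.")
        return {k: v / (1.0 - conflict) for k, v in combined.items()}

    # Hypothetical example: two opinions about whether a pixel is lesion (L) or background (B).
    m_image = {frozenset("L"): 0.6, frozenset("B"): 0.1, frozenset("LB"): 0.3}
    m_text  = {frozenset("L"): 0.5, frozenset("B"): 0.2, frozenset("LB"): 0.3}
    print(dempster_combine(m_image, m_text))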
{"title":"EviVLM: When Evidential Learning Meets Vision Language Model for Medical Image Segmentation.","authors":"Qingtao Pan,Zhengrong Li,Guang Yang,Qing Yang,Bing Ji","doi":"10.1109/tmi.2025.3622492","DOIUrl":"https://doi.org/10.1109/tmi.2025.3622492","url":null,"abstract":"The disparity between image and text representations, often referred to as the modality gap, remains a significant obstacle for Vision Language Models (VLMs) in medical image segmentation. This gap complicates multi-modal fusion, thereby restricting segmentation performance. To address this challenge, we propose Evidence-driven Vision Language Model (EviVLM)-a novel paradigm that integrates Evidential Learning (EL) into VLMs to systematically measure and mitigate the modality gap for enhanced multi-modal fusion. To drive this paradigm, an Evidence Affinity Map Generator (EAMG) is proposed to collect complementary cross-modal evidences by learning a global cross-modal affinity map, thus refining modality-specific evidence embedding. An Evidence Differential Similarity Learning (EDSL) is further proposed to collect consistent cross-modal evidences by performing Bias-Variance Decomposition on differential matrix derived from bidirectional similarity matrices between image and text evidence embeddings. Finally, the subjective logic is used for mapping the collected evidences to opinions, and the Dempster-Shafer's theory based combination rule is introduced for opinion aggregation, thereby quantifying the modality gap and facilitating effective multi-modal integration. Experimental results on three public medical image segmentation datasets validate that the proposed EviVLM can achieve state-of-the-art performance. Code is available at: https://github.com/QingtaoPan/EviVLM.","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"92 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145305637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}