Automatic anatomical localization is critical for radiology report generation. While many studies focus on lesion detection and segmentation, anatomical localization (accurately describing lesion positions in radiology reports) has received less attention. Conventional segmentation-based methods are limited to organ-level localization and often fail in severe disease cases due to low segmentation accuracy. To address these limitations, we reformulate anatomical localization as an image-to-text retrieval task. Specifically, we propose a CLIP-based framework that aligns lesion image patches with anatomically descriptive text embeddings in a shared multimodal space. By projecting lesion features into the semantic space and retrieving the most relevant anatomical descriptions in a coarse-to-fine manner, our method achieves fine-grained lesion localization with high accuracy across the entire body. Our main contributions are as follows: (1) hierarchical anatomical retrieval, which organizes 387 locations into a two-level hierarchy and first retrieves among 124 coarse categories to narrow the search space and reduce localization complexity; (2) augmented location descriptions, which integrate domain-specific anatomical knowledge to enhance semantic representation and improve visual-text alignment; and (3) semi-hard negative sample mining, which improves training stability and discriminative learning by avoiding overly similar negative samples that may introduce label noise or semantic ambiguity. We validate our method on two whole-body PET/CT datasets, achieving 84.13% localization accuracy on the internal test set and 80.42% on the external test set, with a per-lesion inference time of 34 ms. The proposed framework also demonstrates superior robustness in complex clinical cases compared to segmentation-based approaches.
{"title":"Hierarchical Contrastive Learning for Precise Whole-Body Anatomical Localization in PET/CT Imaging","authors":"Yaozong Gao;Yiran Shu;Mingyang Yu;Yanbo Chen;Jingyu Liu;Shaonan Zhong;Weifang Zhang;Yiqiang Zhan;Xiang Sean Zhou;Xinlu Wang;Meixin Zhao;Dinggang Shen","doi":"10.1109/TMI.2025.3599197","DOIUrl":"10.1109/TMI.2025.3599197","url":null,"abstract":"Automatic anatomical localization is critical for radiology report generation. While many studies focus on lesion detection and segmentation, anatomical localization—accurately describing lesion positions in radiology reports—has received less attention. Conventional segmentation-based methods are limited to organ-level localization and often fail in severe disease cases due to low segmentation accuracy. To address these limitations, we reformulate anatomical localization as an image-to-text retrieval task. Specifically, we propose a CLIP-based framework that aligns lesion image patches with anatomically descriptive text embeddings in a shared multimodal space. By projecting lesion features into the semantic space and retrieving the most relevant anatomical descriptions in a coarse-to-fine manner, our method achieves fine-grained lesion localization with high accuracy across the entire body. Our main contributions are as follows: (1) hierarchical anatomical retrieval, which organizes 387 locations into a two-level hierarchy, by retrieving from the first level of 124 coarse categories to narrow down the search space and reduce localization complexity; (2) augmented location descriptions, which integrate domain-specific anatomical knowledge for enhancing semantic representation and improving visual—text alignment; and (3) semi-hard negative sample mining, which improves training stability and discriminative learning by avoiding selecting the overly similar negative samples that may introduce label noise or semantic ambiguity. We validate our method on two whole-body PET/CT datasets, achieving an 84.13% localization accuracy on the internal test set and 80.42% on the external test set, with a per-lesion inference time of 34 ms. The proposed framework also demonstrated superior robustness in complex clinical cases compared to segmentation-based approaches.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"45 1","pages":"391-405"},"PeriodicalIF":0.0,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144877630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-18. DOI: 10.1109/TMI.2025.3599937
Domagoj Bošnjak; Gian Marco Melito; Richard Schussnig; Katrin Ellermann; Thomas-Peter Fries
The effects of the aortic geometry on its mechanics and blood flow, and subsequently on aortic pathologies, remain largely unexplored. The main obstacle lies in obtaining patient-specific aorta models, a procedure made extremely difficult by ethics and data availability, segmentation, mesh generation, and all of the accompanying processing steps. In contrast, idealized models are easy to build but do not faithfully represent patient-specific variability. Additionally, a unified aortic parametrization for clinical and engineering use has not yet been achieved. To bridge this gap, we introduce a new set of statistical parameters to generate synthetic models of the aorta. The parameters possess geometric significance and fall within physiological ranges, effectively bridging the disciplines of clinical medicine and engineering. Smoothly blended, realistic representations are recovered with convolution surfaces. These enable high-quality visualization and biological appearance, whereas the structured mesh generation paves the way for numerical simulations. The only requirements of the approach are one patient-specific aorta model and statistical data for the parameter values obtained from the literature. The output of this work is SynthAorta, a dataset of ready-to-use synthetic, physiological aorta models, each containing a centerline, a surface representation, and a structured hexahedral finite element mesh. The meshes are structured and fully consistent between different cases, making them eminently suitable for reduced-order modeling and machine learning approaches.
{"title":"SynthAorta: A 3D Mesh Dataset of Parametrized Physiological Healthy Aortas","authors":"Domagoj Bošnjak;Gian Marco Melito;Richard Schussnig;Katrin Ellermann;Thomas-Peter Fries","doi":"10.1109/TMI.2025.3599937","DOIUrl":"10.1109/TMI.2025.3599937","url":null,"abstract":"The effects of the aortic geometry on its mechanics and blood flow, and subsequently on aortic pathologies, remain largely unexplored. The main obstacle lies in obtaining patient-specific aorta models, an extremely difficult procedure in terms of ethics and availability, segmentation, mesh generation, and all of the accompanying processes. Contrastingly, idealized models are easy to build but do not faithfully represent patient-specific variability. Additionally, a unified aortic parametrization in clinic and engineering has not yet been achieved. To bridge this gap, we introduce a new set of statistical parameters to generate synthetic models of the aorta. The parameters possess geometric significance and fall within physiological ranges, effectively bridging the disciplines of clinical medicine and engineering. Smoothly blended realistic representations are recovered with convolution surfaces. These enable high-quality visualization and biological appearance, whereas the structured mesh generation paves the way for numerical simulations. The only requirement of the approach is one patient-specific aorta model and the statistical data for parameter values obtained from the literature. The output of this work is <italic>SynthAorta</i>, a dataset of ready-to-use synthetic, physiological aorta models, each containing a centerline, surface representation, and a structured hexahedral finite element mesh. The meshes are structured and fully consistent between different cases, making them imminently suitable for reduced order modeling and machine learning approaches.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"45 1","pages":"421-430"},"PeriodicalIF":0.0,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11129067","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144877632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Magnetic resonance imaging (MRI) is a powerful tool in medical diagnostics, yet high-field MRI, despite offering superior image quality, incurs significant costs for procurement, installation, maintenance, and operation, restricting its availability and accessibility, especially in low- and middle-income countries. To address this, our study proposes an unsupervised learning algorithm based on cycle-consistent generative adversarial networks. This framework transforms 0.3T low-field MRI into higher-quality 3T-like images, bypassing the need for paired low/high-field training data. The proposed architecture integrates two novel modules to enhance reconstruction quality: (1) an attention block that dynamically balances high-field-like features with the original low-field input, and (2) an edge block that refines boundary details, providing more accurate structural reconstruction. The proposed generative model is trained on large-scale, unpaired, public datasets and further validated on paired low/high-field acquisitions of three major clinical MRI sequences: T1-weighted, T2-weighted, and fluid-attenuated inversion recovery (FLAIR) imaging. It demonstrates notable improvements in tissue contrast and signal-to-noise ratio while preserving anatomical fidelity. This approach utilizes the rich information in publicly available MRI resources, providing a data-efficient unsupervised alternative that complements supervised methods to enhance the utility of low-field MRI.
{"title":"An Unsupervised Learning Approach for Reconstructing 3T-Like Images From 0.3T MRI Without Paired Training Data","authors":"Huaishui Yang;Shaojun Liu;Yilong Liu;Lingyan Zhang;Shoujin Huang;Jiayu Zheng;Jingzhe Liu;Hua Guo;Ed X. Wu;Mengye Lyu","doi":"10.1109/TMI.2025.3597401","DOIUrl":"10.1109/TMI.2025.3597401","url":null,"abstract":"Magnetic resonance imaging (MRI) is powerful in medical diagnostics, yet high-field MRI, despite offering superior image quality, incurs significant costs for procurement, installation, maintenance, and operation, restricting its availability and accessibility, especially in low- and middle-income countries. Addressing this, our study proposes an unsupervised learning algorithm based on cycle-consistent generative adversarial networks. This framework transforms 0.3T low-field MRI into higher-quality 3T-like images, bypassing the need for paired low/high-field training data. The proposed architecture integrates two novel modules to enhance reconstruction quality: (1) an attention block that dynamically balances high-field-like features with the original low-field input, and (2) an edge block that refines boundary details, providing more accurate structural reconstruction. The proposed generative model is trained on large-scale, unpaired, public datasets, and further validated on paired low/high-field acquisitions of three major clinical MRI sequences: T1-weighted, T2-weighted, and fluid-attenuated inversion recovery (FLAIR) imaging. It demonstrates notable improvements in tissue contrast and signal-to-noise ratio while preserving anatomical fidelity. This approach utilizes rich information from publicly available MRI resources, providing a data-efficient unsupervised alternative that complements supervised methods to enhance the utility of low-field MRI.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 12","pages":"5358-5371"},"PeriodicalIF":0.0,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144819720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised anomaly detection (UAD) methods typically detect anomalies by learning and reconstructing the normative distribution. However, since anomalies constantly invade and affect their surroundings, sub-healthy areas at the junction present structural deformations that can easily be misidentified as anomalies, posing difficulties for UAD methods that solely learn the normative distribution. Multimodal images can help address these challenges, as they provide complementary information about anomalies. Therefore, this paper proposes a novel method for UAD in preoperative multimodal images, called the Erasure Perception Diffusion model (EPDiff). First, the Local Erasure Progressive Training (LEPT) framework is designed to better rebuild sub-healthy structures around anomalies through the diffusion model with a two-phase process. Initially, healthy images are used to capture deviation features labeled as potential anomalies. Then, these anomalies are locally erased in the multimodal images to progressively learn sub-healthy structures, yielding a more detailed reconstruction around anomalies. Second, the Global Structural Perception (GSP) module is developed in the diffusion model to realize global structural representation and correlation within images and between modalities through interactions of high-level semantic information. In addition, a training-free Multimodal Attention Fusion (MAF) module is presented for the weighted fusion of anomaly maps across modalities and for producing binary anomaly outputs. Experimental results show that EPDiff improves the AUPRC and mDice scores by 2% and 3.9% on BraTS2021, and by 5.2% and 4.5% on Shifts, over the state-of-the-art methods, demonstrating the applicability of EPDiff to diverse anomaly diagnosis tasks. The code is available at https://github.com/wjiazheng/EPDiff.
{"title":"EPDiff: Erasure Perception Diffusion Model for Unsupervised Anomaly Detection in Preoperative Multimodal Images","authors":"Jiazheng Wang;Min Liu;Wenting Shen;Renjie Ding;Yaonan Wang;Erik Meijering","doi":"10.1109/TMI.2025.3597545","DOIUrl":"10.1109/TMI.2025.3597545","url":null,"abstract":"Unsupervised anomaly detection (UAD) methods typically detect anomalies by learning and reconstructing the normative distribution. However, since anomalies constantly invade and affect their surroundings, sub-healthy areas in the junction present structural deformations that could be easily misidentified as anomalies, posing difficulties for UAD methods that solely learn the normative distribution. The use of multimodal images can facilitate to address the above challenges, as they can provide complementary information of anomalies. Therefore, this paper propose a novel method for UAD in preoperative multimodal images, called Erasure Perception Diffusion model (EPDiff). First, the Local Erasure Progressive Training (LEPT) framework is designed to better rebuild sub-healthy structures around anomalies through the diffusion model with a two-phase process. Initially, healthy images are used to capture deviation features labeled as potential anomalies. Then, these anomalies are locally erased in multimodal images to progressively learn sub-healthy structures, obtaining a more detailed reconstruction around anomalies. Second, the Global Structural Perception (GSP) module is developed in the diffusion model to realize global structural representation and correlation within images and between modalities through interactions of high-level semantic information. In addition, a training-free module, named Multimodal Attention Fusion (MAF) module, is presented for weighted fusion of anomaly maps between different modalities and obtaining binary anomaly outputs. Experimental results show that EPDiff improves the AUPRC and mDice scores by 2% and 3.9% on BraTS2021, and by 5.2% and 4.5% on Shifts over the state-of-the-art methods, which proves the applicability of EPDiff in diverse anomaly diagnosis. The code is available at <uri>https://github.com/wjiazheng/EPDiff</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"45 1","pages":"379-390"},"PeriodicalIF":0.0,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144819772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-08. DOI: 10.1109/TMI.2025.3597026
Xiaoyu Zhu; Shiyin Li; HongLiang Bi; Lina Guan; Haiyang Liu; Zhaolin Lu
Choroidal thickness variations serve as critical biomarkers for numerous ophthalmic diseases. Accurate segmentation and quantification of the choroid in optical coherence tomography (OCT) images is essential for clinical diagnosis and disease progression monitoring. Because public OCT datasets involving choroidal thickness changes cover only a small number of disease types and no publicly available labeled dataset exists, we constructed the Xuzhou Municipal Hospital (XZMH)-Choroid dataset. This dataset contains annotated OCT images of normal eyes and eight choroid-related diseases. However, segmentation of the choroid in OCT images remains a formidable challenge due to the confounding factors of blurred boundaries, non-uniform texture, and lesions. To overcome these challenges, we propose a mixed attention-guided multiscale feature fusion network (MAMFF-Net). This network integrates a Mixed Attention Encoder (MAE) for enhanced fine-grained feature extraction, a deformable multiscale feature fusion path (DMFFP) for adaptive feature integration across lesion deformations, and a multiscale pyramid layer aggregation (MPLA) module for improved contextual representation learning. In comparative experiments, MAMFF-Net achieved better segmentation performance than other deep learning methods (mDice: 97.44, mIoU: 95.11, mAcc: 97.71). Based on the choroidal segmentation produced by MAMFF-Net, an algorithm for automated choroidal thickness measurement was developed, and the automated measurements approached the level of senior specialists.
{"title":"Automatic Choroid Segmentation and Thickness Measurement Based on Mixed Attention-Guided Multiscale Feature Fusion Network","authors":"Xiaoyu Zhu;Shiyin Li;HongLiang Bi;Lina Guan;Haiyang Liu;Zhaolin Lu","doi":"10.1109/TMI.2025.3597026","DOIUrl":"10.1109/TMI.2025.3597026","url":null,"abstract":"Choroidal thickness variations serve as critical biomarkers for numerous ophthalmic diseases. Accurate segmentation and quantification of the choroid in optical coherence tomography (OCT) images is essential for clinical diagnosis and disease progression monitoring. Due to the small number of disease types in the public OCT dataset involving changes in choroidal thickness and the lack of a publicly available labeled dataset, we constructed the Xuzhou Municipal Hospital (XZMH)-Choroid dataset. This dataset contains annotated OCT images of normal and eight choroid-related diseases. However, segmentation of the choroid in OCT images remains a formidable challenge due to the confounding factors of blurred boundaries, non-uniform texture, and lesions. To overcome these challenges, we proposed a mixed attention-guided multiscale feature fusion network (MAMFF-Net). This network integrates a Mixed Attention Encoder (MAE) for enhanced fine-grained feature extraction, a deformable multiscale feature fusion path (DMFFP) for adaptive feature integration across lesion deformations, and a multiscale pyramid layer aggregation (MPLA) module for improved contextual representation learning. Through comparative experiments with other deep learning methods, we found that the MAMFF-Net model has better segmentation performance than other deep learning methods (mDice: 97.44, mIoU: 95.11, mAcc: 97.71). Based on the choroidal segmentation implemented in MAMFF-Net, an algorithm for automated choroidal thickness measurement was developed, and the automated measurement results approached the level of senior specialists.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"45 1","pages":"350-363"},"PeriodicalIF":0.0,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144802501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised brain lesion segmentation, which focuses on learning normative distributions from images of healthy subjects, is less dependent on lesion-labeled data and thus exhibits better generalization capabilities. A fundamental challenge in learning normative distributions of images lies in the high dimensionality that arises when image pixels are treated as correlated random variables to capture spatial dependence. In this study, we propose a subspace-based deep generative model to learn the posterior normal distributions. Specifically, we use probabilistic subspace models to capture the spatial-intensity and spatial-structure distributions of brain images from healthy subjects. These models capture prior spatial-intensity and spatial-structure variations effectively by treating the subspace coefficients as random variables, with the basis functions being eigen-images and eigen-density functions learned from the training data. These prior distributions are then converted to posterior distributions for a given image, including both the posterior normal and posterior lesion distributions, using the subspace-based generative model and subspace-assisted Bayesian analysis, respectively. Finally, an unsupervised fusion classifier is used to combine the posterior and likelihood features for lesion segmentation. The proposed method has been evaluated on simulated and real lesion data, including tumor, multiple sclerosis, and stroke, demonstrating superior segmentation accuracy and robustness over state-of-the-art methods. Our proposed method holds promise for enhancing unsupervised brain lesion delineation in clinical applications.
{"title":"Unsupervised Brain Lesion Segmentation Using Posterior Distributions Learned by Subspace-Based Generative Model","authors":"Huixiang Zhuang;Yue Guan;Yi Ding;Chang Xu;Zijun Cheng;Yuhao Ma;Ruihao Liu;Ziyu Meng;Li Cao;Yao Li;Zhi-Pei Liang","doi":"10.1109/TMI.2025.3597080","DOIUrl":"10.1109/TMI.2025.3597080","url":null,"abstract":"Unsupervised brain lesion segmentation, focusing on learning normative distributions from images of healthy subjects, are less dependent on lesion-labeled data, thus exhibiting better generalization capabilities. A fundamental challenge in learning normative distributions of images lies in the high dimensionality if image pixels are treated as correlated random variables to capture spatial dependence. In this study, we proposed a subspace-based deep generative model to learn the posterior normal distributions. Specifically, we used probabilistic subspace models to capture spatial-intensity distributions and spatial-structure distributions of brain images from healthy subjects. These models captured prior spatial-intensity and spatial-structure variations effectively by treating the subspace coefficients as random variables with basis functions being the eigen-images and eigen-density functions learned from the training data. These prior distributions were then converted to posterior distributions, including both the posterior normal and posterior lesion distributions for a given image using the subspace-based generative model and subspace-assisted Bayesian analysis, respectively. Finally, an unsupervised fusion classifier was used to combine the posterior and likelihood features for lesion segmentation. The proposed method has been evaluated on simulated and real lesion data, including tumor, multiple sclerosis, and stroke, demonstrating superior segmentation accuracy and robustness over the state-of-the-art methods. Our proposed method holds promise for enhancing unsupervised brain lesion delineation in clinical applications.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"45 1","pages":"364-378"},"PeriodicalIF":0.0,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144802499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computed tomography (CT) is one of the most widely used non-invasive imaging modalities for medical diagnosis. In clinical practice, CT images are usually acquired with large slice thicknesses due to the high cost of memory storage and operation time, resulting in an anisotropic CT volume with much lower inter-slice resolution than in-plane resolution. Since such inconsistent resolution may lead to difficulties in disease diagnosis, deep learning-based volumetric super-resolution methods have been developed to improve inter-slice resolution. Most existing methods conduct single-image super-resolution on the through-plane or synthesize intermediate slices from adjacent slices; however, the anisotropic characteristic of 3D CT volumes has not been well explored. In this paper, we propose a novel cross-view texture transfer approach for CT slice interpolation that fully utilizes the anisotropic nature of 3D CT volumes. Specifically, we design a unique framework that takes high-resolution in-plane texture details as a reference and transfers them to low-resolution through-plane images. To this end, we introduce a multi-reference non-local attention module that extracts meaningful features for reconstructing through-plane high-frequency details from multiple in-plane images. Through extensive experiments, we demonstrate that our method performs significantly better in CT slice interpolation than existing competing methods on public CT datasets, including a real-paired benchmark, verifying the effectiveness of the proposed framework. The source code of this work is available at https://github.com/khuhm/ACVTT.
{"title":"An Anisotropic Cross-View Texture Transfer With Multi-Reference Non-Local Attention for CT Slice Interpolation","authors":"Kwang-Hyun Uhm;Hyunjun Cho;Sung-Hoo Hong;Seung-Won Jung","doi":"10.1109/TMI.2025.3596957","DOIUrl":"10.1109/TMI.2025.3596957","url":null,"abstract":"Computed tomography (CT) is one of the most widely used non-invasive imaging modalities for medical diagnosis. In clinical practice, CT images are usually acquired with large slice thicknesses due to the high cost of memory storage and operation time, resulting in an anisotropic CT volume with much lower inter-slice resolution than in-plane resolution. Since such inconsistent resolution may lead to difficulties in disease diagnosis, deep learning-based volumetric super-resolution methods have been developed to improve inter-slice resolution. Most existing methods conduct single-image super-resolution on the through-plane or synthesize intermediate slices from adjacent slices; however, the anisotropic characteristic of 3D CT volume has not been well explored. In this paper, we propose a novel cross-view texture transfer approach for CT slice interpolation by fully utilizing the anisotropic nature of 3D CT volume. Specifically, we design a unique framework that takes high-resolution in-plane texture details as a reference and transfers them to low-resolution through-plane images. To this end, we introduce a multi-reference non-local attention module that extracts meaningful features for reconstructing through-plane high-frequency details from multiple in-plane images. Through extensive experiments, we demonstrate that our method performs significantly better in CT slice interpolation than existing competing methods on public CT datasets including a real-paired benchmark, verifying the effectiveness of the proposed framework. The source code of this work is available at <uri>https://github.com/khuhm/ACVTT</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"45 1","pages":"336-349"},"PeriodicalIF":0.0,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144802503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-07. DOI: 10.1109/TMI.2025.3596874
Yeganeh Madadi; Hina Raja; Koenraad A. Vermeer; Hans G. Lemij; Xiaoqin Huang; Eunjin Kim; Seunghoon Lee; Gitaek Kwon; Hyunwoo Kim; Jaeyoung Kim; Adrian Galdran; Miguel A. González Ballester; Dan Presil; Kristhian Aguilar; Victor Cavalcante; Celso Carvalho; Waldir Sabino; Mateus Oliveira; Hui Lin; Charilaos Apostolidis; Aggelos K. Katsaggelos; Tomasz Kubrak; Á. Casado-García; J. Heras; M. Ortega; L. Ramos; Philippe Zhang; Yihao Li; Jing Zhang; Weili Jiang; Pierre-Henri Conze; Mathieu Lamard; Gwenole Quellec; Mostafa El Habib Daho; Madukuri Shaurya; Anumeha Varma; Monika Agrawal; Siamak Yousefi
Glaucoma is a major contributor to permanent vision loss. Early diagnosis is crucial for preventing glaucoma-related vision loss, making glaucoma screening essential. A more affordable method of glaucoma screening can be achieved by applying artificial intelligence to evaluate color fundus photographs (CFPs). We present the Justified Referral in AI Glaucoma Screening (JustRAIGS) challenge to further develop these AI algorithms for glaucoma screening and to assess their efficacy. To support this challenge, we have generated a distinctive large-scale dataset containing more than 110,000 meticulously labeled CFPs obtained from approximately 60,000 patients and 500 distinct screening centers in the USA. Our objective is to assess the practicality of creating advanced and dependable AI systems that take a CFP as input and produce the probability of referable glaucoma, as well as outputs justifying the glaucoma decision, by integrating both binary and multi-label classification tasks. This paper presents the evaluation of solutions provided by nine teams, recognizing the team with the highest level of performance. The best sensitivity achieved at a specificity of 95% was 85%, and the best average Hamming loss was 0.13. Additionally, we tested the top three participants' algorithms on an external dataset to validate the performance and generalization of these models. The outcomes of this research can offer valuable insights into the development of intelligent systems for detecting glaucoma. Ultimately, the findings can aid in the early detection and treatment of glaucoma patients, hence decreasing preventable vision impairment and blindness caused by glaucoma.
{"title":"JustRAIGS: Justified Referral in AI Glaucoma Screening Challenge","authors":"Yeganeh Madadi;Hina Raja;Koenraad A. Vermeer;Hans G. Lemij;Xiaoqin Huang;Eunjin Kim;Seunghoon Lee;Gitaek Kwon;Hyunwoo Kim;Jaeyoung Kim;Adrian Galdran;Miguel A. González Ballester;Dan Presil;Kristhian Aguilar;Victor Cavalcante;Celso Carvalho;Waldir Sabino;Mateus Oliveira;Hui Lin;Charilaos Apostolidis;Aggelos K. Katsaggelos;Tomasz Kubrak;Á. Casado-García;J. Heras;M. Ortega;L. Ramos;Philippe Zhang;Yihao Li;Jing Zhang;Weili Jiang;Pierre-Henri Conze;Mathieu Lamard;Gwenole Quellec;Mostafa El Habib Daho;Madukuri Shaurya;Anumeha Varma;Monika Agrawal;Siamak Yousefi","doi":"10.1109/TMI.2025.3596874","DOIUrl":"10.1109/TMI.2025.3596874","url":null,"abstract":"A major contributor to permanent vision loss is glaucoma. Early diagnosis is crucial for preventing vision loss due to glaucoma, making glaucoma screening essential. A more affordable method of glaucoma screening can be achieved by applying artificial intelligence to evaluate color fundus photographs (CFPs). We present the Justified Referral in AI Glaucoma Screening (JustRAIGS) challenge to further develop these AI algorithms for glaucoma screening and to assess their efficacy. To support this challenge, we have generated a distinctive big dataset containing more than 110,000 meticulously labeled CFPs obtained from approximately 60,000 patients and 500 distinct screening centers in the USA. Our objective is to assess the practicality of creating advanced and dependable AI systems that can take a CFP as input and produce the probability of referable glaucoma, as well as outputs for glaucoma justification by integrating both binary and multi-label classification tasks. This paper presents the evaluation of solutions provided by nine teams, recognizing the team with the highest level of performance. The highest achieved score of sensitivity at a specificity level of 95% was 85%, and the highest achieved score of Hamming losses average was 0.13. Additionally, we test the top three participants’ algorithms on an external dataset to validate the performance and generalization of these models. The outcomes of this research can offer valuable insights into the development of intelligent systems for detecting glaucoma. Ultimately, findings can aid in the early detection and treatment of glaucoma patients, hence decreasing preventable vision impairment and blindness caused by glaucoma.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"45 1","pages":"320-335"},"PeriodicalIF":0.0,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11119643","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144796864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Classification of pathological images is the basis for automatic cancer diagnosis. Although deep learning methods have achieved remarkable performance, they rely heavily on labeled data, demanding extensive human annotation effort. In this study, we present a novel human-annotation-free method that leverages pre-trained Vision-Language Models (VLMs). Without human annotation, pseudo-labels of the training set are obtained by utilizing the zero-shot inference capabilities of the VLM; these may contain substantial noise due to the domain gap between the pre-training and target datasets. To address this issue, we introduce VLM-CPL, a novel approach that combines two noisy-label filtering techniques with a semi-supervised learning strategy. Specifically, we first obtain prompt-based pseudo-labels with uncertainty estimation by zero-shot inference with the VLM using multiple augmented views of an input. Then, by leveraging the feature representation ability of the VLM, we obtain feature-based pseudo-labels via sample clustering in the feature space. Prompt-feature consensus is introduced to select reliable samples based on the agreement between the two types of pseudo-labels. We further propose High-confidence Cross Supervision to learn from samples with reliable pseudo-labels and the remaining unlabeled samples. Additionally, we present an innovative open-set prompting strategy that filters irrelevant patches from whole slides to enhance the quality of selected patches. Experimental results on five public pathological image datasets for patch-level and slide-level classification showed that our method substantially outperformed zero-shot classification by VLMs and was superior to existing noisy-label learning methods. The code is publicly available at https://github.com/HiLab-git/VLM-CPL.
{"title":"VLM-CPL: Consensus Pseudo-Labels From Vision-Language Models for Annotation-Free Pathological Image Classification","authors":"Lanfeng Zhong;Zongyao Huang;Yang Liu;Wenjun Liao;Shichuan Zhang;Guotai Wang;Shaoting Zhang","doi":"10.1109/TMI.2025.3595111","DOIUrl":"10.1109/TMI.2025.3595111","url":null,"abstract":"Classification of pathological images is the basis for automatic cancer diagnosis. Despite that deep learning methods have achieved remarkable performance, they heavily rely on labeled data, demanding extensive human annotation efforts. In this study, we present a novel human annotation-free method by leveraging pre-trained Vision-Language Models (VLMs). Without human annotation, pseudo-labels of the training set are obtained by utilizing the zero-shot inference capabilities of VLM, which may contain a lot of noise due to the domain gap between the pre-training and target datasets. To address this issue, we introduce VLM-CPL, a novel approach that contains two noisy label filtering techniques with a semi-supervised learning strategy. Specifically, we first obtain prompt-based pseudo-labels with uncertainty estimation by zero-shot inference with the VLM using multiple augmented views of an input. Then, by leveraging the feature representation ability of VLM, we obtain feature-based pseudo-labels via sample clustering in the feature space. Prompt-feature consensus is introduced to select reliable samples based on the consensus between the two types of pseudo-labels. We further propose High-confidence Cross Supervision by to learn from samples with reliable pseudo-labels and the remaining unlabeled samples. Additionally, we present an innovative open-set prompting strategy that filters irrelevant patches from whole slides to enhance the quality of selected patches. Experimental results on five public pathological image datasets for patch-level and slide-level classification showed that our method substantially outperformed zero-shot classification by VLMs, and was superior to existing noisy label learning methods. The code is publicly available at <uri>https://github.com/HiLab-git/VLM-CPL</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 10","pages":"4023-4036"},"PeriodicalIF":0.0,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144778204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generating high-fidelity dental radiographs is essential for training diagnostic models. Despite the development of numerous generative methods for other medical data, generative approaches in dental radiology remain unexplored. Due to the intricate tooth structures and specialized terminology, these methods often yield ambiguous tooth regions and incorrect dental concepts when applied to dentistry. In this paper, we make the first attempt to investigate diffusion-based teeth X-ray image generation and propose ToothMaker, a novel framework specifically designed for the dental domain. Firstly, to synthesize X-ray images that possess accurate tooth structures and realistic radiological styles simultaneously, we design a control-disentangled fine-tuning (CDFT) strategy. Specifically, we present two separate controllers to handle style and layout control, respectively, and introduce a gradient-based decoupling method that optimizes each using its corresponding disentangled gradients. Secondly, to enhance the model's understanding of dental terminology, we propose a prior-disentangled guidance module (PDGM), enabling precise synthesis of dental concepts. It utilizes a large language model to decompose dental terminology into a series of meta-knowledge elements and performs interactions and refinements through a hypergraph neural network. These elements are then fed into the network to guide the generation of dental concepts. Extensive experiments demonstrate the high fidelity and diversity of the images synthesized by our approach. By incorporating the generated data, we achieve substantial performance improvements on downstream segmentation and visual question answering tasks, indicating that our method can greatly reduce the reliance on manually annotated data. Code will be publicly available at https://github.com/CUHK-AIM-Group/ToothMaker.
{"title":"ToothMaker: Realistic Panoramic Dental Radiograph Generation via Disentangled Control","authors":"Weihao Yu;Xiaoqing Guo;Wuyang Li;Xinyu Liu;Hui Chen;Yixuan Yuan","doi":"10.1109/TMI.2025.3588466","DOIUrl":"10.1109/TMI.2025.3588466","url":null,"abstract":"Generating high-fidelity dental radiographs is essential for training diagnostic models. Despite the development of numerous methods for other medical data, generative approaches in dental radiology remain unexplored. Due to the intricate tooth structures and specialized terminology, these methods often yield ambiguous tooth regions and incorrect dental concepts when applied to dentistry. In this paper, we take the first attempt to investigate diffusion-based teeth X-ray image generation and propose ToothMaker, a novel framework specifically designed for the dental domain. Firstly, to synthesize X-ray images that possess accurate tooth structures and realistic radiological styles simultaneously, we design control-disentangled fine-tuning (CDFT) strategy. Specifically, we present two separate controllers to handle style and layout control respectively, and introduce a gradient-based decoupling method that optimizes each using their corresponding disentangled gradients. Secondly, to enhance model’s understanding of dental terminology, we propose prior-disentangled guidance module (PDGM), enabling precise synthesis of dental concepts. It utilizes large language model to decompose dental terminology into a series of meta-knowledge elements and performs interactions and refinements through hypergraph neural network. These elements are then fed into the network to guide the generation of dental concepts. Extensive experiments demonstrate the high fidelity and diversity of the images synthesized by our approach. By incorporating the generated data, we achieve substantial performance improvements on downstream segmentation and visual question answering tasks, indicating that our method can greatly reduce the reliance on manually annotated data. Code will be public available at <uri>https://github.com/CUHK-AIM-Group/ToothMaker</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 12","pages":"5233-5244"},"PeriodicalIF":0.0,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144720145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}