Skin lesion segmentation is vital for the early detection, diagnosis, and treatment of melanoma, yet it remains challenging due to significant variations in lesion attributes (e.g., color, size, shape), ambiguous boundaries, and noise interference. Recent advancements have focused on capturing contextual information and incorporating boundary priors to handle challenging lesions. However, there has been limited exploration of explicitly analyzing the inherent patterns of skin lesions, a crucial aspect of the knowledge-driven decision-making process used by clinical experts. In this work, we introduce a novel approach called Probabilistic Attribute Learning (PAL), which leverages knowledge of lesion patterns to achieve enhanced performance on challenging lesions. Recognizing that the lesion patterns exhibited in each image can be properly depicted by disentangled attributes, we begin by explicitly estimating the distributions of these attributes as distinct Gaussian distributions, with the mean and variance indicating the most likely pattern of each attribute and its variation. Using Monte Carlo sampling, we iteratively draw multiple samples from these distributions to capture various potential patterns for each attribute. These samples are then merged through an effective attribute fusion technique, resulting in diverse representations that comprehensively depict the lesion class. By performing pixel-class proximity matching between each pixel-wise representation and the diverse class-wise representations, we significantly enhance the model's robustness. Extensive experiments on two public skin lesion datasets and one unified polyp lesion dataset demonstrate the effectiveness and strong generalization ability of our method. Code is available at https://github.com/IsYuchenYuan/PAL
{"title":"PAL: Boosting Skin Lesion Segmentation via Probabilistic Attribute Learning","authors":"Yuchen Yuan;Xi Wang;Jinpeng Li;Guangyong Chen;Pheng-Ann Heng","doi":"10.1109/TMI.2025.3588167","DOIUrl":"10.1109/TMI.2025.3588167","url":null,"abstract":"Skin lesion segmentation is vital for the early detection, diagnosis, and treatment of melanoma, yet it remains challenging due to significant variations in lesion attributes (e.g., color, size, shape), ambiguous boundaries, and noise interference. Recent advancements have focused on capturing contextual information and incorporating boundary priors to handle challenging lesions. However, there has been limited exploration on the explicit analysis of the inherent patterns of skin lesions, a crucial aspect of the knowledge-driven decision-making process used by clinical experts. In this work, we introduce a novel approach called Probabilistic Attribute Learning (PAL), which leverages knowledge of lesion patterns to achieve enhanced performance on challenging lesions. Recognizing that the lesion patterns exhibited in each image can be properly depicted by disentangled attributes, we begin by explicitly estimating the distributions of these attributes as distinct Gaussian distributions, with mean and variance indicating the most likely pattern of that attribute and its variation. Using Monte Carlo Sampling, we iteratively draw multiple samples from these distributions to capture various potential patterns for each attribute. These samples are then merged through an effective attribute fusion technique, resulting in diverse representations that comprehensively depict the lesion class. By performing pixel-class proximity matching between each pixel-wise representation and the diverse class-wise representations, we significantly enhance the model’s robustness. Extensive experiments on two public skin lesion datasets and one unified polyp lesion dataset demonstrate the effectiveness and strong generalization ability of our method. Codes are available at <uri>https://github.com/IsYuchenYuan/PAL</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 12","pages":"5183-5196"},"PeriodicalIF":0.0,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11078393","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144611278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-11. DOI: 10.1109/TMI.2025.3588157
Xiaolong Deng;Huisi Wu
Accurate segmentation of the left ventricle in echocardiography is critical for diagnosing and treating cardiovascular diseases. However, accurate segmentation remains challenging due to the limitations of ultrasound imaging. Although numerous image and video segmentation methods have been proposed, existing methods still fail to effectively solve this task, which is limited by sparse annotations. To address this problem, we propose a novel semi-supervised segmentation framework named NCM-Net for echocardiography. We first propose the neighborhood correlation mining (NCM) module, which sufficiently mines the correlations between query features and their spatiotemporal neighborhoods to resist noise influence. The module also captures cross-scale contextual correlations between pixels spatially to further refine features, thus alleviating the impact of noise on echocardiography segmentation. To further improve segmentation accuracy, we propose unreliable-pixels masked attention (UMA): by masking reliable pixels, it pays extra attention to unreliable pixels to refine the segmentation boundary. Further, we apply cross-frame boundary constraints on the final predictions to optimize their temporal consistency. Through extensive experiments on two publicly available datasets, CAMUS and EchoNet-Dynamic, we demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance and outstanding temporal consistency. Code is available at https://github.com/dengxl0520/NCMNet
{"title":"Echocardiography Video Segmentation via Neighborhood Correlation Mining","authors":"Xiaolong Deng;Huisi Wu","doi":"10.1109/TMI.2025.3588157","DOIUrl":"10.1109/TMI.2025.3588157","url":null,"abstract":"Accurate segmentation of the left ventricle in echocardiography is critical for diagnosing and treating cardiovascular diseases. However, accurate segmentation remains challenging due to the limitations of ultrasound imaging. Although numerous image and video segmentation methods have been proposed, existing methods still fail to effectively solve this task, which is limited by sparsity annotations. To address this problem, we propose a novel semi-supervised segmentation framework named NCM-Net for echocardiography. We first propose the neighborhood correlation mining (NCM) module, which sufficiently mines the correlations between query features and their spatiotemporal neighborhoods to resist noise influence. The module also captures cross-scale contextual correlations between pixels spatially to further refine features, thus alleviating the impact of noise on echocardiography segmentation. To further improve segmentation accuracy, we propose using unreliable-pixels masked attention (UMA). By masking reliable pixels, it pays extra attention to unreliable pixels to refine the boundary of segmentation. Further, we use cross-frame boundary constraints on the final predictions to optimize their temporal consistency. Through extensive experiments on two publicly available datasets, CAMUS and EchoNet-Dynamic, we demonstrate the effectiveness of the proposed, which achieves state-of-the-art performance and outstanding temporal consistency. Codes are available at <uri>https://github.com/dengxl0520/NCMNet</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 12","pages":"5172-5182"},"PeriodicalIF":0.0,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144611122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-10. DOI: 10.1109/TMI.2025.3587636
Jihun Kang;Eui Cheol Jung;Hyun Jung Koo;Dong Hyun Yang;Hojin Ha
In this study, we present enhanced physics-informed neural networks (PINNs) designed to address flow field errors in four-dimensional flow magnetic resonance imaging (4D Flow MRI). Flow field errors, typically occurring in high-velocity regions, lead to inaccuracies in velocity fields and flow rate underestimation. We propose incorporating flow rate constraints to ensure physical consistency across cross-sections. The proposed framework includes optimization strategies to improve convergence, stability, and accuracy: artificial viscosity modeling, projecting conflicting gradients (PCGrad), and Euclidean norm scaling are applied to balance the loss functions during training. The performance was validated using 2D computational fluid dynamics (CFD) with synthetic errors, in-vitro 4D flow MRI mimicking the aortic valve, and in-vivo 4D flow MRI from patients with aortic regurgitation and aortic stenosis. The study demonstrates considerable improvements in correcting flow field errors, denoising, and super-resolution. Notably, the proposed PINNs provide accurate flow rate reconstruction in stenotic and high-velocity regions. This approach extends the applicability of 4D flow MRI by providing reliable hemodynamics in the post-processing stage.
{"title":"Flow-Rate-Constrained Physics-Informed Neural Networks for Flow Field Error Correction in 4-D Flow Magnetic Resonance Imaging","authors":"Jihun Kang;Eui Cheol Jung;Hyun Jung Koo;Dong Hyun Yang;Hojin Ha","doi":"10.1109/TMI.2025.3587636","DOIUrl":"10.1109/TMI.2025.3587636","url":null,"abstract":"In this study, we present enhanced physics-informed neural networks (PINNs), which were designed to address flow field errors in four-dimensional flow magnetic resonance imaging (4D Flow MRI). Flow field errors, typically occurring in high-velocity regions, lead to inaccuracies in velocity fields and flow rate underestimation. We proposed incorporating flow rate constraints to ensure physical consistency across cross-sections. The proposed framework included optimization strategies to improve convergence, stability, and accuracy. Artificial viscosity modeling, projecting conflicting gradients (PCGrad), and Euclidean norm scaling were applied to balance loss functions during training. The performance was validated using 2D computational fluid dynamics (CFD) with synthetic error, in-vitro 4D flow MRI mimicking aortic valve, and in-vivo 4D flow MRI from patients with aortic regurgitation and aortic stenosis. This study demonstrated considerable improvements in correcting flow field errors, denoising, and super-resolution. Notably, the proposed PINNs provided accurate flow rate reconstruction in stenotic and high-velocity regions. This approach extends the applicability of 4D flow MRI by providing reliable hemodynamics in the post-processing stage.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 12","pages":"5155-5171"},"PeriodicalIF":0.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144603140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Metal dental implants may introduce metal artifacts (MA) during the CBCT imaging process, causing significant interference in subsequent diagnosis. In recent years, many deep learning methods for metal artifact reduction (MAR) have been proposed. Due to the huge difference between synthetic and clinical MA, supervised MAR methods may perform poorly in clinical settings, while many existing unsupervised MAR methods trained on clinical data often suffer from incorrect dental morphology. To alleviate these problems, in this paper we propose a new MAR method based on Coupled Diffusion Models (CDM) for clinical dental CBCT images. Specifically, we separately train two diffusion models, one on clinical MA-degraded images and one on clinical clean images, to obtain their respective priors. During the denoising process, the variances of the noise levels are calculated from the MA images and the priors of the diffusion models. We then develop a noise transformation module between the two diffusion models that transforms the noisy MA image into a new initial value for the denoising process. Our designs effectively exploit the inherent transformation between the misaligned MA-degraded images and clean images. Additionally, we introduce an MA-adaptive inference technique to better accommodate the MA degradation in different areas of an MA-degraded image. Experiments on our clinical dataset demonstrate that our CDM outperforms the comparison methods on both objective metrics and visual quality, especially for severe MA degradation. We will publicly release our code.
{"title":"Coupled Diffusion Models for Metal Artifact Reduction of Clinical Dental CBCT Images","authors":"Zhouzhuo Zhang;Juncheng Yan;Yuxuan Shi;Zhiming Cui;Jun Xu;Dinggang Shen","doi":"10.1109/TMI.2025.3587131","DOIUrl":"10.1109/TMI.2025.3587131","url":null,"abstract":"Metal dental implants may introduce metal artifacts (MA) during the CBCT imaging process, causing significant interference in subsequent diagnosis. In recent years, many deep learning methods for metal artifact reduction (MAR) have been proposed. Due to the huge difference between synthetic and clinical MA, supervised learning MAR methods may perform poorly in clinical settings. Many existing unsupervised MAR methods trained on clinical data often suffer from incorrect dental morphology. To alleviate the above problems, in this paper, we propose a new MAR method of Coupled Diffusion Models (CDM) for clinical dental CBCT images. Specifically, we separately train two diffusion models on clinical MA-degraded images and clinical clean images to obtain prior information, respectively. During the denoising process, the variances of noise levels are calculated from MA images and the prior of diffusion models. Then we develop a noise transformation module between the two diffusion models to transform the MA noise image into a new initial value for the denoising process. Our designs effectively exploit the inherent transformation between the misaligned MA-degraded images and clean images. Additionally, we introduce an MA-adaptive inference technique to better accommodate the MA degradation in different areas of an MA-degraded image. Experiments on our clinical dataset demonstrate that our CDM outperforms the comparison methods on both objective metrics and visual quality, especially for severe MA degradation. We will publicly release our code.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 12","pages":"5103-5116"},"PeriodicalIF":0.0,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144578618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cone-beam CT is extensively used in medical diagnosis and treatment. Despite its large longitudinal field of view (FoV), the horizontal FoV of CBCT systems is severely limited by the detector width. Certain commercial CBCT systems increase the horizontal FoV by employing the offset detector method. However, this method necessitates a 360° full circular scanning trajectory, which increases the scanning time and is not compatible with certain CBCT system models. In this paper, we investigate the feasibility of large-FoV imaging under short scan trajectories with an additional X-ray source. A dual-source CBCT geometry is proposed, along with two corresponding image reconstruction algorithms: the first is based on cone-parallel rebinning, and the second employs a modified Parker weighting scheme. Theoretical calculations demonstrate that the proposed geometry achieves a wider horizontal FoV than the 90% detector offset geometry (radius of 214.83 mm vs. 198.99 mm) with a significantly reduced rotation angle (less than 230° vs. 360°). As demonstrated by experiments, the proposed geometry and reconstruction algorithms obtain imaging quality within the FoV comparable to conventional CBCT imaging techniques. Implementing the proposed geometry is straightforward and does not substantially increase development expenses, and it has the potential to expand CBCT applications even further.
{"title":"Dual-Source CBCT for Large FoV Imaging Under Short-Scan Trajectories","authors":"Tianling Lyu;Xusheng Zhang;Xinyun Zhong;Zhan Wu;Yan Xi;Wei Zhao;Yang Chen;Yuanjing Feng;Wentao Zhu","doi":"10.1109/TMI.2025.3586622","DOIUrl":"10.1109/TMI.2025.3586622","url":null,"abstract":"Cone-beam CT is extensively used in medical diagnosis and treatment. Despite its large longitudinal field of view (FoV), the horizontal FoV of CBCT systems is severely limited due to the detector width. Certain commercial CBCT systems increase the horizontal FoV by employing the offset detector method. However, this method necessitates 360° full circular scanning trajectory which increases the scanning time and is not compatible with specific CBCT system models. In this paper, we investigate the feasibility of large FoV imaging under short scan trajectories with an additional X-ray source. A dual-source CBCT geometry is proposed as well as two corresponding image reconstruction algorithms. The first one is based on cone-parallel rebinning and the subsequent employs a modified Parker weighting scheme. Theoretical calculations demonstrate that the proposed geometry achieves a wider horizontal FoV than the <inline-formula> <tex-math>${90}%$ </tex-math></inline-formula> detector offset geometry (radius of <inline-formula> <tex-math>${214}.{83}textit {mm}$ </tex-math></inline-formula> vs. <inline-formula> <tex-math>${198}.{99}textit {mm}$ </tex-math></inline-formula>) with a significantly reduced rotation angle (less than 230° vs. 360°). As demonstrated by experiments, the proposed geometry and reconstruction algorithms obtain comparable imaging qualities within the FoV to conventional CBCT imaging techniques. Implementing the proposed geometry is straightforward and does not substantially increase development expenses. It possesses the capacity to expand CBCT applications even further.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 12","pages":"5051-5064"},"PeriodicalIF":0.0,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144577997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-25. DOI: 10.1109/TMI.2025.3579214
Jieru Yao;Guangyu Guo;Zhaohui Zheng;Qiang Xie;Longfei Han;Dingwen Zhang;Junwei Han
Nuclei instance segmentation and classification are fundamental and challenging tasks in whole slide imaging (WSI) analysis. Most dense nuclei prediction studies rely heavily on crowd-labelled data on high-resolution digital images, leading to a time-consuming paradigm that demands expert annotation. Recently, Vision-Language Models (VLMs) have been intensively investigated, which learn rich cross-modal correlations from large-scale image-text pairs without tedious annotations. Inspired by this, we build a novel framework, called PromptNu, aiming at infusing abundant nuclei knowledge into the training of the nuclei instance recognition model through vision-language contrastive learning and prompt engineering techniques. Specifically, our approach starts with the creation of multifaceted prompts that integrate comprehensive nuclei knowledge, including visual insights from the GPT-4V model, statistical analyses, and expert insights from the pathology field. Then, we propose a novel prompting methodology that consists of two pivotal vision-language contrastive learning components: Prompting Nuclei Representation Learning (PNuRL) and Prompting Nuclei Dense Prediction (PNuDP), which adeptly integrate the expertise embedded in pre-trained VLMs and multifaceted prompts into the feature extraction and prediction processes, respectively. Comprehensive experiments on six datasets with extensive WSI scenarios demonstrate the effectiveness of our method for both nuclei instance segmentation and classification tasks. The code is available at https://github.com/NucleiDet/PromptNu
{"title":"Prompting Vision-Language Model for Nuclei Instance Segmentation and Classification","authors":"Jieru Yao;Guangyu Guo;Zhaohui Zheng;Qiang Xie;Longfei Han;Dingwen Zhang;Junwei Han","doi":"10.1109/TMI.2025.3579214","DOIUrl":"10.1109/TMI.2025.3579214","url":null,"abstract":"Nuclei instance segmentation and classification are a fundamental and challenging task in whole slide Imaging (WSI) analysis. Most dense nuclei prediction studies rely heavily on crowd labelled data on high-resolution digital images, leading to a time-consuming and expertise-required paradigm. Recently, Vision-Language Models (VLMs) have been intensively investigated, which learn rich cross-modal correlation from large-scale image-text pairs without tedious annotations. Inspired by this, we build a novel framework, called PromptNu, aiming at infusing abundant nuclei knowledge into the training of the nuclei instance recognition model through vision-language contrastive learning and prompt engineering techniques. Specifically, our approach starts with the creation of multifaceted prompts that integrate comprehensive nuclear knowledge, including visual insights from the GPT-4V model, statistical analyses, and expert insights from the pathology field. Then, we propose a novel prompting methodology that consists of two pivotal vision-language contrastive learning components: the Prompting Nuclei Representation Learning (PNuRL) and the Prompting Nuclei Dense Prediction (PNuDP), which adeptly integrates the expertise embedded in pre-trained VLMs and multifaceted prompts into the feature extraction and prediction process, respectively. Comprehensive experiments on six datasets with extensive WSI scenarios demonstrate the effectiveness of our method for both nuclei instance segmentation and classification tasks. The code is available at <uri>https://github.com/NucleiDet/PromptNu</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4567-4578"},"PeriodicalIF":0.0,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144487884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-25. DOI: 10.1109/TMI.2025.3580659
Zailong Chen;Yingshu Li;Zhanyu Wang;Peng Gao;Johan Barthelemy;Luping Zhou;Lei Wang
Radiology report generation using large language models has recently produced reports with more realistic styles and better language fluency. However, their clinical accuracy remains inadequate. Considering the significant imbalance between clinical phrases and general descriptions in a report, we argue that using an entire report for supervision is problematic, as it fails to emphasize the crucial clinical phrases that require focused learning. To address this issue, we propose a multi-phased supervision method, inspired by the spirit of curriculum learning, where models are trained by gradually increasing task complexity. Our approach organizes the learning process into structured phases at different levels of semantic granularity, each building on the previous one to enhance the model. During the first phase, disease labels are used to supervise the model, equipping it with the ability to identify underlying diseases. The second phase progresses to using entity-relation triples to guide the model to describe associated clinical findings. Finally, in the third phase, we introduce conventional whole-report-based supervision to quickly adapt the model for report generation. Throughout the phased training, the model remains the same and consistently operates in the generation mode. As experimentally demonstrated, this proposed change in the way of supervision enhances report generation, achieving state-of-the-art performance in both language fluency and clinical accuracy. Our work underscores the importance of training process design in radiology report generation. Our code is available at https://github.com/zailongchen/MultiP-R2Gen
{"title":"Enhancing Radiology Report Generation via Multi-Phased Supervision","authors":"Zailong Chen;Yingshu Li;Zhanyu Wang;Peng Gao;Johan Barthelemy;Luping Zhou;Lei Wang","doi":"10.1109/TMI.2025.3580659","DOIUrl":"10.1109/TMI.2025.3580659","url":null,"abstract":"Radiology report generation using large language models has recently produced reports with more realistic styles and better language fluency. However, their clinical accuracy remains inadequate. Considering the significant imbalance between clinical phrases and general descriptions in a report, we argue that using an entire report for supervision is problematic as it fails to emphasize the crucial clinical phrases, which require focused learning. To address this issue, we propose a multi-phased supervision method, inspired by the spirit of curriculum learning where models are trained by gradually increasing task complexity. Our approach organizes the learning process into structured phases at different levels of semantical granularity, each building on the previous one to enhance the model. During the first phase, disease labels are used to supervise the model, equipping it with the ability to identify underlying diseases. The second phase progresses to use entity-relation triples to guide the model to describe associated clinical findings. Finally, in the third phase, we introduce conventional whole-report-based supervision to quickly adapt the model for report generation. Throughout the phased training, the model remains the same and consistently operates in the generation mode. As experimentally demonstrated, this proposed change in the way of supervision enhances report generation, achieving state-of-the-art performance in both language fluency and clinical accuracy. Our work underscores the importance of training process design in radiology report generation. Our code is available on <uri>https://github.com/zailongchen/MultiP-R2Gen</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4666-4677"},"PeriodicalIF":0.0,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144487975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-20. DOI: 10.1109/TMI.2025.3581108
Yuyang Du;Kexin Chen;Yue Zhan;Chang Han Low;Mobarakol Islam;Ziyu Guo;Yueming Jin;Guangyong Chen;Pheng Ann Heng
Visual question answering (VQA) plays a vital role in advancing surgical education. However, due to privacy concerns about patient data, training a VQA model with previously used data becomes restricted, making it necessary to use an exemplar-free continual learning (CL) approach. Previous CL studies in the surgical field neglected two critical issues: i) significant domain shifts caused by the wide range of surgical procedures collected from various sources, and ii) the data imbalance problem caused by the unequal occurrence of medical instruments or surgical procedures. This paper addresses these challenges with a multimodal large language model (LLM) and an adaptive weight assignment strategy. First, we develop a novel LLM-assisted multi-teacher CL framework (named LMT++), which harnesses the strength of a multimodal LLM as a supplementary teacher. The LLM's strong generalization ability, together with its good understanding of the surgical domain, helps to address the knowledge gap arising from domain shifts and data imbalances. To incorporate the LLM in our CL framework, we further propose an innovative approach to process the training data, which involves converting complex LLM embeddings into logit values used within our CL training framework. Moreover, we design an adaptive weight assignment approach that balances the generalization ability of the LLM and the domain expertise of conventional VQA models obtained in previous model training processes within the CL framework. Finally, we create a new surgical VQA dataset for model evaluation. Comprehensive experimental findings on these datasets show that our approach surpasses state-of-the-art CL methods.
{"title":"LMT++: Adaptively Collaborating LLMs With Multi-Specialized Teachers for Continual VQA in Robotic Surgical Videos","authors":"Yuyang Du;Kexin Chen;Yue Zhan;Chang Han Low;Mobarakol Islam;Ziyu Guo;Yueming Jin;Guangyong Chen;Pheng Ann Heng","doi":"10.1109/TMI.2025.3581108","DOIUrl":"10.1109/TMI.2025.3581108","url":null,"abstract":"Visual question answering (VQA) plays a vital role in advancing surgical education. However, due to the privacy concern of patient data, training VQA model with previously used data becomes restricted, making it necessary to use the exemplar-free continual learning (CL) approach. Previous CL studies in the surgical field neglected two critical issues: i) significant domain shifts caused by the wide range of surgical procedures collected from various sources, and ii) the data imbalance problem caused by the unequal occurrence of medical instruments or surgical procedures. This paper addresses these challenges with a multimodal large language model (LLM) and an adaptive weight assignment strategy. First, we developed a novel LLM-assisted multi-teacher CL framework (named LMT++), which could harness the strength of a multimodal LLM as a supplementary teacher. The LLM’s strong generalization ability, as well as its good understanding of the surgical domain, help to address the knowledge gap arising from domain shifts and data imbalances. To incorporate the LLM in our CL framework, we further proposed an innovative approach to process the training data, which involves the conversion of complex LLM embeddings into logits value used within our CL training framework. Moreover, we design an adaptive weight assignment approach that balances the generalization ability of the LLM and the domain expertise of conventional VQA models obtained in previous model training processes within the CL framework. Finally, we created a new surgical VQA dataset for model evaluation. Comprehensive experimental findings on these datasets show that our approach surpasses state-of-the-art CL methods.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4678-4689"},"PeriodicalIF":0.0,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144335331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-20. DOI: 10.1109/TMI.2025.3581605
Seungeun Lee;Seunghwan Lee;Sunghwa Ryu;Ilwoo Lyu
We present a novel learning-based spherical registration method, called SPHARM-Reg, tailored for establishing cortical shape correspondence. SPHARM-Reg aims to reduce warp distortion that can introduce biases in downstream shape analyses. To achieve this, we tackle two critical challenges: (1) joint rigid and non-rigid alignments and (2) rotation-preserving smoothing. Conventional approaches perform rigid alignment only once before a non-rigid alignment. The resulting rotation is potentially sub-optimal, and the subsequent non-rigid alignment may introduce unnecessary distortion. In addition, common velocity encoding schemes on the unit sphere often fail to preserve the rotation component after spatial smoothing of velocity. To address these issues, we propose a diffeomorphic framework that integrates spherical harmonic decomposition of the velocity field with a novel velocity encoding scheme. SPHARM-Reg optimizes harmonic components of the velocity field, enabling joint adjustments for both rigid and non-rigid alignments. Furthermore, the proposed encoding scheme using spherical functions encourages consistent smoothing that preserves the rotation component. In the experiments, we validate SPHARM-Reg on healthy adult datasets. SPHARM-Reg achieves a substantial reduction in warp distortion while maintaining a high level of registration accuracy compared to existing methods. In the clinical analysis, we show that the extent of warp distortion significantly impacts statistical significance.
{"title":"SPHARM-Reg: Unsupervised Cortical Surface Registration Using Spherical Harmonics","authors":"Seungeun Lee;Seunghwan Lee;Sunghwa Ryu;Ilwoo Lyu","doi":"10.1109/TMI.2025.3581605","DOIUrl":"10.1109/TMI.2025.3581605","url":null,"abstract":"We present a novel learning-based spherical registration method, called SPHARM-Reg, tailored for establishing cortical shape correspondence. SPHARM-Reg aims to reduce warp distortion that can introduce biases in downstream shape analyses. To achieve this, we tackle two critical challenges: (1) joint rigid and non-rigid alignments and (2) rotation-preserving smoothing. Conventional approaches perform rigid alignment only once before a non-rigid alignment. The resulting rotation is potentially sub-optimal, and the subsequent non-rigid alignment may introduce unnecessary distortion. In addition, common velocity encoding schemes on the unit sphere often fail to preserve the rotation component after spatial smoothing of velocity. To address these issues, we propose a diffeomorphic framework that integrates spherical harmonic decomposition of the velocity field with a novel velocity encoding scheme. SPHARM-Reg optimizes harmonic components of the velocity field, enabling joint adjustments for both rigid and non-rigid alignments. Furthermore, the proposed encoding scheme using spherical functions encourages consistent smoothing that preserves the rotation component. In the experiments, we validate SPHARM-Reg on healthy adult datasets. SPHARM-Reg achieves a substantial reduction in warp distortion while maintaining a high level of registration accuracy compared to existing methods. In the clinical analysis, we show that the extent of warp distortion significantly impacts statistical significance.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4732-4742"},"PeriodicalIF":0.0,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144334897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-19. DOI: 10.1109/TMI.2025.3581200
Nuo Tong;Yuanlin Liu;Yueheng Ding;Tao Wang;Lingnan Hou;Mei Shi;Xiaoyi Hu;Shuiping Gou
Maxillofacial cysts pose significant surgical risks due to their proximity to critical anatomical structures, such as blood vessels and nerves. Precise identification of the safe resection margins is essential for complete lesion removal while minimizing damage to surrounding at-risk tissues, which relies heavily on accurate segmentation in CT images. However, due to the limited space and complex anatomical structures in the maxillofacial region, along with heterogeneous compositions of bone and soft tissues, accurate segmentation is extremely challenging. Thus, a Progressive Edge Perception and Completion Network (PEPC-Net) is presented in this study, which integrates three novel components: 1) a Progressive Edge Perception Branch, which progressively fuses semantic features from multiple resolution levels in a dual-stream manner, enabling the model to handle the varying forms of maxillofacial cysts at different stages; 2) an Edge Information Completion Module, which captures subtle, differentiated edge features from adjacent layers within the encoding blocks, providing more comprehensive edge information for identifying heterogeneous boundaries; and 3) an Edge-Aware Skip Connection that adaptively fuses multi-scale edge features, preserving detailed edge information to facilitate precise identification of cyst boundaries. Extensive experiments on clinically collected maxillofacial lesion datasets validate the effectiveness of the proposed PEPC-Net, achieving a DSC of 88.71% and an ASD of 0.489 mm. Its generalizability is further assessed using an external validation set, which includes a more diverse range of maxillofacial cyst cases and images of varying quality. These experiments highlight the superior performance of PEPC-Net in delineating the polymorphic edges of heterogeneous lesions, which is critical for deciding safe resection margins.
{"title":"PEPC-Net: Progressive Edge Perception and Completion Network for Precise Identification of Safe Resection Margins in Maxillofacial Cysts","authors":"Nuo Tong;Yuanlin Liu;Yueheng Ding;Tao Wang;Lingnan Hou;Mei Shi;Xiaoyi Hu;Shuiping Gou","doi":"10.1109/TMI.2025.3581200","DOIUrl":"10.1109/TMI.2025.3581200","url":null,"abstract":"Maxillofacial cysts pose significant surgical risks due to their proximity to critical anatomical structures, such as blood vessels and nerves. Precise identification of the safe resection margins is essential for complete lesion removal while minimizing damage to surrounding at-risk tissues, which highly relies on accurate segmentation in CT images. However, due to the limited space and complex anatomical structures in the maxillofacial region, along with heterogeneous compositions of bone and soft tissues, accurate segmentation is extremely challenging. Thus, a Progressive Edge Perception and Completion Network (PEPC-Net) is presented in this study, which integrates three novel components: 1) Progressive Edge Perception Branch, which progressively fuses semantic features from multiple resolution levels in a dual-stream manner, enabling the model to handle the varying forms of maxillofacial cysts at different stages. 2) Edge Information Completion Module, which captures subtle, differentiated edge features from adjacent layers within the encoding blocks, providing more comprehensive edge information for identifying heterogeneous boundaries. 3) Edge-Aware Skip Connection to adaptively fuse multi-scale edge features, preserving detailed edge information, to facilitate precise identification of the cyst boundaries. Extensive experiments on clinically collected maxillofacial lesion datasets validate the effectiveness of the proposed PEPC-Net, achieving a DSC of 88.71% and an ASD of 0.489mm. It’s generalizability is further assessed using an external validation set, which includes more diverse range of maxillofacial cyst cases and images of varying qualities. These experiments highlight the superior performance of PEPC-Net in delineating the polymorphic edges of heterogeneous lesions, which is critical for safe resection margins decision.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4704-4716"},"PeriodicalIF":0.0,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144328530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}