Cone-beam CT (CBCT) is extensively used in medical diagnosis and treatment. Despite its large longitudinal field of view (FoV), the horizontal FoV of CBCT systems is severely limited by the detector width. Certain commercial CBCT systems enlarge the horizontal FoV with an offset-detector method. However, this method requires a 360° full circular scanning trajectory, which increases the scanning time and is incompatible with certain CBCT system designs. In this paper, we investigate the feasibility of large-FoV imaging under short-scan trajectories with an additional X-ray source. A dual-source CBCT geometry is proposed, along with two corresponding image reconstruction algorithms: the first is based on cone-parallel rebinning, and the second employs a modified Parker weighting scheme. Theoretical calculations demonstrate that the proposed geometry achieves a wider horizontal FoV than the 90% detector-offset geometry (radius of 214.83 mm vs. 198.99 mm) with a significantly reduced rotation angle (less than 230° vs. 360°). Experiments demonstrate that the proposed geometry and reconstruction algorithms achieve imaging quality within the FoV comparable to conventional CBCT imaging techniques. The proposed geometry is straightforward to implement, does not substantially increase development expenses, and has the potential to further expand CBCT applications.
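The abstract above mentions a modified Parker weighting scheme for short-scan reconstruction. The paper's modified, dual-source version is not reproduced here; as a reference point, below is a minimal NumPy sketch of the classical Parker weights for a single-source fan-beam short scan over π plus twice the fan half-angle (the function and variable names are illustrative).

```python
import numpy as np

def parker_weights(beta, gamma, fan_half_angle):
    """Classical Parker weights for a fan-beam short scan over [0, pi + 2*fan_half_angle].

    beta  : source angles in radians, shape (B,)
    gamma : in-fan ray angles in radians, shape (G,), within [-fan_half_angle, fan_half_angle]
    Returns a (B, G) weight map in [0, 1]; complementary rays receive weights summing to 1.
    """
    b = beta[:, None]
    g = gamma[None, :]
    d = fan_half_angle
    eps = 1e-12                                    # guard against division by zero at the fan edge
    w = np.ones((beta.size, gamma.size))

    ramp_up = b < 2.0 * (d - g)                    # start of the scan
    ramp_down = b > np.pi - 2.0 * g                # end of the scan
    w_up = np.sin(np.pi / 4.0 * b / np.maximum(d - g, eps)) ** 2
    w_down = np.sin(np.pi / 4.0 * (np.pi + 2.0 * d - b) / np.maximum(d + g, eps)) ** 2
    w[ramp_up] = w_up[ramp_up]
    w[ramp_down] = w_down[ramp_down]
    return np.clip(w, 0.0, 1.0)

# Example: 30 degree half fan angle -> short scan of 180 + 60 = 240 degrees
gamma = np.linspace(-np.pi / 6, np.pi / 6, 256)
beta = np.linspace(0.0, np.pi + np.pi / 3, 400)
weights = parker_weights(beta, gamma, np.pi / 6)
```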
{"title":"Dual-Source CBCT for Large FoV Imaging Under Short-Scan Trajectories","authors":"Tianling Lyu;Xusheng Zhang;Xinyun Zhong;Zhan Wu;Yan Xi;Wei Zhao;Yang Chen;Yuanjing Feng;Wentao Zhu","doi":"10.1109/TMI.2025.3586622","DOIUrl":"10.1109/TMI.2025.3586622","url":null,"abstract":"Cone-beam CT is extensively used in medical diagnosis and treatment. Despite its large longitudinal field of view (FoV), the horizontal FoV of CBCT systems is severely limited due to the detector width. Certain commercial CBCT systems increase the horizontal FoV by employing the offset detector method. However, this method necessitates 360° full circular scanning trajectory which increases the scanning time and is not compatible with specific CBCT system models. In this paper, we investigate the feasibility of large FoV imaging under short scan trajectories with an additional X-ray source. A dual-source CBCT geometry is proposed as well as two corresponding image reconstruction algorithms. The first one is based on cone-parallel rebinning and the subsequent employs a modified Parker weighting scheme. Theoretical calculations demonstrate that the proposed geometry achieves a wider horizontal FoV than the <inline-formula> <tex-math>${90}%$ </tex-math></inline-formula> detector offset geometry (radius of <inline-formula> <tex-math>${214}.{83}textit {mm}$ </tex-math></inline-formula> vs. <inline-formula> <tex-math>${198}.{99}textit {mm}$ </tex-math></inline-formula>) with a significantly reduced rotation angle (less than 230° vs. 360°). As demonstrated by experiments, the proposed geometry and reconstruction algorithms obtain comparable imaging qualities within the FoV to conventional CBCT imaging techniques. Implementing the proposed geometry is straightforward and does not substantially increase development expenses. It possesses the capacity to expand CBCT applications even further.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 12","pages":"5051-5064"},"PeriodicalIF":0.0,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144577997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-25 DOI: 10.1109/TMI.2025.3579214
Jieru Yao;Guangyu Guo;Zhaohui Zheng;Qiang Xie;Longfei Han;Dingwen Zhang;Junwei Han
Nuclei instance segmentation and classification are fundamental and challenging tasks in whole slide imaging (WSI) analysis. Most dense nuclei prediction studies rely heavily on crowd-labelled data on high-resolution digital images, leading to a time-consuming, expertise-demanding paradigm. Recently, Vision-Language Models (VLMs) have been intensively investigated; they learn rich cross-modal correlations from large-scale image-text pairs without tedious annotations. Inspired by this, we build a novel framework, called PromptNu, aiming to infuse abundant nuclei knowledge into the training of the nuclei instance recognition model through vision-language contrastive learning and prompt engineering techniques. Specifically, our approach starts with the creation of multifaceted prompts that integrate comprehensive nuclear knowledge, including visual insights from the GPT-4V model, statistical analyses, and expert insights from the pathology field. Then, we propose a novel prompting methodology that consists of two pivotal vision-language contrastive learning components: Prompting Nuclei Representation Learning (PNuRL) and Prompting Nuclei Dense Prediction (PNuDP), which integrate the expertise embedded in pre-trained VLMs and multifaceted prompts into the feature extraction and prediction processes, respectively. Comprehensive experiments on six datasets with extensive WSI scenarios demonstrate the effectiveness of our method for both nuclei instance segmentation and classification tasks. The code is available at https://github.com/NucleiDet/PromptNu
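The framework above builds on vision-language contrastive learning. As a generic illustration (not the PromptNu loss), the sketch below shows a symmetric image-text contrastive (InfoNCE) objective of the kind used in CLIP-style training; the encoder names in the usage comment and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_feats, text_feats: (B, D) tensors; row i of each is a matched pair.
    """
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature           # (B, B) scaled cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)    # match each image to its own text
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Hypothetical usage with nuclei-patch and prompt embeddings:
# loss = image_text_contrastive_loss(vision_encoder(patches), text_encoder(prompts))
```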
{"title":"Prompting Vision-Language Model for Nuclei Instance Segmentation and Classification","authors":"Jieru Yao;Guangyu Guo;Zhaohui Zheng;Qiang Xie;Longfei Han;Dingwen Zhang;Junwei Han","doi":"10.1109/TMI.2025.3579214","DOIUrl":"10.1109/TMI.2025.3579214","url":null,"abstract":"Nuclei instance segmentation and classification are a fundamental and challenging task in whole slide Imaging (WSI) analysis. Most dense nuclei prediction studies rely heavily on crowd labelled data on high-resolution digital images, leading to a time-consuming and expertise-required paradigm. Recently, Vision-Language Models (VLMs) have been intensively investigated, which learn rich cross-modal correlation from large-scale image-text pairs without tedious annotations. Inspired by this, we build a novel framework, called PromptNu, aiming at infusing abundant nuclei knowledge into the training of the nuclei instance recognition model through vision-language contrastive learning and prompt engineering techniques. Specifically, our approach starts with the creation of multifaceted prompts that integrate comprehensive nuclear knowledge, including visual insights from the GPT-4V model, statistical analyses, and expert insights from the pathology field. Then, we propose a novel prompting methodology that consists of two pivotal vision-language contrastive learning components: the Prompting Nuclei Representation Learning (PNuRL) and the Prompting Nuclei Dense Prediction (PNuDP), which adeptly integrates the expertise embedded in pre-trained VLMs and multifaceted prompts into the feature extraction and prediction process, respectively. Comprehensive experiments on six datasets with extensive WSI scenarios demonstrate the effectiveness of our method for both nuclei instance segmentation and classification tasks. The code is available at <uri>https://github.com/NucleiDet/PromptNu</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4567-4578"},"PeriodicalIF":0.0,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144487884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-25 DOI: 10.1109/TMI.2025.3580659
Zailong Chen;Yingshu Li;Zhanyu Wang;Peng Gao;Johan Barthelemy;Luping Zhou;Lei Wang
Radiology report generation using large language models has recently produced reports with more realistic styles and better language fluency. However, their clinical accuracy remains inadequate. Considering the significant imbalance between clinical phrases and general descriptions in a report, we argue that using an entire report for supervision is problematic, as it fails to emphasize the crucial clinical phrases, which require focused learning. To address this issue, we propose a multi-phased supervision method inspired by curriculum learning, in which models are trained with gradually increasing task complexity. Our approach organizes the learning process into structured phases at different levels of semantic granularity, each building on the previous one to enhance the model. During the first phase, disease labels are used to supervise the model, equipping it with the ability to identify underlying diseases. The second phase uses entity-relation triples to guide the model to describe associated clinical findings. Finally, in the third phase, we introduce conventional whole-report-based supervision to quickly adapt the model for report generation. Throughout the phased training, the model remains the same and consistently operates in generation mode. As experimentally demonstrated, this change in the way of supervision enhances report generation, achieving state-of-the-art performance in both language fluency and clinical accuracy. Our work underscores the importance of training process design in radiology report generation. Our code is available at https://github.com/zailongchen/MultiP-R2Gen
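A minimal sketch of the curriculum-style, phase-by-phase supervision loop described above: the same generation model is trained through an ordered list of phases, each with its own supervision target. The phase structure, loss functions, and dataloader names are placeholders, not the authors' implementation.

```python
import torch

def train_multi_phased(model, optimizer, phases, epochs_per_phase=5):
    """Train one generation model through successive supervision phases.

    phases: list of (dataloader, loss_fn) pairs ordered by increasing semantic
    granularity, e.g. disease labels -> entity-relation triples -> whole reports.
    The same model is reused and updated across all phases.
    """
    for dataloader, loss_fn in phases:
        for _ in range(epochs_per_phase):
            for images, targets in dataloader:
                optimizer.zero_grad()
                outputs = model(images)           # model always runs in generation mode
                loss = loss_fn(outputs, targets)  # phase-specific supervision signal
                loss.backward()
                optimizer.step()
    return model
```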
{"title":"Enhancing Radiology Report Generation via Multi-Phased Supervision","authors":"Zailong Chen;Yingshu Li;Zhanyu Wang;Peng Gao;Johan Barthelemy;Luping Zhou;Lei Wang","doi":"10.1109/TMI.2025.3580659","DOIUrl":"10.1109/TMI.2025.3580659","url":null,"abstract":"Radiology report generation using large language models has recently produced reports with more realistic styles and better language fluency. However, their clinical accuracy remains inadequate. Considering the significant imbalance between clinical phrases and general descriptions in a report, we argue that using an entire report for supervision is problematic as it fails to emphasize the crucial clinical phrases, which require focused learning. To address this issue, we propose a multi-phased supervision method, inspired by the spirit of curriculum learning where models are trained by gradually increasing task complexity. Our approach organizes the learning process into structured phases at different levels of semantical granularity, each building on the previous one to enhance the model. During the first phase, disease labels are used to supervise the model, equipping it with the ability to identify underlying diseases. The second phase progresses to use entity-relation triples to guide the model to describe associated clinical findings. Finally, in the third phase, we introduce conventional whole-report-based supervision to quickly adapt the model for report generation. Throughout the phased training, the model remains the same and consistently operates in the generation mode. As experimentally demonstrated, this proposed change in the way of supervision enhances report generation, achieving state-of-the-art performance in both language fluency and clinical accuracy. Our work underscores the importance of training process design in radiology report generation. Our code is available on <uri>https://github.com/zailongchen/MultiP-R2Gen</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4666-4677"},"PeriodicalIF":0.0,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144487975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-20 DOI: 10.1109/TMI.2025.3581108
Yuyang Du;Kexin Chen;Yue Zhan;Chang Han Low;Mobarakol Islam;Ziyu Guo;Yueming Jin;Guangyong Chen;Pheng Ann Heng
Visual question answering (VQA) plays a vital role in advancing surgical education. However, due to privacy concerns over patient data, training VQA models on previously used data becomes restricted, making it necessary to use exemplar-free continual learning (CL) approaches. Previous CL studies in the surgical field neglected two critical issues: i) significant domain shifts caused by the wide range of surgical procedures collected from various sources, and ii) the data imbalance problem caused by the unequal occurrence of medical instruments or surgical procedures. This paper addresses these challenges with a multimodal large language model (LLM) and an adaptive weight assignment strategy. First, we develop a novel LLM-assisted multi-teacher CL framework (named LMT++), which harnesses the strength of a multimodal LLM as a supplementary teacher. The LLM's strong generalization ability, as well as its good understanding of the surgical domain, helps address the knowledge gap arising from domain shifts and data imbalances. To incorporate the LLM in our CL framework, we further propose an innovative approach to processing the training data, which converts complex LLM embeddings into logit values used within our CL training framework. Moreover, we design an adaptive weight assignment approach that balances the generalization ability of the LLM and the domain expertise of conventional VQA models obtained in earlier training stages of the CL framework. Finally, we create a new surgical VQA dataset for model evaluation. Comprehensive experimental findings on these datasets show that our approach surpasses state-of-the-art CL methods.
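The abstract describes weighting an LLM teacher against conventional VQA teachers. Below is a minimal sketch of generic weighted multi-teacher knowledge distillation, which blends the two teachers' soft targets before distilling into the student; the weighting scalar, temperature, and loss balance are assumptions, not the LMT++ adaptive scheme itself.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, llm_logits, vqa_logits,
                                    labels, alpha=0.5, temperature=2.0, ce_weight=1.0):
    """Weighted multi-teacher distillation: blend soft targets from an LLM teacher
    and a conventional VQA teacher, then add the usual cross-entropy on labels.

    alpha balances the two teachers (it could be produced by an adaptive scheme).
    """
    T = temperature
    soft_teacher = alpha * F.softmax(llm_logits / T, dim=-1) + \
                   (1.0 - alpha) * F.softmax(vqa_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    kd_loss = F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)
    ce_loss = F.cross_entropy(student_logits, labels)
    return kd_loss + ce_weight * ce_loss
```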
{"title":"LMT++: Adaptively Collaborating LLMs With Multi-Specialized Teachers for Continual VQA in Robotic Surgical Videos","authors":"Yuyang Du;Kexin Chen;Yue Zhan;Chang Han Low;Mobarakol Islam;Ziyu Guo;Yueming Jin;Guangyong Chen;Pheng Ann Heng","doi":"10.1109/TMI.2025.3581108","DOIUrl":"10.1109/TMI.2025.3581108","url":null,"abstract":"Visual question answering (VQA) plays a vital role in advancing surgical education. However, due to the privacy concern of patient data, training VQA model with previously used data becomes restricted, making it necessary to use the exemplar-free continual learning (CL) approach. Previous CL studies in the surgical field neglected two critical issues: i) significant domain shifts caused by the wide range of surgical procedures collected from various sources, and ii) the data imbalance problem caused by the unequal occurrence of medical instruments or surgical procedures. This paper addresses these challenges with a multimodal large language model (LLM) and an adaptive weight assignment strategy. First, we developed a novel LLM-assisted multi-teacher CL framework (named LMT++), which could harness the strength of a multimodal LLM as a supplementary teacher. The LLM’s strong generalization ability, as well as its good understanding of the surgical domain, help to address the knowledge gap arising from domain shifts and data imbalances. To incorporate the LLM in our CL framework, we further proposed an innovative approach to process the training data, which involves the conversion of complex LLM embeddings into logits value used within our CL training framework. Moreover, we design an adaptive weight assignment approach that balances the generalization ability of the LLM and the domain expertise of conventional VQA models obtained in previous model training processes within the CL framework. Finally, we created a new surgical VQA dataset for model evaluation. Comprehensive experimental findings on these datasets show that our approach surpasses state-of-the-art CL methods.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4678-4689"},"PeriodicalIF":0.0,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144335331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-20 DOI: 10.1109/TMI.2025.3581605
Seungeun Lee;Seunghwan Lee;Sunghwa Ryu;Ilwoo Lyu
We present a novel learning-based spherical registration method, called SPHARM-Reg, tailored for establishing cortical shape correspondence. SPHARM-Reg aims to reduce warp distortion that can introduce biases in downstream shape analyses. To achieve this, we tackle two critical challenges: (1) joint rigid and non-rigid alignments and (2) rotation-preserving smoothing. Conventional approaches perform rigid alignment only once before a non-rigid alignment. The resulting rotation is potentially sub-optimal, and the subsequent non-rigid alignment may introduce unnecessary distortion. In addition, common velocity encoding schemes on the unit sphere often fail to preserve the rotation component after spatial smoothing of velocity. To address these issues, we propose a diffeomorphic framework that integrates spherical harmonic decomposition of the velocity field with a novel velocity encoding scheme. SPHARM-Reg optimizes harmonic components of the velocity field, enabling joint adjustments for both rigid and non-rigid alignments. Furthermore, the proposed encoding scheme using spherical functions encourages consistent smoothing that preserves the rotation component. In the experiments, we validate SPHARM-Reg on healthy adult datasets. SPHARM-Reg achieves a substantial reduction in warp distortion while maintaining a high level of registration accuracy compared to existing methods. In the clinical analysis, we show that the extent of warp distortion significantly impacts statistical significance.
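SPHARM-Reg optimizes spherical-harmonic components of the velocity field, so smoothing amounts to keeping only low-degree harmonics. The sketch below illustrates that bandlimiting idea on a scalar field sampled on the sphere via a least-squares fit with scipy.special.sph_harm; it is a generic illustration, not the paper's velocity encoding, and the sample points, degree cutoff, and test function are assumptions.

```python
import numpy as np
from scipy.special import sph_harm

def sh_lowpass(values, theta, phi, lmax=8):
    """Low-pass filter scalar samples on the unit sphere by least-squares
    projection onto spherical harmonics up to degree lmax.

    values : (N,) real samples; theta : azimuth in [0, 2*pi); phi : polar angle in [0, pi].
    Returns the band-limited (smoothed) samples at the same points.
    """
    basis = [sph_harm(m, l, theta, phi)            # complex Y_l^m evaluated at the samples
             for l in range(lmax + 1) for m in range(-l, l + 1)]
    A = np.stack(basis, axis=1)                    # (N, (lmax+1)^2) design matrix
    coeffs, *_ = np.linalg.lstsq(A, values.astype(complex), rcond=None)
    return (A @ coeffs).real                       # reconstruction keeps only low degrees

# Example on random points: a degree-3-like pattern plus noise, then smoothed
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 500)
phi = np.arccos(rng.uniform(-1.0, 1.0, 500))       # uniform sampling on the sphere
noisy = np.cos(3.0 * phi) + 0.1 * rng.standard_normal(500)
smooth = sh_lowpass(noisy, theta, phi, lmax=8)
```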
{"title":"SPHARM-Reg: Unsupervised Cortical Surface Registration Using Spherical Harmonics","authors":"Seungeun Lee;Seunghwan Lee;Sunghwa Ryu;Ilwoo Lyu","doi":"10.1109/TMI.2025.3581605","DOIUrl":"10.1109/TMI.2025.3581605","url":null,"abstract":"We present a novel learning-based spherical registration method, called SPHARM-Reg, tailored for establishing cortical shape correspondence. SPHARM-Reg aims to reduce warp distortion that can introduce biases in downstream shape analyses. To achieve this, we tackle two critical challenges: (1) joint rigid and non-rigid alignments and (2) rotation-preserving smoothing. Conventional approaches perform rigid alignment only once before a non-rigid alignment. The resulting rotation is potentially sub-optimal, and the subsequent non-rigid alignment may introduce unnecessary distortion. In addition, common velocity encoding schemes on the unit sphere often fail to preserve the rotation component after spatial smoothing of velocity. To address these issues, we propose a diffeomorphic framework that integrates spherical harmonic decomposition of the velocity field with a novel velocity encoding scheme. SPHARM-Reg optimizes harmonic components of the velocity field, enabling joint adjustments for both rigid and non-rigid alignments. Furthermore, the proposed encoding scheme using spherical functions encourages consistent smoothing that preserves the rotation component. In the experiments, we validate SPHARM-Reg on healthy adult datasets. SPHARM-Reg achieves a substantial reduction in warp distortion while maintaining a high level of registration accuracy compared to existing methods. In the clinical analysis, we show that the extent of warp distortion significantly impacts statistical significance.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4732-4742"},"PeriodicalIF":0.0,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144334897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-19 DOI: 10.1109/TMI.2025.3581200
Nuo Tong;Yuanlin Liu;Yueheng Ding;Tao Wang;Lingnan Hou;Mei Shi;Xiaoyi Hu;Shuiping Gou
Maxillofacial cysts pose significant surgical risks due to their proximity to critical anatomical structures, such as blood vessels and nerves. Precise identification of the safe resection margins is essential for complete lesion removal while minimizing damage to surrounding at-risk tissues, and it relies heavily on accurate segmentation in CT images. However, due to the limited space and complex anatomical structures in the maxillofacial region, along with heterogeneous compositions of bone and soft tissues, accurate segmentation is extremely challenging. Thus, a Progressive Edge Perception and Completion Network (PEPC-Net) is presented in this study, which integrates three novel components: 1) a Progressive Edge Perception Branch, which progressively fuses semantic features from multiple resolution levels in a dual-stream manner, enabling the model to handle the varying forms of maxillofacial cysts at different stages; 2) an Edge Information Completion Module, which captures subtle, differentiated edge features from adjacent layers within the encoding blocks, providing more comprehensive edge information for identifying heterogeneous boundaries; and 3) an Edge-Aware Skip Connection that adaptively fuses multi-scale edge features, preserving detailed edge information to facilitate precise identification of the cyst boundaries. Extensive experiments on clinically collected maxillofacial lesion datasets validate the effectiveness of the proposed PEPC-Net, which achieves a DSC of 88.71% and an ASD of 0.489 mm. Its generalizability is further assessed using an external validation set, which includes a more diverse range of maxillofacial cyst cases and images of varying quality. These experiments highlight the superior performance of PEPC-Net in delineating the polymorphic edges of heterogeneous lesions, which is critical for deciding safe resection margins.
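The reported DSC and ASD are standard segmentation metrics. A minimal sketch of their common definitions follows (generic formulas on binary masks and surface point sets, not the authors' evaluation code; the point-set input convention is an assumption).

```python
import numpy as np
from scipy.spatial import cKDTree

def dice_coefficient(pred, gt):
    """Dice similarity coefficient (DSC) between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def average_surface_distance(pred_pts, gt_pts):
    """Symmetric average surface distance (ASD) between two boundary point sets,
    given as (N, 3) arrays of surface-voxel coordinates in millimetres."""
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]   # nearest-neighbour distances
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]
    return 0.5 * (d_pred_to_gt.mean() + d_gt_to_pred.mean())
```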
{"title":"PEPC-Net: Progressive Edge Perception and Completion Network for Precise Identification of Safe Resection Margins in Maxillofacial Cysts","authors":"Nuo Tong;Yuanlin Liu;Yueheng Ding;Tao Wang;Lingnan Hou;Mei Shi;Xiaoyi Hu;Shuiping Gou","doi":"10.1109/TMI.2025.3581200","DOIUrl":"10.1109/TMI.2025.3581200","url":null,"abstract":"Maxillofacial cysts pose significant surgical risks due to their proximity to critical anatomical structures, such as blood vessels and nerves. Precise identification of the safe resection margins is essential for complete lesion removal while minimizing damage to surrounding at-risk tissues, which highly relies on accurate segmentation in CT images. However, due to the limited space and complex anatomical structures in the maxillofacial region, along with heterogeneous compositions of bone and soft tissues, accurate segmentation is extremely challenging. Thus, a Progressive Edge Perception and Completion Network (PEPC-Net) is presented in this study, which integrates three novel components: 1) Progressive Edge Perception Branch, which progressively fuses semantic features from multiple resolution levels in a dual-stream manner, enabling the model to handle the varying forms of maxillofacial cysts at different stages. 2) Edge Information Completion Module, which captures subtle, differentiated edge features from adjacent layers within the encoding blocks, providing more comprehensive edge information for identifying heterogeneous boundaries. 3) Edge-Aware Skip Connection to adaptively fuse multi-scale edge features, preserving detailed edge information, to facilitate precise identification of the cyst boundaries. Extensive experiments on clinically collected maxillofacial lesion datasets validate the effectiveness of the proposed PEPC-Net, achieving a DSC of 88.71% and an ASD of 0.489mm. It’s generalizability is further assessed using an external validation set, which includes more diverse range of maxillofacial cyst cases and images of varying qualities. These experiments highlight the superior performance of PEPC-Net in delineating the polymorphic edges of heterogeneous lesions, which is critical for safe resection margins decision.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4704-4716"},"PeriodicalIF":0.0,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144328530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-18 DOI: 10.1109/TMI.2025.3580713
Sekeun Kim;Pengfei Jin;Sifan Song;Cheng Chen;Yiwei Li;Hui Ren;Xiang Li;Tianming Liu;Quanzheng Li
Echocardiography is the first-line non-invasive cardiac imaging modality, providing rich spatio-temporal information on cardiac anatomy and physiology. Recently, foundation models trained on extensive and diverse datasets have shown strong performance in various downstream tasks. However, translating foundation models into the medical imaging domain remains challenging due to domain differences between medical and natural images and the lack of diverse patient and disease datasets. In this paper, we introduce EchoFM, a general-purpose vision foundation model for echocardiography trained on a large-scale dataset of over 20 million echocardiographic images from 6,500 patients. To enable effective learning of rich spatio-temporal representations from periodic videos, we propose a novel self-supervised learning framework based on a masked autoencoder with a spatio-temporally consistent masking strategy and periodic-driven contrastive learning. The learned cardiac representations can be readily adapted and fine-tuned for a wide range of downstream tasks, serving as a strong and flexible backbone model. We validate EchoFM through experiments across key downstream tasks in the clinical echocardiography workflow, leveraging public and multi-center internal datasets. EchoFM consistently outperforms SOTA methods, demonstrating superior generalization capabilities and flexibility. The code and checkpoints are available at: https://github.com/SekeunKim/EchoFM.git
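Spatio-temporally consistent masking for a video masked autoencoder is often realized as "tube" masking, where the same spatial patches are hidden in every frame. The sketch below shows that general idea; the patch grid and mask ratio are illustrative, and this is not claimed to be EchoFM's exact strategy.

```python
import numpy as np

def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.75, rng=None):
    """Return a boolean mask of shape (num_frames, grid_h * grid_w) where True
    means 'masked'. The same spatial patches are masked across all frames,
    which keeps the masking pattern temporally consistent."""
    rng = np.random.default_rng() if rng is None else rng
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    frame_mask = np.zeros(num_patches, dtype=bool)
    frame_mask[masked_idx] = True
    return np.tile(frame_mask, (num_frames, 1))     # identical mask repeated per frame

# Example: 16-frame clip, 14x14 patch grid, 75% of patches hidden in every frame
mask = tube_mask(num_frames=16, grid_h=14, grid_w=14, mask_ratio=0.75)
```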
{"title":"EchoFM: Foundation Model for Generalizable Echocardiogram Analysis","authors":"Sekeun Kim;Pengfei Jin;Sifan Song;Cheng Chen;Yiwei Li;Hui Ren;Xiang Li;Tianming Liu;Quanzheng Li","doi":"10.1109/TMI.2025.3580713","DOIUrl":"10.1109/TMI.2025.3580713","url":null,"abstract":"Echocardiography is the first-line non-invasive cardiac imaging modality, providing rich spatio-temporal information on cardiac anatomy and physiology. Recently, foundation model trained on extensive and diverse datasets has shown strong performance in various downstream tasks. However, translating foundation models into the medical imaging domain remains challenging due to domain differences between medical and natural images, the lack of diverse patient and disease datasets. In this paper, we introduce EchoFM, a general-purpose vision foundation model for echocardiography trained on a large-scale dataset of over 20 million echocardiographic images from 6,500 patients. To enable effective learning of rich spatio-temporal representations from periodic videos, we propose a novel self-supervised learning framework based on a masked autoencoder with a spatio-temporal consistent masking strategy and periodic-driven contrastive learning. The learned cardiac representations can be readily adapted and fine-tuned for a wide range of downstream tasks, serving as a strong and flexible backbone model. We validate EchoFM through experiments across key downstream tasks in the clinical echocardiography workflow, leveraging public and multi-center internal datasets. EchoFM consistently outperforms SOTA methods, demonstrating superior generalization capabilities and flexibility. The code and checkpoints are available at: <uri>https://github.com/SekeunKim/EchoFM.git</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 10","pages":"4049-4062"},"PeriodicalIF":0.0,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144319677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coronary artery disease poses a significant global health challenge, often necessitating percutaneous coronary intervention (PCI) with stent implantation. Assessing stent apposition is crucial for preventing and identifying PCI complications that lead to in-stent restenosis. Here we propose a novel three-dimensional (3D) distance-color-coded assessment (DccA) for PCI stent apposition via deep-learning-based 3D multi-object segmentation in intravascular optical coherence tomography (IV-OCT). Our proposed 3D DccA accurately segments 3D vessel lumens and stents in IV-OCT images using a hybrid-dimensional spatial matching network and dual-layer training with style transfer. It quantifies and maps stent-lumen distances into a 3D color space, achieving a 3D visual assessment of PCI stent apposition. With over 95% segmentation precision for both stent struts and the lumen, together with 3D color visualization, the proposed 3D DccA improves the clinical evaluation of PCI stent deployment and facilitates personalized treatment planning.
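A minimal sketch of the distance-to-color idea described above: compute each stent-strut point's nearest distance to the segmented lumen surface and map it linearly to a color for 3D display. The point-set inputs, distance cap, and two-color ramp are assumptions for illustration, not the paper's color space.

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_color_code(stent_pts, lumen_pts, max_dist_mm=1.0):
    """For each stent-strut point (N, 3), compute the nearest distance to the
    lumen-surface point set (M, 3) and map it linearly to an RGB color
    (blue = well apposed, red = large stent-lumen distance)."""
    dists, _ = cKDTree(lumen_pts).query(stent_pts)        # nearest-neighbour distance, mm
    t = np.clip(dists / max_dist_mm, 0.0, 1.0)[:, None]   # normalize to [0, 1]
    blue, red = np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 0.0])
    colors = (1.0 - t) * blue + t * red                   # (N, 3) RGB per strut point
    return dists, colors
```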
{"title":"3D Distance-Color-Coded Assessment of PCI Stent Apposition via Deep-Learning-Based Three-Dimensional Multi-Object Segmentation","authors":"Xiaoyang Qin;Hao Huang;Shuaichen Lin;Xinhao Zeng;Kaizhi Cao;Renxiong Wu;Yuming Huang;Junqing Yang;Yong Liu;Gang Li;Guangming Ni","doi":"10.1109/TMI.2025.3580619","DOIUrl":"10.1109/TMI.2025.3580619","url":null,"abstract":"Coronary artery disease poses a significant global health challenge, often necessitating percutaneous coronary intervention (PCI) with stent implantation. Assessing stent apposition is crucial for preventing and identifying PCI complications leading to in-stent restenosis. Here we propose a novel three-dimensional (3D) distancecolor-coded assessment (DccA) for PCI stent apposition via deep-learning-based 3D multi-object segmentation in intravascular optical coherence tomography (IV-OCT). Our proposed 3D DccA accurately segments 3D vessel lumens and stents in IV-OCT images, using a hybrid-dimensional spatial matching network and dual-layer training with style transfer. It quantifies and maps stent-lumen distances into a 3D color space, achieving a 3D visual assessment of PCI stent apposition. Achieving over 95% segmentation precision for both stent struts and the lumen and having 3D color visualization, our proposed 3D DccA improves the clinical evaluation of PCI stent deployment and facilitates personalized treatment planning.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4717-4731"},"PeriodicalIF":0.0,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144311304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-17 DOI: 10.1109/TMI.2025.3580383
Raymond Fang;Pengpeng Zhang;Tingwei Zhang;Zihang Yan;Daniel Kim;Edison Sun;Roman Kuranov;Junghun Kweon;Alex S. Huang;Hao F. Zhang
Imaging complex, non-planar anatomies with optical coherence tomography (OCT) is limited by the optical field of view (FOV) of a single volumetric acquisition. Combining linear mechanical translation with OCT extends the FOV but is inflexible for imaging non-planar anatomies. We report robotic OCT to fill this gap. To address the challenges in volumetric reconstruction associated with robotic movement accuracy being two orders of magnitude worse than the OCT imaging resolution, we developed a volumetric montaging algorithm. To test the robotic OCT, we imaged the entire circumferential aqueous humor outflow pathway, whose imaging has the potential to customize glaucoma surgeries but is typically constrained by the FOV in mice in vivo. We acquired volumetric OCT data at different robotic poses and reconstructed the entire anterior segment of the eye. From the segmented Schlemm’s canal volume, we showed its circumferentially heterogeneous morphology; we also revealed a segmental nature in the circumferential distribution of collector channels, with spatial features as small as a few micrometers.
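Point-cloud-based montaging ultimately relies on estimating rigid transforms between overlapping volumes. Below is a minimal sketch of the closed-form Kabsch/SVD least-squares rigid alignment between corresponding point sets, shown as a generic building block rather than the authors' montaging algorithm; the example correspondences and noise level are assumptions.

```python
import numpy as np

def rigid_align(src, dst):
    """Closed-form (Kabsch/SVD) least-squares rigid transform mapping src -> dst.

    src, dst : (N, 3) arrays of corresponding points from two overlapping volumes.
    Returns (R, t) such that dst is approximately src @ R.T + t.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)                           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

# Example: recover a known rotation/translation from noisy correspondences
rng = np.random.default_rng(1)
src = rng.standard_normal((200, 3))
theta = np.deg2rad(10.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.5, -0.2, 1.0]) + 0.01 * rng.standard_normal((200, 3))
R_est, t_est = rigid_align(src, dst)
```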
{"title":"Robotic Optical Coherence Tomography With Expanded Three-Dimensional Field-of-View Using Point-Cloud-Based Volumetric Montaging","authors":"Raymond Fang;Pengpeng Zhang;Tingwei Zhang;Zihang Yan;Daniel Kim;Edison Sun;Roman Kuranov;Junghun Kweon;Alex S. Huang;Hao F. Zhang","doi":"10.1109/TMI.2025.3580383","DOIUrl":"10.1109/TMI.2025.3580383","url":null,"abstract":"Imaging complex, non-planar anatomies with optical coherence tomography (OCT) is limited by the optical field of view (FOV) in a single volumetric acquisition. Combining linear mechanical translation with OCT extends the FOV but suffers from inflexibility in imaging non-planar anatomies. We report the robotic OCT to fill this gap. To address challenges in volumetric reconstruction associated with the robotic movement accuracy being two orders of magnitudes worse than OCT imaging resolution, we developed a volumetric montaging algorithm. To test the robotic OCT, we imaged the entire circumferential aqueous humor outflow pathway, whose imaging has the potential to customize glaucoma surgeries but is typically constrained by the FOV in mice in vivo. We acquired volumetric OCT data at different robotic poses and reconstructed the entire anterior segment of the eye. From the segmented Schlemm’s canal volume, we showed its circumferentially heterogeneous morphology; we also revealed a segmental nature in the circumferential distribution of collector channels with spatial features as small as a few micrometers.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4639-4651"},"PeriodicalIF":0.0,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144311300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Medical Visual Question Answering (Med-VQA) aims to answer questions regarding the content of medical images, crucial for enhancing diagnostics and education in healthcare. However, progress in this field is hindered by data scarcity due to the resource-intensive nature of medical data annotation. While existing Med-VQA approaches often rely on pre-training to mitigate this issue, bridging the semantic gap between pre-trained models and specific tasks remains a significant challenge. This paper presents the Dynamic Semantic-Adaptive Prompting (DSAP) framework, leveraging prompt learning to enhance model performance in Med-VQA. To this end, we introduce two prompting strategies: Semantic Alignment Prompting (SAP) and Dynamic Question-Aware Prompting (DQAP). SAP prompts multi-modal inputs during fine-tuning, reducing the semantic gap by aligning model outputs with domain-specific contexts. Simultaneously, DQAP enhances answer selection by leveraging grammatical relationships between questions and answers, thereby improving accuracy and relevance. The DSAP framework was pre-trained on three datasets—ROCO, MedICaT, and MIMIC-CXR—and comprehensively evaluated against 15 existing Med-VQA models on three public datasets: VQA-RAD, SLAKE, and PathVQA. Our results demonstrate a substantial performance improvement, with DSAP achieving a 1.9% enhancement in average results across benchmarks. These findings underscore DSAP’s effectiveness in addressing critical challenges in Med-VQA and suggest promising avenues for future developments in medical AI.
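Prompt learning of the kind discussed above is commonly realized by prepending a small set of learnable embeddings to a frozen backbone's input tokens. The sketch below shows that general pattern; the module name, embedding dimension, prompt count, and the assumption that the wrapped encoder accepts pre-embedded (B, L, D) sequences are illustrative, not the DSAP implementation.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Wrap a frozen transformer encoder and prepend learnable prompt tokens."""

    def __init__(self, encoder, embed_dim=768, num_prompts=8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():           # keep the backbone frozen
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

    def forward(self, token_embeddings):
        # token_embeddings: (B, L, D) embeddings of image patches or question tokens;
        # the wrapped encoder is assumed to accept such embedded sequences directly.
        prompts = self.prompts.expand(token_embeddings.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, token_embeddings], dim=1))
```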
{"title":"Bridging the Semantic Gap in Medical Visual Question Answering With Prompt Learning","authors":"Zilin Lu;Qingjie Zeng;Mengkang Lu;Geng Chen;Yong Xia","doi":"10.1109/TMI.2025.3580561","DOIUrl":"10.1109/TMI.2025.3580561","url":null,"abstract":"Medical Visual Question Answering (Med-VQA) aims to answer questions regarding the content of medical images, crucial for enhancing diagnostics and education in healthcare. However, progress in this field is hindered by data scarcity due to the resource-intensive nature of medical data annotation. While existing Med-VQA approaches often rely on pre-training to mitigate this issue, bridging the semantic gap between pre-trained models and specific tasks remains a significant challenge. This paper presents the Dynamic Semantic-Adaptive Prompting (DSAP) framework, leveraging prompt learning to enhance model performance in Med-VQA. To this end, we introduce two prompting strategies: Semantic Alignment Prompting (SAP) and Dynamic Question-Aware Prompting (DQAP). SAP prompts multi-modal inputs during fine-tuning, reducing the semantic gap by aligning model outputs with domain-specific contexts. Simultaneously, DQAP enhances answer selection by leveraging grammatical relationships between questions and answers, thereby improving accuracy and relevance. The DSAP framework was pre-trained on three datasets—ROCO, MedICaT, and MIMIC-CXR—and comprehensively evaluated against 15 existing Med-VQA models on three public datasets: VQA-RAD, SLAKE, and PathVQA. Our results demonstrate a substantial performance improvement, with DSAP achieving a 1.9% enhancement in average results across benchmarks. These findings underscore DSAP’s effectiveness in addressing critical challenges in Med-VQA and suggest promising avenues for future developments in medical AI.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4605-4616"},"PeriodicalIF":0.0,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144311306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}