Pub Date: 2025-06-03. DOI: 10.1109/TMI.2025.3576163
Qixiang Zhang;Haonan Wang;Xiaomeng Li
Semi-supervised medical image segmentation (SSMIS) has emerged as a promising solution to the challenge of time-consuming manual labeling in the medical field. However, in practical scenarios there are often domain variations within the datasets, leading to derivative scenarios such as semi-supervised medical domain generalization (Semi-MDG) and unsupervised medical domain adaptation (UMDA). In this paper, we aim to develop a generic framework that masters all three tasks. We identify a critical challenge shared across the three scenarios: the explicit semantic knowledge needed for segmentation performance and the rich domain knowledge needed for generalizability exist exclusively in the labeled set and the unlabeled set, respectively. This discrepancy hinders existing methods from effectively comprehending both types of knowledge under semi-supervised settings. To tackle this challenge, we develop a Semantic & Domain Knowledge Messenger (S&D Messenger) that facilitates direct knowledge delivery between the labeled and unlabeled sets, thus allowing the model to comprehend both of them in each individual learning flow. Equipped with our S&D Messenger, a naive pseudo-labeling method achieves substantial improvements on ten benchmark datasets for the SSMIS (+7.5%), UMDA (+5.6%), and Semi-MDG (+1.14%) tasks, compared with state-of-the-art methods designed for specific tasks.
{"title":"S&D Messenger: Exchanging Semantic and Domain Knowledge for Generic Semi-Supervised Medical Image Segmentation","authors":"Qixiang Zhang;Haonan Wang;Xiaomeng Li","doi":"10.1109/TMI.2025.3576163","DOIUrl":"10.1109/TMI.2025.3576163","url":null,"abstract":"Semi-supervised medical image segmentation (SSMIS) has emerged as a promising solution to tackle the challenges of time-consuming manual labeling in the medical field. However, in practical scenarios, there are often domain variations within the datasets, leading to derivative scenarios like semi-supervised medical domain generalization (Semi-MDG) and unsupervised medical domain adaptation (UMDA). In this paper, we aim to develop a generic framework that masters all three tasks. We notice a critical shared challenge across three scenarios: the explicit semantic knowledge for segmentation performance and rich domain knowledge for generalizability exclusively exist in the labeled set and unlabeled set respectively. Such discrepancy hinders existing methods from effectively comprehending both types of knowledge under semi-supervised settings. To tackle this challenge, we develop a Semantic & Domain Knowledge Messenger (S&D Messenger) which facilitates direct knowledge delivery between the labeled and unlabeled set, and thus allowing the model to comprehend both of them in each individual learning flow. Equipped with our S&D Messenger, a naive pseudo-labeling method can achieve huge improvement on ten benchmark datasets for SSMIS (+7.5%), UMDA (+5.6%), and Semi-MDG tasks (+1.14%), compared with state-of-the-art methods designed for specific tasks.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4487-4498"},"PeriodicalIF":0.0,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144210788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-02. DOI: 10.1109/TMI.2025.3575402
Wenqiang Tang;Zhouwang Yang
Deep learning-based disease grading technologies facilitate timely medical intervention due to their high efficiency and accuracy. Recent advancements have enhanced grading performance by incorporating the ordinal relationships of disease labels. However, existing methods often assume the same probability distribution of disease labels across instances within the same category, overlooking variations in label distributions. Additionally, the hyperparameters of these distributions are typically determined empirically, which may not accurately reflect the true distribution. To address these limitations, we propose a disease grading network utilizing a sample-aware asymmetric Gaussian label distribution, termed DGN-AGLD. This approach includes a variance predictor designed to learn and predict the parameters that control the asymmetry of the Gaussian distribution, enabling distinct label distributions within the same category. This module can be seamlessly integrated into standard deep learning networks. Experimental results on four disease datasets validate the effectiveness and superiority of the proposed method, particularly on the IDRiD dataset, where it achieves a diabetic retinopathy grading accuracy of 77.67%. Furthermore, our method extends to joint disease grading tasks, yielding superior results and demonstrating significant generalization capabilities. Visual analysis indicates that our method more accurately captures the trend of disease progression by leveraging the asymmetry in label distribution. Our code is publicly available at https://github.com/ahtwq/AGNet.
{"title":"Disease-Grading Networks With Asymmetric Gaussian Distribution for Medical Imaging","authors":"Wenqiang Tang;Zhouwang Yang","doi":"10.1109/TMI.2025.3575402","DOIUrl":"10.1109/TMI.2025.3575402","url":null,"abstract":"Deep learning-based disease grading technologies facilitate timely medical intervention due to their high efficiency and accuracy. Recent advancements have enhanced grading performance by incorporating the ordinal relationships of disease labels. However, existing methods often assume same probability distributions for disease labels across instances within the same category, overlooking variations in label distributions. Additionally, the hyperparameters of these distributions are typically determined empirically, which may not accurately reflect the true distribution. To address these limitations, we propose a disease grading network utilizing a sample-aware asymmetric Gaussian label distribution, termed DGN-AGLD. This approach includes a variance predictor designed to learn and predict parameters that control the asymmetry of the Gaussian distribution, enabling distinct label distributions within the same category. This module can be seamlessly integrated into standard deep learning networks. Experimental results on four disease datasets validate the effectiveness and superiority of the proposed method, particularly on the IDRiD dataset, where it achieves a diabetic retinopathy accuracy of 77.67%. Furthermore, our method extends to joint disease grading tasks, yielding superior results and demonstrating significant generalization capabilities. Visual analysis indicates that our method more accurately captures the trend of disease progression by leveraging the asymmetry in label distribution. Our code is publicly available on <uri>https://github.com/ahtwq/AGNet</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4457-4472"},"PeriodicalIF":0.0,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144201433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-02. DOI: 10.1109/TMI.2025.3575853
Chenyu Lian;Hong-Yu Zhou;Dongyun Liang;Jing Qin;Liansheng Wang
Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that uses only about 8% of the trainable parameters and less than 1/5 of the computational cost required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks such as retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4 absolute percentage points in text-to-image retrieval accuracy and approximately 6 absolute percentage points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at https://github.com/DopamineLcy/ALTA.
{"title":"Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models","authors":"Chenyu Lian;Hong-Yu Zhou;Dongyun Liang;Jing Qin;Liansheng Wang","doi":"10.1109/TMI.2025.3575853","DOIUrl":"10.1109/TMI.2025.3575853","url":null,"abstract":"Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at <uri>https://github.com/DopamineLcy/ALTA</uri>.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 11","pages":"4499-4510"},"PeriodicalIF":0.0,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144201430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-30. DOI: 10.1109/TMI.2025.3564765
Zihan Cheng;Jintao Guo;Jian Zhang;Lei Qi;Luping Zhou;Yinghuan Shi;Yang Gao
To segment medical images under distribution shifts, domain generalization (DG) has emerged as a promising setting for training models on source domains that can generalize to unseen target domains. Existing DG methods are mainly based on CNN or ViT architectures. Recently, advanced state space models, represented by Mamba, have shown promising results in various supervised medical image segmentation tasks. The success of Mamba is primarily owing to its ability to capture long-range dependencies while keeping linear complexity in the input sequence length, making it a promising alternative to CNNs and ViTs. Inspired by this success, in this paper we explore the potential of the Mamba architecture to address distribution shifts in DG for medical image segmentation. Specifically, we propose a novel Mamba-based framework, Mamba-Sea, incorporating global-to-local sequence augmentation to improve the model’s generalizability under domain shift. Our Mamba-Sea introduces a global augmentation mechanism designed to simulate potential variations in appearance across different sites, aiming to suppress the model’s learning of domain-specific information. At the local level, we propose a sequence-wise augmentation along input sequences, which perturbs the style of tokens within random contiguous sub-sequences by modeling and resampling style statistics associated with domain shifts. To the best of our knowledge, Mamba-Sea is the first work to explore the generalization of Mamba for medical image segmentation, providing an advanced and promising Mamba-based architecture with strong robustness to domain shifts. Remarkably, our proposed method is the first to surpass a Dice coefficient of 90% on the Prostate dataset, exceeding the previous SOTA of 88.61%. The code is available at https://github.com/orange-czh/Mamba-Sea.
"Mamba-Sea: A Mamba-Based Framework With Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation," IEEE Transactions on Medical Imaging, vol. 44, no. 9, pp. 3741-3755.
Pub Date: 2025-04-30. DOI: 10.1109/TMI.2025.3565514
Jiancheng Yang;Rui Shi;Liang Jin;Xiaoyang Huang;Kaiming Kuang;Donglai Wei;Shixuan Gu;Jianying Liu;Pengfei Liu;Zhizhong Chai;Yongjie Xiao;Hao Chen;Liming Xu;Bang Du;Xiangyi Yan;Hao Tang;Adam Alessio;Gregory Holste;Jiapeng Zhang;Xiaoming Wang;Jianye He;Lixuan Che;Hanspeter Pfister;Ming Li;Bingbing Ni
Rib fractures are a common and potentially severe injury that can be challenging and labor-intensive to detect in CT scans. While there have been efforts to address this problem, the lack of large-scale annotated datasets and evaluation benchmarks has hindered the development and validation of deep learning algorithms. To address this issue, the RibFrac Challenge was introduced, providing a benchmark dataset of over 5,000 rib fractures from 660 CT scans, with voxel-level instance mask annotations and diagnosis labels for four clinical categories (buckle, nondisplaced, displaced, or segmental). The challenge includes two tracks: a detection (instance segmentation) track evaluated by an FROC-style metric and a classification track evaluated by an F1-style metric. During the MICCAI 2020 challenge period, 243 results were evaluated, and seven teams were invited to participate in the challenge summary. The analysis revealed that several top rib fracture detection solutions achieved performance comparable to, or even better than, that of human experts. Nevertheless, current rib fracture classification solutions are hardly clinically applicable, which remains an interesting direction for future work. As an active benchmark and research resource, the data and online evaluation of the RibFrac Challenge are available at the challenge website (https://ribfrac.grand-challenge.org/). In addition, we further analyzed the impact of two post-challenge advancements (large-scale pretraining and rib segmentation) based on our internal baseline for rib fracture detection. These findings lay a foundation for future research and development in AI-assisted rib fracture diagnosis.
"Deep Rib Fracture Instance Segmentation and Classification From CT on the RibFrac Challenge," IEEE Transactions on Medical Imaging, vol. 44, no. 8, pp. 3410-3427.
Pub Date: 2025-04-29. DOI: 10.1109/TMI.2025.3564320
Zibo Xu;Qiang Li;Weizhi Nie;Weijie Wang;Anan Liu
Medical Visual Question Answering (MedVQA) aims to answer medical questions according to medical images. However, the complexity of medical data leads to confounders that are difficult to observe, so bias between images and questions is inevitable. Such cross-modal bias makes it challenging to infer medically meaningful answers. In this work, we propose a causal inference framework for the MedVQA task, which effectively eliminates the relative confounding effect between the image and the question to ensure the precision of the question-answering (QA) process. We are the first to introduce a novel causal graph structure that represents the interaction between visual and textual elements, explicitly capturing how different questions influence visual features. During optimization, we apply mutual information to discover spurious correlations and propose a multi-variable resampling front-door adjustment method to eliminate the relative confounding effect, which aims to align features based on their true causal relevance to the question-answering task. In addition, we introduce a prompt strategy that combines multiple prompt forms to improve the model’s ability to understand complex medical data and answer accurately. Extensive experiments on three MedVQA datasets demonstrate that 1) our method significantly improves the accuracy of MedVQA, and 2) our method captures true causal correlations in the face of complex medical data.
{"title":"Structure Causal Models and LLMs Integration in Medical Visual Question Answering","authors":"Zibo Xu;Qiang Li;Weizhi Nie;Weijie Wang;Anan Liu","doi":"10.1109/TMI.2025.3564320","DOIUrl":"10.1109/TMI.2025.3564320","url":null,"abstract":"Medical Visual Question Answering (MedVQA) aims to answer medical questions according to medical images. However, the complexity of medical data leads to confounders that are difficult to observe, so bias between images and questions is inevitable. Such cross-modal bias makes it challenging to infer medically meaningful answers. In this work, we propose a causal inference framework for the MedVQA task, which effectively eliminates the relative confounding effect between the image and the question to ensure the precision of the question-answering (QA) session. We are the first to introduce a novel causal graph structure that represents the interaction between visual and textual elements, explicitly capturing how different questions influence visual features. During optimization, we apply the mutual information to discover spurious correlations and propose a multi-variable resampling front-door adjustment method to eliminate the relative confounding effect, which aims to align features based on their true causal relevance to the question-answering task. In addition, we also introduce a prompt strategy that combines multiple prompt forms to improve the model’s ability to understand complex medical data and answer accurately. Extensive experiments on three MedVQA datasets demonstrate that 1) our method significantly improves the accuracy of MedVQA, and 2) our method achieves true causal correlations in the face of complex medical data.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 8","pages":"3476-3489"},"PeriodicalIF":0.0,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143890058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-29. DOI: 10.1109/TMI.2025.3564976
Minghao Han;Linhao Qu;Dingkang Yang;Xukun Zhang;Xiaoying Wang;Lihua Zhang
Multiple instance learning (MIL) has become a standard paradigm for the weakly supervised classification of whole slide images (WSIs). However, this paradigm relies on a large number of labeled WSIs for training. The lack of training data and the presence of rare diseases pose significant challenges for these methods. Prompt tuning combined with pre-trained vision-language models (VLMs) is an effective solution to the Few-shot Weakly Supervised WSI Classification (FSWC) task. Nevertheless, applying prompt tuning methods designed for natural images to WSIs presents three significant challenges: 1) these methods fail to fully leverage the prior knowledge of the VLM’s text modality; 2) they overlook the essential multi-scale and contextual information in WSIs, leading to suboptimal results; and 3) they lack exploration of instance aggregation methods. To address these problems, we propose a Multi-Scale and Context-focused Prompt Tuning (MSCPT) method for the FSWC task. Specifically, MSCPT employs a frozen large language model to generate pathological visual-language prior knowledge at multiple scales, guiding hierarchical prompt tuning. Additionally, we design a graph prompt tuning module to learn essential contextual information within a WSI, and finally, a non-parametric cross-guided instance aggregation module is introduced to derive WSI-level features. Extensive experiments, visualizations, and interpretability analyses were conducted on five datasets and three downstream tasks using three VLMs, demonstrating the strong performance of our MSCPT. All code has been made publicly available at https://github.com/Hanminghao/MSCPT.
"MSCPT: Few-Shot Whole Slide Image Classification With Multi-Scale and Context-Focused Prompt Tuning," IEEE Transactions on Medical Imaging, vol. 44, no. 9, pp. 3756-3769.
Pub Date: 2025-04-28. DOI: 10.1109/TMI.2025.3565000
Tingxin Hu;Weihang Zhang;Jia Guo;Huiqi Li
Due to the difficulty of collecting multi-label annotations for retinal diseases, fundus images are usually annotated with only one label, while they actually have multiple labels. Given that deep learning requires accurate training data, incomplete disease information may lead to unsatisfactory classifiers and even misdiagnosis. To cope with these challenges, we propose a co-pseudo labeling and active selection method for Fundus Single-Positive multi-label learning, named FSP. FSP trains two networks simultaneously to generate pseudo labels through curriculum co-pseudo labeling and active sample selection. The curriculum co-pseudo labeling adjusts the thresholds according to the model’s learning status of each class. Then, the active sample selection maintains confident positive predictions with more precise pseudo labels based on loss modeling. A detailed experimental evaluation is conducted on seven retinal datasets. Comparison experiments show the effectiveness of FSP and its superiority over previous methods. Downstream experiments are also presented to validate the proposed method.
{"title":"Co-Pseudo Labeling and Active Selection for Fundus Single-Positive Multi-Label Learning","authors":"Tingxin Hu;Weihang Zhang;Jia Guo;Huiqi Li","doi":"10.1109/TMI.2025.3565000","DOIUrl":"10.1109/TMI.2025.3565000","url":null,"abstract":"Due to the difficulty of collecting multi-label annotations for retinal diseases, fundus images are usually annotated with only one label, while they actually have multiple labels. Given that deep learning requires accurate training data, incomplete disease information may lead to unsatisfactory classifiers and even misdiagnosis. To cope with these challenges, we propose a co-pseudo labeling and active selection method for Fundus Single-Positive multi-label learning, named FSP. FSP trains two networks simultaneously to generate pseudo labels through curriculum co-pseudo labeling and active sample selection. The curriculum co-pseudo labeling adjusts the thresholds according to the model’s learning status of each class. Then, the active sample selection maintains confident positive predictions with more precise pseudo labels based on loss modeling. A detailed experimental evaluation is conducted on seven retinal datasets. Comparison experiments show the effectiveness of FSP and its superiority over previous methods. Downstream experiments are also presented to validate the proposed method.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 8","pages":"3428-3438"},"PeriodicalIF":0.0,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-28. DOI: 10.1109/TMI.2025.3565023
Ge Zhang;Mathis Vert;Mohamed Nouhoum;Esteban Rivera;Nabil Haidour;Anatole Jimenez;Thomas Deffieux;Simon Barral;Pascal Hersen;Sophie Pezet;Claire Rabut;Mikhail G. Shapiro;Mickael Tanter
Ultrasound imaging holds significant promise for the observation of molecular and cellular phenomena through the utilization of acoustic contrast agents and acoustic reporter genes. Optimizing imaging methodologies for enhanced detection represents an imperative advancement in this field. The most advanced techniques, relying on amplitude modulation schemes such as cross amplitude modulation (xAM) and ultrafast amplitude modulation (uAM) combined with Hadamard-encoded multiplane wave transmissions, have shown efficacy in capturing the acoustic signals of gas vesicles (GVs). Nonetheless, the uAM sequence requires odd- or even-element transmissions, leading to an imprecise amplitude modulation emission scheme, and the complex multiplane wave transmission scheme inherently yields overlong pulse durations. The xAM sequence is limited in terms of field of view and imaging depth. To overcome these limitations, we introduce an innovative ultrafast imaging sequence called amplitude-modulated singular value decomposition (SVD) processing. Our method demonstrates a contrast imaging sensitivity comparable to the current gold-standard xAM and uAM, while requiring 4.8 times fewer pulse transmissions. With a similar number of transmit pulses, amplitude-modulated SVD outperforms xAM and uAM with improvements in signal-to-background ratio of $+4.78 \pm 0.35$ dB and $+8.29 \pm 3.52$ dB, respectively. Furthermore, the method exhibits superior robustness across a wide range of acoustic pressures and enables high-contrast imaging in ex vivo and in vivo settings. Finally, amplitude-modulated SVD is envisioned to be applicable to the detection of slow-moving microbubbles in ultrasound localization microscopy (ULM).
{"title":"Amplitude-Modulated Singular Value Decomposition for Ultrafast Ultrasound Imaging of Gas Vesicles","authors":"Ge Zhang;Mathis Vert;Mohamed Nouhoum;Esteban Rivera;Nabil Haidour;Anatole Jimenez;Thomas Deffieux;Simon Barral;Pascal Hersen;Sophie Pezet;Claire Rabut;Mikhail G. Shapiro;Mickael Tanter","doi":"10.1109/TMI.2025.3565023","DOIUrl":"10.1109/TMI.2025.3565023","url":null,"abstract":"Ultrasound imaging holds significant promise for the observation of molecular and cellular phenomena through the utilization of acoustic contrast agents and acoustic reporter genes. Optimizing imaging methodologies for enhanced detection represents an imperative advancement in this field. Most advanced techniques relying on amplitude modulation schemes such as cross amplitude modulation (xAM) and ultrafast amplitude modulation (uAM) combined with Hadamard encoded multiplane wave transmissions have shown efficacy in capturing the acoustic signals of gas vesicles (GVs). Nonetheless, uAM sequence requires odd- or even-element transmissions leading to imprecise amplitude modulation emitting scheme, and the complex multiplane wave transmission scheme inherently yields overlong pulse durations. xAM sequence is limited in terms of field of view and imaging depth. To overcome these limitations, we introduce an innovative ultrafast imaging sequence called amplitude-modulated singular value decomposition (SVD) processing. Our method demonstrates a contrast imaging sensitivity comparable to the current gold-standard xAM and uAM, while requiring 4.8 times fewer pulse transmissions. With a similar number of transmit pulses, amplitude-modulated SVD outperforms xAM and uAM in terms of an improvement in signal-to-background ratio of <inline-formula> <tex-math>$+ 4.78~pm ~0.35$ </tex-math></inline-formula> dB and <inline-formula> <tex-math>$+ 8.29~pm ~3.52$ </tex-math></inline-formula> dB, respectively. Furthermore, the method exhibits superior robustness across a wide range of acoustic pressures and enables high-contrast imaging in ex vivo and in vivo settings. Furthermore, amplitude-modulated SVD is envisioned to be applicable for the detection of slow moving microbubbles in ultrasound localization microscopy (ULM).","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 8","pages":"3490-3501"},"PeriodicalIF":0.0,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-28. DOI: 10.1109/TMI.2025.3563482
Peng Tang;Xiaoxiao Yan;Xiaobin Hu;Kai Wu;Tobias Lasser;Kuangyu Shi
Anomaly detection (AD) in medical applications is a promising field, offering a cost-effective alternative to labor-intensive abnormal data collection and labeling. However, the success of feature reconstruction-based methods in AD is often hindered by two critical factors: the domain gap of pre-trained encoders and the insufficient exploitation of the decoder’s potential. In this paper, we present encoder-attention-2decoder (EA2D), a novel method tailored for medical AD that overcomes these challenges, paving the way for more effective AD in medical imaging. First, EA2D is optimized through two tasks: a primary feature reconstruction task between the encoder and decoder, which detects anomalies based on reconstruction errors, and an auxiliary transformation-consistency contrastive learning task that explicitly optimizes the encoder to reduce the domain gap between natural images and medical images. Furthermore, EA2D intensively exploits the decoder’s capabilities to improve AD performance. We introduce a self-attention skip connection to augment the reconstruction quality of normal cases, thereby magnifying the distinction between normal and abnormal samples. Additionally, we propose using dual decoders to reconstruct dual views of an image, leveraging diverse perspectives while mitigating the over-reconstruction of anomalies in AD. Extensive experiments across four medical image modalities demonstrate the superiority of our EA2D in various medical scenarios. Our method’s code will be released at https://github.com/TumCCC/E2AD.
{"title":"Anomaly Detection in Medical Images Using Encoder-Attention-2Decoders Reconstruction","authors":"Peng Tang;Xiaoxiao Yan;Xiaobin Hu;Kai Wu;Tobias Lasser;Kuangyu Shi","doi":"10.1109/TMI.2025.3563482","DOIUrl":"10.1109/TMI.2025.3563482","url":null,"abstract":"Anomaly detection (AD) in medical applications is a promising field, offering a cost-effective alternative to labor-intensive abnormal data collection and labeling. However, the success of feature reconstruction-based methods in AD is often hindered by two critical factors: the domain gap of pre-trained encoders and the exploration of decoder potential. The EA2D method we propose overcomes these challenges, paving the way for more effective AD in medical imaging. In this paper, we present encoder-attention-2decoder (EA2D), a novel method tailored for medical AD. Firstly, EA2D is optimized through two tasks: a primary feature reconstruction task between the encoder and decoder, which detects anomalies based on reconstruction errors, and an auxiliary transformation-consistency contrastive learning task that explicitly optimizes the encoder to reduce the domain gap between natural images and medical images. Furthermore, EA2D intensely exploits the decoder’s capabilities to improve AD performance. We introduce a self-attention skip connection to augment the reconstruction quality of normal cases, thereby magnifying the distinction between normal and abnormal samples. Additionally, we propose using dual decoders to reconstruct dual views of an image, leveraging diverse perspectives while mitigating the over-reconstruction issue of anomalies in AD. Extensive experiments across four medical image modalities demonstrates the superiority of our EA2D in various medical scenarios. Our method’s code will be released at <uri>https://github.com/TumCCC/E2AD</uri>.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 8","pages":"3370-3382"},"PeriodicalIF":0.0,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}