Zhenxuan Zhang , Heye Zhang , Tieyong Zeng , Guang Yang , Zhenquan Shi , Zhifan Gao
{"title":"Bridging multi-level gaps: Bidirectional reciprocal cycle framework for text-guided label-efficient segmentation in echocardiography","authors":"Zhenxuan Zhang , Heye Zhang , Tieyong Zeng , Guang Yang , Zhenquan Shi , Zhifan Gao","doi":"10.1016/j.media.2025.103536","DOIUrl":null,"url":null,"abstract":"<div><div>Text-guided visual understanding is a potential solution for downstream task learning in echocardiography. It can reduce reliance on labeled large datasets and facilitate learning clinical tasks. This is because the text can embed highly condensed clinical information into predictions for visual tasks. The contrastive language-image pretraining (CLIP) based methods extract image-text features by constructing a contrastive learning pre-train process in a sequence of matched text and images. These methods adapt the pre-trained network parameters to improve downstream task performance with text guidance. However, these methods still have the challenge of the multi-level gap between image and text. It mainly stems from spatial-level, contextual-level, and domain-level gaps. It is difficult to deal with medical image–text pairs and dense prediction tasks. Therefore, we propose a bidirectional reciprocal cycle (BRC) framework to bridge the multi-level gaps. First, the BRC constructs pyramid reciprocal alignments of embedded global and local image–text feature representations. This matches complex medical expertise with corresponding phenomena. Second, BRC enforces the forward inference to be consistent with the reverse mapping (i.e., the text <span><math><mo>→</mo></math></span> feature is consistent with the feature <span><math><mo>→</mo></math></span> text or feature <span><math><mo>→</mo></math></span> image). This enforces the perception of the contextual relationship between input data and feature. Third, the BRC can adapt to the specific downstream segmentation task. This embeds complex text information to directly guide downstream tasks with a cross-modal attention mechanism. Compared with 22 existing methods, our BRC can achieve state-of-the-art performance on segmentation tasks (DSC = 95.2%). Extensive experiments on 11048 patients show that our method can significantly improve the accuracy and reduce the reliance on labeled data (DSC increased from 81.5% to 86.6% with text assistance in 1% labeled proportion data).</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"102 ","pages":"Article 103536"},"PeriodicalIF":10.7000,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical image analysis","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1361841525000830","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Text-guided visual understanding is a potential solution for downstream task learning in echocardiography. It can reduce reliance on labeled large datasets and facilitate learning clinical tasks. This is because the text can embed highly condensed clinical information into predictions for visual tasks. The contrastive language-image pretraining (CLIP) based methods extract image-text features by constructing a contrastive learning pre-train process in a sequence of matched text and images. These methods adapt the pre-trained network parameters to improve downstream task performance with text guidance. However, these methods still have the challenge of the multi-level gap between image and text. It mainly stems from spatial-level, contextual-level, and domain-level gaps. It is difficult to deal with medical image–text pairs and dense prediction tasks. Therefore, we propose a bidirectional reciprocal cycle (BRC) framework to bridge the multi-level gaps. First, the BRC constructs pyramid reciprocal alignments of embedded global and local image–text feature representations. This matches complex medical expertise with corresponding phenomena. Second, BRC enforces the forward inference to be consistent with the reverse mapping (i.e., the text feature is consistent with the feature text or feature image). This enforces the perception of the contextual relationship between input data and feature. Third, the BRC can adapt to the specific downstream segmentation task. This embeds complex text information to directly guide downstream tasks with a cross-modal attention mechanism. Compared with 22 existing methods, our BRC can achieve state-of-the-art performance on segmentation tasks (DSC = 95.2%). Extensive experiments on 11048 patients show that our method can significantly improve the accuracy and reduce the reliance on labeled data (DSC increased from 81.5% to 86.6% with text assistance in 1% labeled proportion data).
期刊介绍:
Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.