Few-shot unsupervised domain adaptation (FS-UDA) leverages a limited amount of labeled data from a source domain to enable accurate classification in an unlabeled target domain. Despite recent advances, current FS-UDA approaches still face a major challenge: models are often unstable when adapted to new FS-UDA tasks and require considerable adaptation time. To address these challenges, we propose a novel framework called Enduring and Efficient Meta-Prompt Learning (E2MPL) for FS-UDA. Within this framework, we use the pre-trained CLIP model as the feature-learning backbone. First, we design domain-shared prompts, consisting of virtual tokens, which capture meta-knowledge from a wide range of meta-tasks to mitigate domain gaps. Second, we develop a task prompt learning network that adaptively learns task-specific prompts to achieve fast and stable task generalization. Third, we formulate meta-prompt learning as a bilevel optimization problem consisting of an (outer) meta-prompt learner and an (inner) task-specific classifier and domain adapter. Moreover, the inner objective of each meta-task has a closed-form solution, which enables efficient prompt learning and adaptation to new tasks in a single step. Extensive experimental studies demonstrate the promising performance of our framework on the domain adaptation benchmark DomainNet. Compared with state-of-the-art methods, our approach improves average accuracy by at least 15 percentage points and reduces average time by 64.67% in the 5-way 1-shot task; in the 5-way 5-shot task, it achieves at least a 9-percentage-point improvement in average accuracy and reduces average time by 63.18%. Moreover, our method exhibits more enduring and stable performance than the other methods, reducing the average IQR value by over 40.80% and 25.35% in the 5-way 1-shot and 5-shot tasks, respectively.
{"title":"E2MPL: An Enduring and Efficient Meta Prompt Learning Framework for Few-Shot Unsupervised Domain Adaptation","authors":"Wanqi Yang;Haoran Wang;Wei Wang;Lei Wang;Ge Song;Ming Yang;Yang Gao","doi":"10.1109/TIP.2025.3645560","DOIUrl":"10.1109/TIP.2025.3645560","url":null,"abstract":"Few-shot unsupervised domain adaptation (FS-UDA) leverages a limited amount of labeled data from a source domain to enable accurate classification in an unlabeled target domain. Despite recent advancements, current approaches of FS-UDA continue to confront a major challenge: models often demonstrate instability when adapted to new FS-UDA tasks and necessitate considerable time investment. To address these challenges, we put forward a novel framework called Enduring and Efficient Meta-Prompt Learning (E2MPL) for FS-UDA. Within this framework, we utilize the pre-trained CLIP model as the backbone of feature learning. Firstly, we design domain-shared prompts, consisting of virtual tokens, which primarily capture meta-knowledge from a wide range of meta-tasks to mitigate the domain gaps. Secondly, we develop a task prompt learning network that adaptively learns task-specific prompts with the goal of achieving fast and stable task generalization. Thirdly, we formulate the meta-prompt learning process as a bilevel optimization problem, consisting of (outer) meta-prompt learner and (inner) task-specific classifier and domain adapter. Also, the inner objective of each meta-task has the closed-form solution, which enables efficient prompt learning and adaptation to new tasks in a single step. Extensive experimental studies demonstrate the promising performance of our framework in a domain adaptation benchmark dataset DomainNet. Compared with state-of-the-art methods, our approach has improved the average accuracy by at least 15 percentage points and reduces the average time by 64.67% in the 5-way 1-shot task; in the 5-way 5-shot task, it achieves at least a 9-percentage-point improvement in average accuracy and reduces the average time by 63.18%. Moreover, our method exhibits more enduring and stable performance than the other methods, i.e., reducing the average IQR value by over 40.80% and 25.35% in the 5-way 1-shot and 5-shot task, respectively.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8656-8671"},"PeriodicalIF":13.7,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145812861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-23. DOI: 10.1109/TIP.2025.3645583
Zhuoran Zheng;Pu Wang;Liubing Hu;Xin Su
Ultra-high-definition (UHD) image restoration is vital for applications demanding exceptional visual fidelity, yet existing methods often face a trade-off between restoration quality and efficiency, limiting their practical deployment. In this paper, we propose TSFormer, an all-in-one framework that integrates Trusted learning with Sparsification to boost both generalization capability and computational efficiency in UHD image restoration. The key to sparsification is that only a small amount of token movement is allowed within the model. To efficiently filter tokens, we use Min-$p$ with random matrix theory to quantify the uncertainty of tokens (lower trustworthiness), thereby improving the robustness of the model. Our model can process a 4K ($3840\times 2160$) image in real time (40 fps) with 3.38M parameters. Extensive experiments demonstrate that TSFormer achieves state-of-the-art restoration quality while enhancing generalization and reducing computational demands. In addition, our token filtering method can be applied to other image restoration models to effectively accelerate inference while maintaining performance.
{"title":"TSFormer: Efficient Ultra-High-Definition Image Restoration via Trusted Min-p","authors":"Zhuoran Zheng;Pu Wang;Liubing Hu;Xin Su","doi":"10.1109/TIP.2025.3645583","DOIUrl":"10.1109/TIP.2025.3645583","url":null,"abstract":"Ultra-high-definition (UHD) image restoration is vital for applications demanding exceptional visual fidelity, yet existing methods often face a trade-off between restoration quality and efficiency, limiting their practical deployment. In this paper, we propose TSFormer, an all-in-one framework that integrates Trusted learning with Sparsification to boost both generalization capability and computational efficiency in UHD image restoration. The key to sparsification is that only a small amount of token movement is allowed within the model. To efficiently filter tokens, we use Min-<inline-formula> <tex-math>$p$ </tex-math></inline-formula> with random matrix theory to quantify the uncertainty of tokens (lower trustworthiness), thereby improving the robustness of the model. Our model can run a 4K (<inline-formula> <tex-math>$3840times 2160$ </tex-math></inline-formula>) image in real time (40fps) with 3.38 M parameters. Extensive experiments demonstrate that TSFormer achieves state-of-the-art restoration quality while enhancing generalization and reducing computational demands. In addition, our token filtering method can be applied to other image restoration models to effectively accelerate inference and maintain performance.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"247-259"},"PeriodicalIF":13.7,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145812859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised domain adaptation semantic segmentation (UDASS) aims to perform dense prediction on an unlabeled target domain by training the model on a labeled source domain. In this field, self-training approaches have demonstrated strong competitiveness and advantages. However, existing methods often rely on additional training data (such as reference datasets or depth maps) to rectify unreliable pseudo-labels, ignoring the cross-domain interaction between the target and source domains. To address this issue, in this paper, we propose a novel method for unsupervised domain adaptation semantic segmentation, termed Unlocking Cross-Domain Synergies (UCDS). Specifically, in the UCDS network, we design a new Dynamic Self-Correction (DSC) module that effectively transfers source-domain knowledge and generates high-confidence pseudo-labels without additional training resources. Unlike existing methods, DSC introduces a Dynamic Noisy Label Detection method for the target domain. To correct the noisy pseudo-labels, we design a Dual Bank mechanism that explores the reliable and unreliable predictions of the source domain and conducts cross-domain synergy through Weighted Reassignment Self-Correction and Negative Correction Prevention strategies. To enhance the discriminative ability of features and amplify the dissimilarity of different categories, we propose Discrepancy-based Contrastive Learning (DCL). DCL selects positive and negative samples in the source and target domains based on the semantic discrepancies among different categories, effectively avoiding the numerous false negative samples found in existing methods. Extensive experimental results on three commonly used datasets demonstrate the superiority of the proposed UCDS over state-of-the-art methods. The project and code are available at https://github.com/wqh011128/UCDS
{"title":"Unlocking Cross-Domain Synergies for Domain Adaptive Semantic Segmentation","authors":"Qin Xu;Qihang Wu;Bo Jiang;Jiahui Wang;Yuan Chen;Jinhui Tang","doi":"10.1109/TIP.2025.3645599","DOIUrl":"10.1109/TIP.2025.3645599","url":null,"abstract":"Unsupervised domain adaptation semantic segmentation (UDASS) aims to perform dense prediction on the unlabeled target domain by training the model on a labeled source domain. In this field, self-training approaches have demonstrated strong competitiveness and advantages. However, existing methods often rely on additional training data (such as reference datasets or depth maps) to rectify the unreliable pseudo-labels, ignoring the cross-domain interaction between the target and source domains. To address this issue, in this paper, we propose a novel method for unsupervised domain adaptation semantic segmentation, termed Unlocking Cross-Domain Synergies (UCDS). Specifically, in the UCDS network, we design a new Dynamic Self-Correction (DSC) module that effectively transfers source domain knowledge and generates high-confidence pseudo-labels without additional training resources. Unlike the existing methods, DSC proposes a Dynamic Noisy Label Detection method for the target domain. To correct the noisy pseudo-labels, we design a Dual Bank mechanism that explores the reliable and unreliable predictions of the source domain, and conducts cross-domain synergy through Weighted Reassignment Self-Correction and Negative Correction Prevention strategies. To enhance the discriminative ability of features and amplify the dissimilarity of different categories, we propose Discrepancy-based Contrastive Learning (DCL). The DCL selects positive and negative samples in the source and target domains based on the semantic discrepancies among different categories, effectively avoiding the numerous false negative samples found in existing methods. Extensive experimental results on three commonly used datasets demonstrate the superiority of the proposed UCDS in comparison with the state-of-the-art methods. The project and code are available at <uri>https://github.com/wqh011128/UCDS</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"136-149"},"PeriodicalIF":13.7,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145812855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-22. DOI: 10.1109/TIP.2025.3644787
Aimin Feng;Huichuan Huang;Guangyu Wei;Wenlong Sun
In the domain of image anomaly detection, significant progress has been made in unsupervised and self-supervised methods with datasets containing only normal samples. Although these methods perform well in general industrial anomaly detection scenarios, they often struggle with over- or under-detection when faced with fine-grained anomalies in products. In this paper, we propose GRAD: Bi-Grid Reconstruction for Image Anomaly Detection, which utilizes two continuous grids to detect anomalies from both normal and abnormal perspectives. In this work: 1) Grids serve as feature repositories to assist in the reconstruction task, achieving stronger generalization compared to discrete storage, while also helping to avoid the Identical Shortcut (IS) problem common in general reconstruction methods. 2) An additional grid storing abnormal features is introduced alongside the normal grid storing normal features, which refines the boundaries of normal features, thereby enhancing GRAD’s detection performance for fine-grained defects. 3) The Feature Block Pasting (FBP) module is designed to synthesize a variety of anomalies at the feature level, enabling the rapid deployment of the abnormal grid. Additionally, benefiting from the powerful representation capabilities of grids, GRAD is suitable for a unified task setting, requiring only a single model to be trained for multiple classes. GRAD has been comprehensively tested on classic industrial datasets including MVTecAD, VisA, and the newest GoodsAD dataset, showing significant improvement over current state-of-the-art methods.
{"title":"Bi-Grid Reconstruction for Image Anomaly Detection","authors":"Aimin Feng;Huichuan Huang;Guangyu Wei;Wenlong Sun","doi":"10.1109/TIP.2025.3644787","DOIUrl":"10.1109/TIP.2025.3644787","url":null,"abstract":"In the domain of image anomaly detection, significant progress has been made in unsupervised and self-supervised methods with datasets containing only normal samples. Although these methods perform well in general industrial anomaly detection scenarios, they often struggle with over- or under-detection when faced with fine-grained anomalies in products. In this paper, we propose GRAD: Bi-Grid Reconstruction for Image Anomaly Detection, which utilizes two continuous grids to detect anomalies from both normal and abnormal perspectives. In this work: 1) Grids serve as feature repositories to assist in the reconstruction task, achieving stronger generalization compared to discrete storage, while also helping to avoid the Identical Shortcut (IS) problem common in general reconstruction methods. 2) An additional grid storing abnormal features is introduced alongside the normal grid storing normal features, which refines the boundaries of normal features, thereby enhancing GRAD’s detection performance for fine-grained defects. 3) The Feature Block Pasting (FBP) module is designed to synthesize a variety of anomalies at the feature level, enabling the rapid deployment of the abnormal grid. Additionally, benefiting from the powerful representation capabilities of grids, GRAD is suitable for a unified task setting, requiring only a single model to be trained for multiple classes. GRAD has been comprehensively tested on classic industrial datasets including MVTecAD, VisA, and the newest GoodsAD dataset, showing significant improvement over current state-of-the-art methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8599-8613"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-22. DOI: 10.1109/TIP.2025.3644140
Zheren Fu;Zhendong Mao;Lei Zhang;Yongdong Zhang
Multimodal Large Language Models (MLLMs) exhibit impressive performance across vision-language tasks, but still face hallucination challenges, where generated texts are factually inconsistent with the visual input. Existing mitigation methods focus on surface symptoms of hallucination and rely heavily on post-hoc corrections, extensive data curation, or costly inference schemes. In this work, we identify two key factors of MLLM hallucination: Insufficient Visual Context, where ambiguous visual contexts lead to language speculation, and Progressive Textual Drift, where model attention strays from visual inputs in longer responses. To address these problems, we propose a novel Complementary Visual Grounding (CVG) framework. CVG exploits the intrinsic architecture of MLLMs, without requiring any external tools, models, or additional data. CVG first disentangles the visual context into two complementary branches based on query relevance, then maintains steadfast visual grounding during auto-regressive generation. Finally, it contrasts the output distributions of the two branches to produce a faithful response. Extensive experiments on various hallucination and general benchmarks demonstrate that CVG achieves state-of-the-art performance across MLLM architectures and scales.
{"title":"Boosting Faithful Multi-Modal LLMs via Complementary Visual Grounding","authors":"Zheren Fu;Zhendong Mao;Lei Zhang;Yongdong Zhang","doi":"10.1109/TIP.2025.3644140","DOIUrl":"10.1109/TIP.2025.3644140","url":null,"abstract":"Multimodal Large Language Models (MLLMs) exhibit impressive performance across vision-language tasks, but still face the hallucination challenges, where generated texts are factually inconsistent with visual input. Existing mitigation methods focus on surface symptoms of hallucination and heavily rely on post-hoc corrections, extensive data curation, or costly inference schemes. In this work, we identify two key factors of MLLM hallucination: Insufficient Visual Context, where ambiguous visual contexts lead to language speculation, and Progressive Textual Drift, where model attention strays from visual inputs in longer responses. To address these problems, we propose a novel Complementary Visual Grounding (CVG) framework. CVG exploits the intrinsic architecture of MLLMs, without requiring any external tools, models, or additional data. CVG first disentangles visual context into two complementary branches based on query relevance, then maintains steadfast visual grounding during the auto-regressive generation. Finally, it contrasts the output distributions of two branches to produce a faithful response. Extensive experiments on various hallucination and general benchmarks demonstrate that CVG achieves state-of-the-art performances across MLLM architectures and scales.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8641-8655"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-22. DOI: 10.1109/TIP.2025.3644793
Kareth M. León-López;Abderrahim Halimi;Jean-Yves Tourneret;Herwig Wendt
Multifractal analysis (MFA) provides a framework for the global characterization of image textures by describing the spatial fluctuations of their local regularity based on the multifractal spectrum. Several works have shown the benefit of using MFA to describe homogeneous textures in images. Nevertheless, natural images can be composed of several textures, each with its own multifractal properties. This paper introduces an unsupervised Bayesian multifractal segmentation method that models and segments multifractal textures by jointly estimating the multifractal parameters and labels of an image at the pixel level. To this end, a computationally and statistically efficient multifractal parameter estimation model for wavelet leaders is first developed, defining different multifractality parameters for different regions of an image. Then, a multiscale Potts Markov random field is introduced as a prior to model the inherent spatial and scale correlations (referred to as cross-scale correlations) between the labels of the wavelet leaders. A Gibbs sampling methodology is finally used to draw samples from the posterior distribution of the unknown model parameters. Numerical experiments are conducted on synthetic multifractal images to evaluate the performance of the proposed segmentation approach. The proposed method achieves superior performance compared to traditional unsupervised segmentation techniques as well as modern deep learning-based approaches, showing its effectiveness for multifractal image segmentation.
{"title":"Bayesian Multifractal Image Segmentation","authors":"Kareth M. León-López;Abderrahim Halimi;Jean-Yves Tourneret;Herwig Wendt","doi":"10.1109/TIP.2025.3644793","DOIUrl":"10.1109/TIP.2025.3644793","url":null,"abstract":"Multifractal analysis (MFA) provides a framework for the global characterization of image textures by describing the spatial fluctuations of their local regularity based on the multifractal spectrum. Several works have shown the interest of using MFA for the description of homogeneous textures in images. Nevertheless, natural images can be composed of several textures and, in turn, multifractal properties associated with those textures. This paper introduces an unsupervised Bayesian multifractal segmentation method to model and segment multifractal textures by jointly estimating the multifractal parameters and labels on images, at the pixel-level. For this, a computationally and statistically efficient multifractal parameter estimation model for wavelet leaders is firstly developed, defining different multifractality parameters for different regions of an image. Then, a multiscale Potts Markov random field is introduced as a prior to model the inherent spatial and scale correlations (referred to as cross-scale correlations) between the labels of the wavelet leaders. A Gibbs sampling methodology is finally used to draw samples from the posterior distribution of the unknown model parameters. Numerical experiments are conducted on synthetic multifractal images to evaluate the performance of the proposed segmentation approach. The proposed method achieves superior performance compared to traditional unsupervised segmentation techniques as well as modern deep learning-based approaches, showing its effectiveness for multifractal image segmentation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8500-8510"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-light colonoscopy video enhancement is needed because poor illumination during colonoscopy can hinder accurate disease diagnosis and adversely affect surgical procedures. Existing low-light video enhancement methods usually apply a frame-by-frame enhancement strategy without considering the temporal correlation between frames, which often causes flickering. In addition, most methods are designed for endoscopic devices with fixed imaging styles and cannot be easily adapted to different devices. In this paper, we propose a Style-Guided Network (SGNet) for unpaired Low-Light Colonoscopy Video Enhancement (LLCVE). Given that collecting content-consistent paired videos is difficult, SGNet adopts a CycleGAN-based framework to convert low-light videos to normal-light videos, in which a Temporal Compensation (TC) module and a Style Guidance (SG) module are proposed to alleviate the flickering problem and achieve flexible style transfer, respectively. The TC module compensates for a low-light frame by learning the correlated features of its adjacent frames, thereby improving the temporal smoothness of the enhanced video. The SG module encodes the text of the imaging style and adaptively explores its intrinsic relationships with video features to obtain style representations, which are then used to guide the subsequent enhancement process. Extensive experiments on a curated database show that SGNet achieves promising performance on the LLCVE task, outperforming state-of-the-art methods in both quantitative metrics and visual quality.
{"title":"SGNet: Style-Guided Network With Temporal Compensation for Unpaired Low-Light Colonoscopy Video Enhancement","authors":"Guanghui Yue;Lixin Zhang;Wanqing Liu;Jingfeng Du;Tianwei Zhou;Hanhe Lin;Qiuping Jiang;Wenqi Ren","doi":"10.1109/TIP.2025.3644172","DOIUrl":"10.1109/TIP.2025.3644172","url":null,"abstract":"A low-light colonoscopy video enhancement method is needed as poor illumination in colonoscopy can hinder accurate disease diagnosis and adversely affect surgical procedures. Existing low-light video enhancement methods usually apply a frame-by-frame enhancement strategy without considering the temporal correlation between them, which often causes a flickering problem. In addition, most methods are designed for endoscopic devices with fixed imaging styles and cannot be easily adapted to different devices. In this paper, we propose a Style-Guided Network (SGNet) for unpaired Low-Light Colonoscopy Video Enhancement (LLCVE). Given that collecting content-consistent paired videos is difficult, SGNet adopts a CycleGAN-based framework to convert low-light videos to normal-light videos, in which a Temporal Compensation (TC) module and a Style Guidance (SG) module are proposed to alleviate the flickering problem and achieve flexible style transfer, respectively. The TC module compensates for a low-light frame by learning the correlated feature of its adjacent frames, thereby improving the temporal smoothness of the enhanced video. The SG module encodes the text of the imaging style and adaptively explores its intrinsic relationships with video features to obtain style representations, which are then used to guide the subsequent enhancement process. Extensive experiments on a curated database show that SGNet achieves promising performance on the LLCVE task, outperforming state-of-the-art methods in both quantitative metrics and visual quality.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"234-246"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose A3-TTA, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model update during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at https://github.com/HiLab-git/A3-TTA
{"title":"A3-TTA: Adaptive Anchor Alignment Test-Time Adaptation for Image Segmentation","authors":"Jianghao Wu;Xiangde Luo;Yubo Zhou;Lianming Wu;Guotai Wang;Shaoting Zhang","doi":"10.1109/TIP.2025.3644789","DOIUrl":"10.1109/TIP.2025.3644789","url":null,"abstract":"Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose A3-TTA, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model update during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at <uri>https://github.com/HiLab-git/A3-TTA</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8511-8522"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. Lifting-based methods encode the detected 2D pose and the unknown depth in an entangled feature space, explicitly injecting depth uncertainty into the detected 2D pose and thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is first refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: 1) A mixture-of-experts network in which specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding of 2D pose and depth, thereby reducing the explicit influence of uncertain depth features on 2D pose features. 2) A cross-expert knowledge aggregation module that aggregates cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that the proposed PoseMoE outperforms conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
{"title":"PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation","authors":"Mengyuan Liu;Jiajie Liu;Jinyan Zhang;Wenhao Li;Junsong Yuan","doi":"10.1109/TIP.2025.3644785","DOIUrl":"10.1109/TIP.2025.3644785","url":null,"abstract":"The lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. The lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: 1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. 2) A cross-expert knowledge aggregation module is proposed to aggregate cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8537-8551"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Medical image restoration (MedIR) aims to recover high-quality images from degraded inputs, yet faces unique challenges from physics-driven degradations and multi-modal task interference. While existing all-in-one methods handle natural image degradations well, they struggle with medical scenarios due to limited degradation perception and suboptimal multi-task optimization. In response, we introduce DaPT, a Degradation-aware Prompted Transformer, which integrates dynamic prompt learning and modular expert mining for unified MedIR. First, DaPT introduces spatially compact prompts with optimal transport regularization, amplifying inter-prompt differences to capture diverse degradation patterns. Second, a mixture of experts dynamically routes inputs to specialized modules via prompt guidance, resolving task conflicts while reducing computational overhead. The synergy of prompt learning and expert mining further enables robust restoration across multi-modal medical data, offering a practical solution for clinical imaging. Extensive experiments across multiple modalities (MRI, CT, PET) and diverse degradations, covering both in-distribution and out-of-distribution scenarios, demonstrate that DaPT consistently outperforms state-of-the-art methods and generalizes reliably to unseen settings, underscoring its robustness, effectiveness, and clinical practicality. The source code will be released at https://github.com/weijinbao1998/DaPT
{"title":"Degradation-Aware Prompted Transformer for Unified Medical Image Restoration","authors":"Jinbao Wei;Gang Yang;Zhijie Wang;Shimin Tao;Aiping Liu;Xun Chen","doi":"10.1109/TIP.2025.3644795","DOIUrl":"10.1109/TIP.2025.3644795","url":null,"abstract":"Medical image restoration (MedIR) aims to recover high-quality images from degraded inputs, yet faces unique challenges from physics-driven degradations and multi-modal task interference. While existing all-in-one methods handle natural image degradations well, they struggle with medical scenarios due to limited degradation perception and suboptimal multi-task optimization. In response, we introduce DaPT, a Degradation-aware Prompted Transformer, which integrates dynamic prompt learning and modular expert mining for unified MedIR. First, DaPT introduces spatially compact prompts with optimal transport regularization, amplifying inter-prompt differences to capture diverse degradation patterns. Second, a mixture of experts dynamically routes inputs to specialized modules via prompt guidance, resolving task conflicts while reducing computational overhead. The synergy of prompt learning and expert mining further enables robust restoration across multi-modal medical data, offering a practical solution for clinical imaging. Extensive experiments across multiple modalities (MRI, CT, PET) and diverse degradations, covering both in-distribution and out-of-distribution scenarios, demonstrate that DaPT consistently outperforms state-of-the-art methods and generalizes reliably to unseen settings, underscoring its robustness, effectiveness, and clinical practicality. The source code will be released at <uri>https://github.com/weijinbao1998/DaPT</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8583-8598"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}