Recovering High Dynamic Range (HDR) images from multiple Standard Dynamic Range (SDR) images becomes challenging when the SDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB SDR images, and the resulting domain/format gap poses a significant challenge when applying them to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from the SDR domain via self-distillation to boost existing HDR reconstruction methods. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR-image semantic knowledge to address ill-posed regions in the initial HDR reconstruction results. Subsequently, we apply a self-distillation mechanism that constrains color and content information with semantic knowledge, aligning the external outputs of the baseline and the SPGRM. Furthermore, to transfer semantic knowledge at the level of internal features, we use a Semantic Knowledge Alignment Module (SKAM) to fill in missing semantic content via complementary masks. Extensive experiments demonstrate that our framework significantly boosts the HDR imaging quality of existing methods without altering their network architectures.
{"title":"Boosting HDR Image Reconstruction via Semantic Knowledge Transfer.","authors":"Tao Hu, Longyao Wu, Wei Dong, Peng Wu, Jinqiu Sun, Xiaogang Xu, Qingsen Yan, Yanning Zhang","doi":"10.1109/TIP.2026.3652360","DOIUrl":"https://doi.org/10.1109/TIP.2026.3652360","url":null,"abstract":"<p><p>Recovering High Dynamic Range (HDR) images from multiple Standard Dynamic Range (SDR) images becomes challenging when the SDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB SDR images, the domain/format gap poses a significant challenge when applying it to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from SDR domain via self-distillation to boost existing HDR reconstruction. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a Semantic Knowledge Alignment Module (SKAM) to fill the missing semantic contents with the complementary masks. Extensive experiments demonstrate that our framework significantly boosts HDR imaging quality for existing methods without altering the network architecture.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145991925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-16. DOI: 10.1109/TIP.2026.3653198
Kang Geon Lee;Soochahn Lee;Kyoung Mu Lee
We propose a robust alignment technique for Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which are challenging to align due to differences in scale, appearance, and the scarcity of distinctive features. Our method, termed Particle Diffusion Matching (PDM), performs alignment through an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model. At each iteration, the model estimates displacement vectors for particle points by considering local appearance, the structural distribution of particles, and an estimated global transformation, enabling progressive refinement of correspondences even under difficult conditions. PDM achieves state-of-the-art performance across multiple retinal image alignment benchmarks, showing substantial improvement on a primary dataset of SFI-UWFI pairs and demonstrating its effectiveness in real-world clinical scenarios. By providing accurate and scalable correspondence estimation, PDM overcomes the limitations of existing methods and facilitates the integration of complementary retinal image modalities. This diffusion-guided search strategy offers a new direction for improving downstream supervised learning, disease diagnosis, and multi-modal image analysis in ophthalmology.
{"title":"Particle Diffusion Matching: Random Walk Correspondence Search for the Alignment of Standard and Ultra-Widefield Fundus Images","authors":"Kang Geon Lee;Soochahn Lee;Kyoung Mu Lee","doi":"10.1109/TIP.2026.3653198","DOIUrl":"10.1109/TIP.2026.3653198","url":null,"abstract":"We propose a robust alignment technique for Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which are challenging to align due to differences in scale, appearance, and the scarcity of distinctive features. Our method, termed Particle Diffusion Matching (PDM), performs alignment through an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model. At each iteration, the model estimates displacement vectors for particle points by considering local appearance, the structural distribution of particles, and an estimated global transformation, enabling progressive refinement of correspondences even under difficult conditions. PDM achieves state-of-the-art performance across multiple retinal image alignment benchmarks, showing substantial improvement on a primary dataset of SFI-UWFI pairs and demonstrating its effectiveness in real-world clinical scenarios. By providing accurate and scalable correspondence estimation, PDM overcomes the limitations of existing methods and facilitates the integration of complementary retinal image modalities. This diffusion-guided search strategy offers a new direction for improving downstream supervised learning, disease diagnosis, and multi-modal image analysis in ophthalmology.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"943-954"},"PeriodicalIF":13.7,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145986626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process for pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficient. To guide the pre-training process, polarimetric decomposition loss and power self-supervised loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the input image's power. The performance of our foundation model is validated on nine typical downstream tasks, achieving state-of-the-art results. Notably, the foundation model can extract stable feature representations and exhibits strong generalization, even in data-scarce conditions.
{"title":"A Complex-valued SAR Foundation Model Based on Physically Inspired Representation Learning.","authors":"Mengyu Wang, Hanbo Bi, Yingchao Feng, Linlin Xin, Shuo Gong, Tianqi Wang, Zhiyuan Yan, Peijin Wang, Wenhui Diao, Xian Sun","doi":"10.1109/TIP.2026.3652417","DOIUrl":"https://doi.org/10.1109/TIP.2026.3652417","url":null,"abstract":"<p><p>Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process for pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficient. To guide the pre-training process, polarimetric decomposition loss and power self-supervised loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the input image's power. The performance of our foundation model is validated on nine typical downstream tasks, achieving state-of-the-art results. Notably, the foundation model can extract stable feature representations and exhibits strong generalization, even in data-scarce conditions.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145991968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-15. DOI: 10.1109/TIP.2025.3650668
Ruonan Zhang;Gaoyun An;Yiqing Hao;Dapeng Oliver Wu
Scene Graph Generation (SGG) is a challenging cross-modal task that aims to identify entities and their relationships in a scene simultaneously. Due to the highly skewed long-tailed distribution, the generated scene graphs are dominated by the relation categories of head samples. Current works address this problem by designing re-balancing strategies at the data level or refining relation representations at the feature level. Different from them, we attribute this effect to catastrophic interference: the subsequent learning of dominant relations tends to overwrite the earlier learning of rare relations. To address it at the modeling level, a Hippocampal Memory-Like Separation-Completion Collaborative Network (HMSC2) is proposed here, which imitates the hippocampal encoding and retrieval process. Inspired by the pattern separation of the dentate gyrus during memory encoding, a Gradient Separation Classifier and a Prototype Separation Learning module are proposed to relieve the catastrophic interference on tail categories by modeling the separated classifier and prototypes. In addition, inspired by the pattern completion of area CA3 of the hippocampus during memory retrieval, a Prototype Completion Module is designed to supplement the incomplete information of prototypes by introducing relation representations as cues. Finally, the completed prototype and relation representations are connected within a hypersphere space by a Contrastive Connected Module. Experimental results on the Visual Genome and GQA datasets show that our HMSC2 achieves state-of-the-art performance on the unbiased SGG task, effectively alleviating the long-tailed problem. The source codes are released on GitHub: https://github.com/Nora-Zhang98/HMSC2
{"title":"Hippocampal Memory-Like Separation-Completion Collaborative Network for Unbiased Scene Graph Generation","authors":"Ruonan Zhang;Gaoyun An;Yiqing Hao;Dapeng Oliver Wu","doi":"10.1109/TIP.2025.3650668","DOIUrl":"10.1109/TIP.2025.3650668","url":null,"abstract":"Scene Graph Generation (SGG) is a challenging cross-modal task, which aims to identify entities and relationships in a scene simultaneously. Due to the highly skewed long-tailed distribution, the generated scene graphs are dominated by relation categories of head samples. Current works address this problem by designing re-balancing strategies at the data level or refining relation representations at the feature level. Different from them, we attribute this impact to catastrophic interference, that is, the subsequent learning of dominant relations tends to overwrite the earlier learning of rare relations. To address it at the modeling level, a Hippocampal Memory-Like Separation-Completion Collaborative Network (HMSC2) is proposed here, which imitates the hippocampal encoding and retrieval process. Inspired by the pattern separation of dentate gyrus during memory encoding, a Gradient Separation Classifier and a Prototype Separation Learning module are proposed to relieve the catastrophic interference of tail categories by modeling the separated classifier and prototypes. In addition, inspired by the pattern completion of area CA3 of the hippocampus during memory retrieval, a Prototype Completion Module is designed to supplement the incomplete information of prototypes by introducing relation representations as cues. Finally, the completed prototype and relation representations are connected within a hypersphere space by a Contrastive Connected Module. Experimental results on the Visual Genome and GQA datasets show our HMSC2 achieves state-of-the-art performance on the unbiased SGG task, effectively relieving the long-tailed problem. The source codes are released on GitHub: <uri>https://github.com/Nora-Zhang98/HMSC2</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"770-785"},"PeriodicalIF":13.7,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in “track-anything” models have significantly improved fine-grained video understanding by simultaneously handling multiple video segmentation and tracking tasks. However, existing models often struggle with robust and efficient temporal propagation. To address these challenges, we propose the Sparse Spatio-Temporal Propagation (SSTP) method, which achieves robust and efficient unified video segmentation by selectively leveraging key spatio-temporal features in videos. Specifically, we design a dynamic 3D spatio-temporal convolution to aggregate global multi-frame spatio-temporal information into memory frames during memory construction. Additionally, we introduce a spatio-temporal aggregation reading strategy to efficiently aggregate the relevant spatio-temporal features from multiple memory frames during memory retrieval. By combining SSTP with an image segmentation foundation model, such as the segment anything model, our method effectively addresses multiple data-scarce video segmentation tasks. Our experimental results demonstrate state-of-the-art performance on five video segmentation tasks across eleven datasets, outperforming both task-specific and unified methods. Notably, SSTP exhibits strong robustness in handling sparse, low-frame-rate videos, making it well-suited for real-world applications.
{"title":"Fast Track Anything With Sparse Spatio-Temporal Propagation for Unified Video Segmentation","authors":"Jisheng Dang;Huicheng Zheng;Zhixuan Chen;Zhang Li;Yulan Guo;Tat-Seng Chua","doi":"10.1109/TIP.2025.3649365","DOIUrl":"10.1109/TIP.2025.3649365","url":null,"abstract":"Recent advances in “track-anything” models have significantly improved fine-grained video understanding by simultaneously handling multiple video segmentation and tracking tasks. However, existing models often struggle with robust and efficient temporal propagation. To address these challenges, we propose the Sparse Spatio-Temporal Propagation (SSTP) method, which achieves robust and efficient unified video segmentation by selectively leveraging key spatio-temporal features in videos. Specifically, we design a dynamic 3D spatio-temporal convolution to aggregate global multi-frame spatio-temporal information into memory frames during memory construction. Additionally, we introduce a spatio-temporal aggregation reading strategy to efficiently aggregate the relevant spatio-temporal features from multiple memory frames during memory retrieval. By combining SSTP with an image segmentation foundation model, such as the segment anything model, our method effectively addresses multiple data-scarce video segmentation tasks. Our experimental results demonstrate state-of-the-art performance on five video segmentation tasks across eleven datasets, outperforming both task-specific and unified methods. Notably, SSTP exhibits strong robustness in handling sparse, low-frame-rate videos, making it well-suited for real-world applications.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"955-969"},"PeriodicalIF":13.7,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-15. DOI: 10.1109/TIP.2025.3648872
Wenjie Li;Heng Guo;Yuefeng Hou;Zhanyu Ma
Image super-resolution (SR) aims to recover high-resolution images from low-resolution inputs, and improving SR efficiency remains a high-profile challenge. However, commonly used units in SR, such as convolutions and window-based Transformers, have limited receptive fields, making it difficult to use them to improve SR under extremely limited computational cost. To address this issue, inspired by modeling the convolution theorem through token mixing, we propose a Fourier token-based plugin called FourierSR to improve SR uniformly, which avoids the instability or inefficiency of existing token-mixing techniques when applied as plug-ins. Furthermore, compared to convolutions and window-based Transformers, our FourierSR only utilizes Fourier transforms and multiplication operations, greatly reducing complexity while retaining global receptive fields. Experimental results show that FourierSR, as a plug-and-play unit, brings an average PSNR gain of 0.34 dB for existing efficient SR methods on the Manga109 test set at the ×4 scale, while the average increase in Params and FLOPs is only 0.6% and 1.5% of the original sizes. We will release our codes upon acceptance.
{"title":"FourierSR: A Fourier Token-Based Plugin for Efficient Image Super-Resolution","authors":"Wenjie Li;Heng Guo;Yuefeng Hou;Zhanyu Ma","doi":"10.1109/TIP.2025.3648872","DOIUrl":"10.1109/TIP.2025.3648872","url":null,"abstract":"Image super-resolution (SR) aims to recover low-resolution images to high-resolution images, where improving SR efficiency is a high-profile challenge. However, commonly used units in SR, like convolutions and window-based Transformers, have limited receptive fields, making it challenging to apply them to improve SR under extremely limited computational cost. To address this issue, inspired by modeling convolution theorem through token mix, we propose a Fourier token-based plugin called FourierSR to improve SR uniformly, which avoids the instability or inefficiency of existing token mix technologies when applied as plug-ins. Furthermore, compared to convolutions and windows-based Transformers, our FourierSR only utilizes Fourier transform and multiplication operations, greatly reducing complexity while having global receptive fields. Experimental results show that our FourierSR as a plug-and-play unit brings an average PSNR gain of 0.34dB for existing efficient SR methods on Manga109 test set at the scale of <inline-formula> <tex-math>$times 4$ </tex-math></inline-formula>, while the average increase in the number of Params and FLOPs is only 0.6% and 1.5% of original sizes. We will release our codes upon acceptance.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"732-742"},"PeriodicalIF":13.7,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145972027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-14. DOI: 10.1109/TIP.2026.3652357
Qun Li;Lu Huang;Fu Xiao;Na Zhao;Bir Bhanu
Incremental Few-shot Semantic Segmentation (iFSS) aims to learn novel classes with limited samples while preserving segmentation capability for base classes, addressing the challenge of continually learning novel classes without catastrophically forgetting previously seen ones. Existing methods mainly rely on techniques such as knowledge distillation and background learning, which, while partially effective, still suffer from issues such as feature drift and limited generalization to real-world novel classes, primarily due to a bidirectional coupling bottleneck between the learning of base classes and novel classes. To address these challenges, we propose, for the first time, a diffusion-based generative framework for iFSS. Specifically, we bridge the gap between generative and discriminative tasks through an innovative binary-to-RGB mask mapping mechanism, enabling pre-trained diffusion models to focus on target regions via class-specific semantic embedding optimization while sharpening foreground-background contrast with color embeddings. A lightweight post-processor then refines the generated images into high-quality binary masks. Crucially, by leveraging diffusion priors, our framework avoids complex training strategies. The optimization of class-specific semantic embeddings decouples the embedding spaces of base and novel classes, inherently preventing feature drift, mitigating catastrophic forgetting, and enabling rapid novel-class adaptation. Experimental results show that our method achieves state-of-the-art performance on the PASCAL-5i and COCO-20i datasets using much less data than other methods, and exhibits competitive results in cross-domain few-shot segmentation tasks. Project page: https://ifss-diff.github.io/
{"title":"Toward Generative Understanding: Incremental Few-Shot Semantic Segmentation With Diffusion Models","authors":"Qun Li;Lu Huang;Fu Xiao;Na Zhao;Bir Bhanu","doi":"10.1109/TIP.2026.3652357","DOIUrl":"10.1109/TIP.2026.3652357","url":null,"abstract":"Incremental Few-shot Semantic Segmentation (iFSS) aims to learn novel classes with limited samples while preserving segmentation capability for base classes, addressing the challenge of continual learning of novel classes and catastrophic forgetting of previously seen classes. Existing methods mainly rely on techniques such as knowledge distillation and background learning, which, while partially effective, still suffer from issues such as feature drift and limited generalization to real-world novel classes, primarily due to a bidirectional coupling bottleneck between the learning of base classes and novel classes. To address these challenges, we propose, for the first time, a diffusion-based generative framework for iFSS. Specifically, we bridge the gap between generative and discriminative tasks through an innovative binary-to-RGB mask mapping mechanism, enabling pre-trained diffusion models to focus on target regions via class-specific semantic embedding optimization while sharpening foreground-background contrast with color embeddings. A lightweight post-processor then refines the generated images into high-quality binary masks. Crucially, by leveraging diffusion priors, our framework avoids complex training strategies. The optimization of class-specific semantic embeddings decouples the embedding spaces of base and novel classes, inherently preventing feature drift, mitigating catastrophic forgetting, and enabling rapid novel-class adaptation. Experimental results show that our method achieves state-of-the-art performance on the PASCAL-<inline-formula> <tex-math>$5^{i}$ </tex-math></inline-formula> and COCO-<inline-formula> <tex-math>$20^{i}$ </tex-math></inline-formula> datasets using much less data than other methods, and exhibiting competitive results in cross-domain few-shot segmentation tasks. Project page: <uri>https://ifss-diff.github.io/</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"743-758"},"PeriodicalIF":13.7,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-14. DOI: 10.1109/TIP.2026.3652371
Zhaozhi Wang;Yunjie Tian;Lingxi Xie;Yaowei Wang;Qixiang Ye
In this study, we introduce EinsPT, an efficient instance-aware pre-training paradigm designed to reduce the transfer gap between vision foundation models and downstream instance-level tasks. Unlike conventional image-level pre-training that relies solely on unlabeled images, EinsPT leverages both image reconstruction and instance annotations to learn representations that are spatially coherent and instance discriminative. To achieve this efficiently, we propose a proxy–foundation architecture that decouples high-resolution and low-resolution learning: the foundation model processes masked low-resolution images for global semantics, while a lightweight proxy model operates on complete high-resolution images to preserve fine-grained details. The two branches are jointly optimized through reconstruction and instance-level prediction losses on fused features. Extensive experiments demonstrate that EinsPT consistently enhances recognition accuracy across various downstream tasks with substantially reduced computational cost, while qualitative results further reveal improved instance perception and completeness in visual representations. Code is available at github.com/feufhd/EinsPT
{"title":"EinsPT: Efficient Instance-Aware Pre-Training of Vision Foundation Models","authors":"Zhaozhi Wang;Yunjie Tian;Lingxi Xie;Yaowei Wang;Qixiang Ye","doi":"10.1109/TIP.2026.3652371","DOIUrl":"10.1109/TIP.2026.3652371","url":null,"abstract":"In this study, we introduce EinsPT, an efficient instance-aware pre-training paradigm designed to reduce the transfer gap between vision foundation models and downstream instance-level tasks. Unlike conventional image-level pre-training that relies solely on unlabeled images, EinsPT leverages both image reconstruction and instance annotations to learn representations that are spatially coherent and instance discriminative. To achieve this efficiently, we propose a proxy–foundation architecture that decouples high-resolution and low-resolution learning: the foundation model processes masked low-resolution images for global semantics, while a lightweight proxy model operates on complete high-resolution images to preserve fine-grained details. The two branches are jointly optimized through reconstruction and instance-level prediction losses on fused features. Extensive experiments demonstrate that EinsPT consistently enhances recognition accuracy across various downstream tasks with substantially reduced computational cost, while qualitative results further reveal improved instance perception and completeness in visual representations. Code is available at github.com/feufhd/EinsPT","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"786-799"},"PeriodicalIF":13.7,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Labeling Cadmium Zinc Telluride (CdZnTe) semiconductor images is challenging due to the low-contrast defect boundaries, requiring annotators to cross-reference multiple views. These views share a single ground truth (GT), forming a unique “many-to-one” relationship. This characteristic renders advanced semi-supervised semantic segmentation (SSS) methods suboptimal, as they are generally limited by a “one-to-one” relationship, where each image is independently associated with its GT. Such a limitation may lead to error accumulation in low-contrast regions, further exacerbating confirmation bias. To address this issue, we revisit the SSS pipeline from a group-oriented perspective and propose a human-inspired solution: the Intra-group Consistency Augmentation Framework (ICAF). First, we experimentally validate the inherent consistency constraints within CdZnTe groups, establishing a group-oriented baseline using Intra-group View Sampling (IVS). Building on this insight, we introduce the Pseudo-label Correction Network (PCN) to enhance consistency representation, which consists of two key modules. The View Augmentation Module (VAM) improves boundary details by dynamically synthesizing a boundary-aware view through the aggregation of multiple views. In the View Correction Module (VCM), this synthesized view is paired with other views for information interaction, effectively emphasizing salient regions while minimizing noise. Extensive experiments demonstrate the effectiveness of our solution for CdZnTe materials. Leveraging DeepLabV3+ with a ResNet-101 backbone as our segmentation model, we achieve 70.6% mIoU on the CdZnTe dataset using only 2 group-annotated data (5‰). The code is available at https://github.com/pipixiapipi/ICAF
{"title":"Harnessing Group-Oriented Consistency Constraints for Semi-Supervised Semantic Segmentation in CdZnTe Semiconductors","authors":"Peihao Li;Yan Fang;Man Liu;Huihui Bai;Anhong Wang;Yunchao Wei;Yao Zhao","doi":"10.1109/TIP.2025.3646474","DOIUrl":"10.1109/TIP.2025.3646474","url":null,"abstract":"Labeling Cadmium Zinc Telluride (CdZnTe) semiconductor images is challenging due to the low-contrast defect boundaries, necessitating annotators to cross-reference multiple views. These views share a single ground truth (GT), forming a unique “many-to-one” relationship. This characteristic renders advanced semi-supervised semantic segmentation (SSS) methods suboptimal, as they are generally limited by a “one-to-one” relationship, where each image is independently associated with its GT. Such limitation may lead to error accumulation in low-contrast regions, further exacerbating confirmation bias. To address this issue, we revisit the SSS pipeline from a group-oriented perspective and propose a human-inspired solution: the Intra-group Consistency Augmentation Framework (ICAF). First, we experimentally validate the inherent consistency constraints within CdZnTe groups, establishing a group-oriented baseline using the Intra-group View Sampling (IVS). Building on this insight, we introduce the Pseudo-label Correction Network (PCN) to enhance consistency representation, which consists of two key modules. The View Augmentation Module (VAM) improves boundary details by dynamically synthesizing a boundary-aware view through the aggregation of multiple views. In the View Correction Module (VCM), this synthesized view is paired with other views for information interaction, effectively emphasizing salient regions while minimizing noise. Extensive experiments demonstrate the effectiveness of our solution for CdZnTe materials. Leveraging DeepLabV3+ with a ResNet-101 backbone as our segmentation model, we achieve a 70.6% mIoU on the CdZnTe dataset using only 2 group-annotated data (5‰). The code is available at <uri>https://github.com/pipixiapipi/ICAF</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"759-769"},"PeriodicalIF":13.7,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145972028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vector-Quantization (VQ) based discrete generative models are widely used to learn powerful high-quality (HQ) priors for blind image restoration (BIR). In this paper, we diagnose the side effects of the discrete VQ process essential to VQ-based BIR methods: 1) it confines the representation capacity of the HQ codebook, 2) it is error-prone for code index prediction on low-quality (LQ) images, and 3) it under-values the importance of the input LQ image. These observations motivate us to learn a continuous feature representation of the HQ codebook for better restoration performance than the discrete VQ process. To further improve restoration fidelity, we propose a new Self-in-Cross-Attention (SinCA) module that augments the HQ codebook with features of the input LQ image and performs cross-attention between the LQ feature and the input-augmented codebook. In this way, SinCA leverages the input LQ image to enhance the representation of the codebook for restoration fidelity. Experiments on four typical VQ-based BIR methods demonstrate that, by replacing the VQ process with a transformer using our SinCA, they achieve better quantitative and qualitative performance on blind image super-resolution and blind face restoration. The code and pre-trained models are publicly released at https://github.com/lhy-85/SinCA
{"title":"Diagnosing and Improving Vector-Quantization-Based Blind Image Restoration","authors":"Hongyu Li;Tianyi Xu;Zengyou Wang;Xiantong Zhen;Ran Gu;David Zhang;Jun Xu","doi":"10.1109/TIP.2026.3651985","DOIUrl":"10.1109/TIP.2026.3651985","url":null,"abstract":"Vector-Quantization (VQ) based discrete generative models are widely used to learn powerful high-quality (HQ) priors for blind image restoration (BIR). In this paper, we diagnose the side-effects of discrete VQ process essential to VQ-based BIR methods: 1) confining the representation capacity of HQ codebook, 2) being error-prone for code index prediction on low-quality (LQ) images, and 3) under-valuing the importance of input LQ image. These motivate us to learn continuous feature representation of HQ codebook for better restoration performance than using discrete VQ process. To further improve the restoration fidelity, we propose a new Self-in-Cross-Attention (SinCA) module to augment the HQ codebook with the feature of input LQ image, and perform cross-attention between LQ feature and input-augmented codebook. By this way, our SinCA leverages the input LQ image to enhance the representation of codebook for restoration fidelity. Experiments on four typical VQ-based BIR methods demonstrate that, by replacing the VQ process with a transformer using our SinCA, they achieve better quantitative and qualitative performance on blind image super-resolution and blind face restoration. The code and pre-trained models are publicly released at <uri>https://github.com/lhy-85/SinCA</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"844-857"},"PeriodicalIF":13.7,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145961421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}