Pub Date: 2026-01-14 | DOI: 10.1109/TIP.2026.3652357
Toward Generative Understanding: Incremental Few-Shot Semantic Segmentation With Diffusion Models
Qun Li;Lu Huang;Fu Xiao;Na Zhao;Bir Bhanu
Incremental Few-shot Semantic Segmentation (iFSS) aims to learn novel classes with limited samples while preserving segmentation capability for base classes, addressing the challenge of continual learning of novel classes and catastrophic forgetting of previously seen classes. Existing methods mainly rely on techniques such as knowledge distillation and background learning, which, while partially effective, still suffer from issues such as feature drift and limited generalization to real-world novel classes, primarily due to a bidirectional coupling bottleneck between the learning of base classes and novel classes. To address these challenges, we propose, for the first time, a diffusion-based generative framework for iFSS. Specifically, we bridge the gap between generative and discriminative tasks through an innovative binary-to-RGB mask mapping mechanism, enabling pre-trained diffusion models to focus on target regions via class-specific semantic embedding optimization while sharpening foreground-background contrast with color embeddings. A lightweight post-processor then refines the generated images into high-quality binary masks. Crucially, by leveraging diffusion priors, our framework avoids complex training strategies. The optimization of class-specific semantic embeddings decouples the embedding spaces of base and novel classes, inherently preventing feature drift, mitigating catastrophic forgetting, and enabling rapid novel-class adaptation. Experimental results show that our method achieves state-of-the-art performance on the PASCAL-$5^{i}$ and COCO-$20^{i}$ datasets using much less data than other methods, and exhibits competitive results in cross-domain few-shot segmentation tasks. Project page: https://ifss-diff.github.io/
{"title":"Toward Generative Understanding: Incremental Few-Shot Semantic Segmentation With Diffusion Models","authors":"Qun Li;Lu Huang;Fu Xiao;Na Zhao;Bir Bhanu","doi":"10.1109/TIP.2026.3652357","DOIUrl":"10.1109/TIP.2026.3652357","url":null,"abstract":"Incremental Few-shot Semantic Segmentation (iFSS) aims to learn novel classes with limited samples while preserving segmentation capability for base classes, addressing the challenge of continual learning of novel classes and catastrophic forgetting of previously seen classes. Existing methods mainly rely on techniques such as knowledge distillation and background learning, which, while partially effective, still suffer from issues such as feature drift and limited generalization to real-world novel classes, primarily due to a bidirectional coupling bottleneck between the learning of base classes and novel classes. To address these challenges, we propose, for the first time, a diffusion-based generative framework for iFSS. Specifically, we bridge the gap between generative and discriminative tasks through an innovative binary-to-RGB mask mapping mechanism, enabling pre-trained diffusion models to focus on target regions via class-specific semantic embedding optimization while sharpening foreground-background contrast with color embeddings. A lightweight post-processor then refines the generated images into high-quality binary masks. Crucially, by leveraging diffusion priors, our framework avoids complex training strategies. The optimization of class-specific semantic embeddings decouples the embedding spaces of base and novel classes, inherently preventing feature drift, mitigating catastrophic forgetting, and enabling rapid novel-class adaptation. Experimental results show that our method achieves state-of-the-art performance on the PASCAL-<inline-formula> <tex-math>$5^{i}$ </tex-math></inline-formula> and COCO-<inline-formula> <tex-math>$20^{i}$ </tex-math></inline-formula> datasets using much less data than other methods, and exhibiting competitive results in cross-domain few-shot segmentation tasks. Project page: <uri>https://ifss-diff.github.io/</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"743-758"},"PeriodicalIF":13.7,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-14 | DOI: 10.1109/TIP.2026.3652371
EinsPT: Efficient Instance-Aware Pre-Training of Vision Foundation Models
Zhaozhi Wang;Yunjie Tian;Lingxi Xie;Yaowei Wang;Qixiang Ye
In this study, we introduce EinsPT, an efficient instance-aware pre-training paradigm designed to reduce the transfer gap between vision foundation models and downstream instance-level tasks. Unlike conventional image-level pre-training that relies solely on unlabeled images, EinsPT leverages both image reconstruction and instance annotations to learn representations that are spatially coherent and instance discriminative. To achieve this efficiently, we propose a proxy–foundation architecture that decouples high-resolution and low-resolution learning: the foundation model processes masked low-resolution images for global semantics, while a lightweight proxy model operates on complete high-resolution images to preserve fine-grained details. The two branches are jointly optimized through reconstruction and instance-level prediction losses on fused features. Extensive experiments demonstrate that EinsPT consistently enhances recognition accuracy across various downstream tasks with substantially reduced computational cost, while qualitative results further reveal improved instance perception and completeness in visual representations. Code is available at github.com/feufhd/EinsPT
{"title":"EinsPT: Efficient Instance-Aware Pre-Training of Vision Foundation Models","authors":"Zhaozhi Wang;Yunjie Tian;Lingxi Xie;Yaowei Wang;Qixiang Ye","doi":"10.1109/TIP.2026.3652371","DOIUrl":"10.1109/TIP.2026.3652371","url":null,"abstract":"In this study, we introduce EinsPT, an efficient instance-aware pre-training paradigm designed to reduce the transfer gap between vision foundation models and downstream instance-level tasks. Unlike conventional image-level pre-training that relies solely on unlabeled images, EinsPT leverages both image reconstruction and instance annotations to learn representations that are spatially coherent and instance discriminative. To achieve this efficiently, we propose a proxy–foundation architecture that decouples high-resolution and low-resolution learning: the foundation model processes masked low-resolution images for global semantics, while a lightweight proxy model operates on complete high-resolution images to preserve fine-grained details. The two branches are jointly optimized through reconstruction and instance-level prediction losses on fused features. Extensive experiments demonstrate that EinsPT consistently enhances recognition accuracy across various downstream tasks with substantially reduced computational cost, while qualitative results further reveal improved instance perception and completeness in visual representations. Code is available at github.com/feufhd/EinsPT","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"786-799"},"PeriodicalIF":13.7,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-14 | DOI: 10.1109/TIP.2025.3646474
Harnessing Group-Oriented Consistency Constraints for Semi-Supervised Semantic Segmentation in CdZnTe Semiconductors
Peihao Li;Yan Fang;Man Liu;Huihui Bai;Anhong Wang;Yunchao Wei;Yao Zhao
Labeling Cadmium Zinc Telluride (CdZnTe) semiconductor images is challenging due to the low-contrast defect boundaries, requiring annotators to cross-reference multiple views. These views share a single ground truth (GT), forming a unique “many-to-one” relationship. This characteristic renders advanced semi-supervised semantic segmentation (SSS) methods suboptimal, as they are generally limited by a “one-to-one” relationship, where each image is independently associated with its GT. Such a limitation may lead to error accumulation in low-contrast regions, further exacerbating confirmation bias. To address this issue, we revisit the SSS pipeline from a group-oriented perspective and propose a human-inspired solution: the Intra-group Consistency Augmentation Framework (ICAF). First, we experimentally validate the inherent consistency constraints within CdZnTe groups, establishing a group-oriented baseline using Intra-group View Sampling (IVS). Building on this insight, we introduce the Pseudo-label Correction Network (PCN) to enhance consistency representation, which consists of two key modules. The View Augmentation Module (VAM) improves boundary details by dynamically synthesizing a boundary-aware view through the aggregation of multiple views. In the View Correction Module (VCM), this synthesized view is paired with other views for information interaction, effectively emphasizing salient regions while minimizing noise. Extensive experiments demonstrate the effectiveness of our solution for CdZnTe materials. Leveraging DeepLabV3+ with a ResNet-101 backbone as our segmentation model, we achieve a 70.6% mIoU on the CdZnTe dataset using only two groups of annotated data (5‰). The code is available at https://github.com/pipixiapipi/ICAF
IEEE Transactions on Image Processing, vol. 35, pp. 759-769.
Pub Date: 2026-01-13 | DOI: 10.1109/TIP.2026.3651985
Diagnosing and Improving Vector-Quantization-Based Blind Image Restoration
Hongyu Li;Tianyi Xu;Zengyou Wang;Xiantong Zhen;Ran Gu;David Zhang;Jun Xu
Vector-Quantization (VQ) based discrete generative models are widely used to learn powerful high-quality (HQ) priors for blind image restoration (BIR). In this paper, we diagnose the side effects of the discrete VQ process essential to VQ-based BIR methods: 1) confining the representation capacity of the HQ codebook, 2) being error-prone in code index prediction on low-quality (LQ) images, and 3) under-valuing the importance of the input LQ image. These observations motivate us to learn a continuous feature representation of the HQ codebook for better restoration performance than the discrete VQ process. To further improve the restoration fidelity, we propose a new Self-in-Cross-Attention (SinCA) module to augment the HQ codebook with the features of the input LQ image, and perform cross-attention between the LQ features and the input-augmented codebook. In this way, our SinCA leverages the input LQ image to enhance the representation of the codebook for restoration fidelity. Experiments on four typical VQ-based BIR methods demonstrate that, by replacing the VQ process with a transformer using our SinCA, they achieve better quantitative and qualitative performance on blind image super-resolution and blind face restoration. The code and pre-trained models are publicly released at https://github.com/lhy-85/SinCA
IEEE Transactions on Image Processing, vol. 35, pp. 844-857.
Pub Date: 2026-01-13 | DOI: 10.1109/TIP.2026.3652021
Self-Supervised Unfolding Network With Shared Reflectance Learning for Low-Light Image Enhancement
Jia Liu;Yu Luo;Guanghui Yue;Jie Ling;Liang Liao;Chia-Wen Lin;Guangtao Zhai;Wei Zhou
Recently, incorporating Retinex theory with unfolding networks has attracted increasing attention in the low-light image enhancement (LIE) field. However, existing methods have two limitations, i.e., ignoring the modeling of the physical prior of Retinex theory and relying on a large amount of paired data. To advance this field, we propose a novel self-supervised unfolding network, named S2UNet, for the LIE task. Specifically, we formulate a novel optimization model based on the principle that content-consistent images under different illumination should share the same reflectance. The model simultaneously decomposes two illumination-different images into a shared reflectance component and two independent illumination components. Due to the absence of the normal-light image, we process the low-light image with gamma correction to create the illumination-different image pair. Then, we translate this model into a multi-stage unfolding network, in which each stage alternately optimizes the shared reflectance component and the respective illumination components of the two images. During progressive multi-stage optimization, the network inherently encodes the reflectance consistency prior by jointly estimating an optimal reflectance across varying illumination conditions. Finally, considering the presence of noise in low-light images and to suppress noise amplification, we propose a self-supervised denoising mechanism. Extensive experiments on nine benchmark datasets demonstrate that our proposed S2UNet outperforms state-of-the-art unsupervised methods in terms of both quantitative metrics and visual quality, while achieving competitive performance compared to supervised methods. The source code will be available at https://github.com/J-Liu-DL/S2UNet
IEEE Transactions on Image Processing, vol. 35, pp. 800-815.
Pub Date: 2026-01-13 | DOI: 10.1109/TIP.2026.3651835
SAMURAI: Motion-Aware Memory for Training-Free Visual Object Tracking With SAM 2
Cheng-Yeng Yang;Hsiang-Wei Huang;Wenhao Chai;Zhongyu Jiang;Jenq-Neng Hwang
The Segment Anything Model 2 (SAM 2) has demonstrated exceptional performance in object segmentation tasks but encounters challenges in visual object tracking, particularly in handling crowded scenes with fast-moving or self-occluding objects. Additionally, its fixed-window memory mechanism indiscriminately retains past frames, leading to error accumulation. This issue results in incorrect memory retention during occlusions, causing the model to condition future predictions on unreliable features and leading to identity switches or drift in crowded scenes. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 that integrates temporal motion cues with a novel motion-aware memory selection strategy. SAMURAI effectively predicts object motion and refines mask selection, achieving robust and precise tracking without requiring retraining or fine-tuning. It demonstrates strong training-free performance across multiple VOT benchmark datasets, underscoring its generalization capability. SAMURAI achieves state-of-the-art performance on LaSOText, GOT-10k, and TrackingNet, while also delivering competitive results on LaSOT, VOT2020-ST, VOT2022-ST, and VOS benchmarks such as SA-V. These results highlight SAMURAI’s robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments with an optimized memory selection mechanism. Code and results are available at https://github.com/yangchris11/samurai
IEEE Transactions on Image Processing, vol. 35, pp. 970-982.
Pub Date: 2026-01-12 | DOI: 10.1109/TIP.2025.3650664
Reviewer Summary for Transactions on Image Processing
IEEE Transactions on Image Processing, vol. 34, pp. 8684-8708.
Pub Date: 2026-01-12 | DOI: 10.1109/TIP.2025.3650387
TSCCD: Temporal Self-Construction Cross-Domain Learning for Unsupervised Hyperspectral Change Detection
Tianyuan Zhou;Fulin Luo;Chuan Fu;Tan Guo;Bo Du;Xinbo Gao;Liangpei Zhang
Multi-temporal hyperspectral imagery (HSI) has become a powerful tool for change detection (CD) owing to its rich spectral signatures and detailed spatial information. Nevertheless, the application of paired HSIs is constrained by the scarcity of annotated training data. While unsupervised domain adaptation (UDA) offers a potential solution by transferring change detection knowledge from source to target domains, two critical limitations persist: 1) the labor-intensive process of acquiring and annotating source-domain paired samples, and 2) the suboptimal transfer performance caused by substantial cross-domain distribution discrepancies. To address these challenges, we present a Temporal Self-Construction Cross-Domain learning (TSCCD) framework for UDA-based HSI-CD. Our TSCCD framework introduces an innovative temporal self-construction mechanism that synthesizes bi-temporal source-domain data from existing HSI classification datasets while simultaneously performing initial data-level alignment. Furthermore, we develop a reweighted amplitude maximum mean discrepancy (MMD) metric to enhance feature-level domain adaptation. The proposed architecture incorporates an attention-based Kolmogorov-Arnold network (KAN) with high-frequency feature augmentation within an encoder-decoder structure to effectively capture change characteristics. Comprehensive experiments conducted on three benchmark HSI datasets demonstrate that TSCCD achieves superior performance compared to current state-of-the-art methods in HSI change detection tasks. Codes are available at https://github.com/Zhoutya/TSCCD.
IEEE Transactions on Image Processing, vol. 35, pp. 830-843.
Pub Date: 2026-01-12 | DOI: 10.1109/TIP.2025.3650045
IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting
Hao Fu;Hanbin Zhao;Jiahua Dong;Henghui Ding;Chao Zhang;Hui Qian
Recent pre-trained vision-language models (PT-VLMs) often face a Multi-Domain Task Incremental Learning (MTIL) scenario in practice, where several classes and domains of multi-modal tasks arrive incrementally. Without access to previously seen tasks and unseen tasks, memory-constrained MTIL suffers from forward and backward forgetting. To alleviate the above challenges, parameter-efficient fine-tuning (PEFT) techniques, such as prompt tuning, are employed to adapt the PT-VLM to the diverse incrementally learned tasks. To achieve effective new task adaptation, existing methods only consider the effect of PEFT strategy selection, but neglect the influence of PEFT parameter setting (e.g., prompting). In this paper, we tackle the challenge of optimizing prompt designs for diverse tasks in MTIL and propose an Instance-Aware Prompting (IAP) framework. Specifically, our Instance-Aware Gated Prompting (IA-GP) strategy enhances adaptation to new tasks while mitigating forgetting by adaptively assigning prompts across transformer layers at the instance level. Our Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP) improves the task adaptation process by determining an accurate task-label-related confidence score for each instance. Experimental evaluations across 11 datasets, using three performance metrics, demonstrate the effectiveness of our proposed method. The source codes are available at https://github.com/FerdinandZJU/IAP
IEEE Transactions on Image Processing, vol. 35, pp. 717-731.
Pub Date: 2026-01-01 | DOI: 10.1109/TIP.2025.3648203
Reflectance Prediction-Based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds
Hao Jing, Anhong Wang, Yifan Zhang, Donghan Bu, Junhui Hou
In intelligent transportation systems, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among connected agents, such as vehicles and infrastructures, under restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which faces transmission burdens from reflectance encoding and limited detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded non-reflectance compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as the student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our cross-source distillation training strategy (CDTS) equips the student detector with robustness to low-quality compressed data while preserving the accuracy benefits of raw data through transferred distillation knowledge. Experimental results on the KITTI and DAIR-V2X-V datasets demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. We will release the code publicly at https://github.com/HaoJing-SX/RPKD.
{"title":"Reflectance Prediction-Based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds.","authors":"Hao Jing, Anhong Wang, Yifan Zhang, Donghan Bu, Junhui Hou","doi":"10.1109/TIP.2025.3648203","DOIUrl":"10.1109/TIP.2025.3648203","url":null,"abstract":"<p><p>Regarding intelligent transportation systems, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among connected agents, such as vehicles and infrastructures, under restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which faces transmission burdens from reflectance encoding and limited detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded non-reflectance compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as the student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our cross-source distillation training strategy (CDTS) equips the student detector with robustness to low-quality compressed data while preserving the accuracy benefits of raw data through transferred distillation knowledge. Experimental results on the KITTI and DAIR-V2X-V datasets demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. We will release the code publicly at https://github.com/HaoJing-SX/RPKD.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":"85-97"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145893537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}