Pub Date: 2026-01-01. DOI: 10.1109/TIP.2026.3657214
Yun Zhang, Feifan Chen, Na Li, Zhiwei Guo, Xu Wang, Fen Miao, Sam Kwong
A colored point cloud, comprising geometry and attribute components, is one of the mainstream representations enabling realistic and immersive 3D applications. To generate large-scale and denser colored point clouds, we propose a deep learning-based Joint Geometry and Attribute Up-sampling (JGAU) method, which learns to model both geometry and attribute patterns and leverages spatial attribute correlation. Firstly, we establish and release a large-scale dataset for colored point cloud up-sampling, named SYSU-PCUD, which contains 121 large-scale colored point clouds with diverse geometry and attribute complexities in six categories and at four sampling rates. Secondly, to improve the quality of up-sampled point clouds, we propose a deep learning-based JGAU framework that up-samples geometry and attributes jointly. It consists of a geometry up-sampling network and an attribute up-sampling network, where the latter leverages the up-sampled auxiliary geometry to model neighborhood correlations of the attributes. Thirdly, we propose two coarse attribute up-sampling methods, Geometric Distance Weighted Attribute Interpolation (GDWAI) and Deep Learning-based Attribute Interpolation (DLAI), to generate coarsely up-sampled attributes for each point. We then propose an attribute enhancement module that refines the up-sampled attributes and generates high-quality point clouds by further exploiting intrinsic attribute and geometry patterns. Extensive experiments show that the Peak Signal-to-Noise Ratio (PSNR) values achieved by the proposed JGAU are 33.90 dB, 32.10 dB, 31.10 dB, and 30.39 dB at up-sampling rates of 4×, 8×, 12×, and 16×, respectively. Compared to state-of-the-art schemes, JGAU achieves significant average PSNR gains of 2.32 dB, 2.47 dB, 2.28 dB, and 2.11 dB at the four up-sampling rates, respectively. The code is released at https://github.com/SYSU-Video/JGAU.
Title: Deep Learning-Based Joint Geometry and Attribute Up-Sampling for Large-Scale Colored Point Clouds. IEEE Transactions on Image Processing, vol. PP, pp. 1305-1320.
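The GDWAI step described above amounts to inverse-geometric-distance interpolation of neighbor colors. Below is a minimal numpy sketch of that idea; the function and variable names are illustrative and not taken from the released code:

```python
import numpy as np

def gdwai(coarse_xyz, coarse_rgb, dense_xyz, k=3, eps=1e-8):
    """Coarse attribute up-sampling: each up-sampled point's color is the
    inverse-geometric-distance weighted average of the colors of its k
    nearest coarse neighbors (a sketch of the GDWAI idea)."""
    # pairwise distances, shape (N_dense, N_coarse)
    d = np.linalg.norm(dense_xyz[:, None, :] - coarse_xyz[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]            # k nearest coarse points
    nd = np.take_along_axis(d, idx, axis=1)       # their distances
    w = 1.0 / (nd + eps)                          # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)
    return (w[..., None] * coarse_rgb[idx]).sum(axis=1)

# toy example: two coarse points on the x-axis, red and blue
coarse_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
coarse_rgb = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
dense_xyz = np.array([[0.5, 0.0, 0.0], [0.1, 0.0, 0.0]])
up_rgb = gdwai(coarse_xyz, coarse_rgb, dense_xyz, k=2)
```

The midpoint receives an even red/blue mix, while the point close to the red neighbor stays mostly red; in JGAU this coarse estimate is then refined by the attribute enhancement network.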
Diffusion-based image super-resolution (SR) has shown strong potential in recovering high-fidelity details from low-resolution inputs. However, the need for tens or hundreds of sampling steps leads to substantial inference latency. Recent works attempt to accelerate this process via knowledge distillation, but often rely solely on pixel-level loss or overlook the fact that diffusion models capture different information across time steps. To address this, we propose TAD-SR, a time-aware diffusion distillation framework. Specifically, we introduce a novel score distillation strategy to align the score functions between the outputs of the student and teacher models after minor noise perturbation. This distillation strategy eliminates the inherent bias in score distillation sampling (SDS) and enables the student models to focus more on high-frequency image details by sampling at smaller time steps. We further introduce a time-aware discriminator that exploits the teacher's knowledge to differentiate real and synthetic samples across different noise scales, using explicit temporal conditioning. Extensive experiments on SR tasks demonstrate that TAD-SR outperforms existing single-step diffusion methods and achieves performance on par with multi-step state-of-the-art models.
Title: One Step Diffusion-Based Super-Resolution With Time-Aware Distillation. Authors: Xiao He, Huaao Tang, Zhijun Tu, Junchao Zhang, Kun Cheng, Hanting Chen, Yong Guo, Mingrui Zhu, Jie Hu, Nannan Wang, Xinbo Gao. DOI: 10.1109/TIP.2026.3672376. IEEE Transactions on Image Processing, vol. PP, pp. 2928-2940.
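The core of the distillation objective described above can be sketched in a few lines: perturb the student's one-step output with a small amount of noise at a small time step, then penalize the gap between the teacher's and student's score (noise) predictions there. The closed-form "models" below are purely illustrative stand-ins for the real networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two denoisers' noise predictions; the real teacher and
# student are diffusion networks, these closed forms are only illustrative.
def teacher_eps(x_t, t):
    return 0.50 * x_t / (t + 1.0)

def student_eps(x_t, t):
    return 0.45 * x_t / (t + 1.0)

def score_alignment_loss(x_student, t_small=1.0, sigma=0.05):
    """Perturb the student output with minor noise at a small time step,
    then align the two models' score predictions there -- the gist of the
    time-aware score-distillation objective (a sketch, not the exact loss)."""
    x_t = x_student + sigma * rng.standard_normal(x_student.shape)
    diff = student_eps(x_t, t_small) - teacher_eps(x_t, t_small)
    return float(np.mean(diff ** 2))

x_fake = rng.standard_normal((4, 8, 8))   # pretend one-step student SR output
loss = score_alignment_loss(x_fake)
```

Sampling at small time steps keeps the perturbation mild, which is what lets the student focus on high-frequency detail rather than coarse structure.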
Although video generation and editing models have advanced significantly, individual models remain restricted to specific tasks, often failing to meet diverse user needs. Effectively coordinating these models in pipelines can unlock a wide range of video generation and editing capabilities. However, manual orchestration is complex, time-consuming, and requires deep expertise in model performance and limitations. To address these challenges, we propose the Semantic Planning Agent (SPAgent), a novel system that automatically coordinates state-of-the-art open-source models to fulfill complex user intents. To equip SPAgent with robust orchestration capabilities, we introduce a three-step framework: 1) decoupled intent recognition to accurately parse multi-modal inputs; 2) principle-guided route planning to design effective execution chains; and 3) capability-based model selection to identify the optimal tools for each sub-task. To facilitate training, we curate a comprehensive multi-task generative video dataset. Furthermore, we enhance SPAgent with a video quality evaluation module, enabling it to autonomously assess and incorporate new models into its tool library without human intervention. Experimental results demonstrate that SPAgent effectively coordinates models to generate and edit high-quality videos, exhibiting superior versatility and adaptability across various tasks.
Title: SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing. Authors: Rong-Cheng Tu, Wenhao Sun, Zhao Jin, Jingyi Liao, Jiaxing Huang, Dacheng Tao. DOI: 10.1109/TIP.2026.3673949. IEEE Transactions on Image Processing, vol. PP, pp. 3085-3098.
Unsupervised feature selection (UFS) has recently gained attention for its effectiveness in processing unlabeled high-dimensional data. However, existing methods overlook the intrinsic causal mechanisms within the data, resulting in the selection of irrelevant features and poor interpretability. Additionally, previous graph-based methods fail to account for the differing impacts of non-causal and causal features in constructing the similarity graph, which leads to false links in the generated graph. To address these issues, a novel UFS method, called Causally-Aware UnSupErvised Feature Selection learning (CAUSE-FS), is proposed. CAUSE-FS introduces a novel causal regularizer that reweights samples to balance the confounding distribution of each treatment feature. This regularizer is subsequently integrated into a generalized unsupervised spectral regression model to mitigate spurious associations between features and clustering labels, thus achieving causal feature selection. Furthermore, CAUSE-FS employs causality-guided hierarchical clustering to partition features with varying causal contributions into multiple granularities. By integrating similarity graphs learned adaptively at different granularities, CAUSE-FS increases the importance of causal features when constructing the fused similarity graph to capture the reliable local structure of data. Extensive experimental results demonstrate the superiority of CAUSE-FS over state-of-the-art methods, with its interpretability further validated through feature visualization.
Title: Causally-Aware Unsupervised Feature Selection Learning. Authors: Zongxin Shen, Yanyong Huang, Dongjie Wang, Minbo Ma, Fengmao Lv, Tianrui Li. DOI: 10.1109/TIP.2026.3654354. IEEE Transactions on Image Processing, vol. PP, pp. 1011-1024.
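The causal regularizer's reweighting idea can be illustrated with a toy balancing problem: given one binary "treatment" feature and confounding features whose distribution differs between the two groups, learn per-sample weights that close the gap between the groups' weighted confounder means. This is a generic confounder-balancing sketch, not CAUSE-FS's actual regularizer or optimizer:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy data: binary treatment feature t and two confounders C whose
# distribution is shifted in the t=1 group
n = 200
t = (rng.random(n) > 0.5).astype(float)
C = rng.standard_normal((n, 2)) + 0.8 * t[:, None]

def weighted_means(w):
    wt, wc = w * t, w * (1.0 - t)
    m1 = (C * wt[:, None]).sum(0) / wt.sum()
    m0 = (C * wc[:, None]).sum(0) / wc.sum()
    return m1, m0

def balance_loss(w):
    # squared gap between weighted confounder means of the two groups
    m1, m0 = weighted_means(w)
    return float(((m1 - m0) ** 2).sum())

def balance_grad(w):
    # analytic gradient of balance_loss w.r.t. the sample weights
    wt, wc = w * t, w * (1.0 - t)
    m1, m0 = weighted_means(w)
    d = m1 - m0
    g1 = t[:, None] * (C - m1) / wt.sum()
    g0 = (1.0 - t)[:, None] * (C - m0) / wc.sum()
    return 2.0 * ((g1 - g0) @ d)

# projected gradient descent on per-sample weights
w = np.ones(n)
loss_before = balance_loss(w)
for _ in range(500):
    w = np.clip(w - 2.0 * balance_grad(w), 0.05, None)
loss_after = balance_loss(w)
```

Once the confounding distribution is balanced, associations that survive the reweighting are less likely to be spurious, which is the intuition behind plugging such a regularizer into the spectral regression model.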
Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain by learning domain-invariant representations. Motivated by the recent success of Vision Transformers (ViTs), several UDA approaches have adopted ViT architectures to exploit fine-grained patch-level representations; we refer to these collectively as Transformer-based Domain Adaptation (TransDA), as distinct from CNN-based methods. However, we make a key observation in TransDA: due to inherent domain shifts, patches (tokens) from different semantic categories across domains may exhibit abnormally high similarities, which can mislead the self-attention mechanism and degrade adaptation performance. To address this, we propose a novel Patch-Adaptation Transformer (PATrans), which first identifies similarity-anomalous patches and then adaptively suppresses their negative impact on domain alignment, i.e., token calibration. Specifically, we introduce a Patch-Adaptation Attention (PAA) mechanism to replace the standard self-attention mechanism, consisting of a weight-shared triple-branch mixed attention mechanism and a patch-level domain discriminator. The mixed attention integrates self-attention and cross-attention to enhance intra-domain feature modeling and inter-domain similarity estimation. Meanwhile, the patch-level domain discriminator quantifies the anomaly probability of each patch, enabling dynamic reweighting to mitigate the impact of unreliable patch correspondences. Furthermore, we introduce a contrastive attention regularization strategy, which leverages category-level information in a contrastive learning framework to promote class-consistent attention distributions. Extensive experiments on four benchmark datasets demonstrate that PATrans attains significant improvements over existing state-of-the-art UDA methods (e.g., 89.2% on VisDA-2017). Code is available at: https://github.com/YSY145/PATrans
Title: Token Calibration for Transformer-Based Domain Adaptation. Authors: Xiaowei Fu, Shiyu Ye, Chenxu Zhang, Fuxiang Huang, Xin Xu, Lei Zhang. DOI: 10.1109/TIP.2025.3647367. IEEE Transactions on Image Processing, vol. 35, pp. 57-68.
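The dynamic-reweighting idea — down-weighting each key/value patch by its estimated anomaly probability before renormalizing the attention — can be sketched as follows. Here `anomaly_prob` stands in for the patch-level domain discriminator's output; this is a generic illustration of the reweighting, not the full triple-branch PAA module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reweighted_attention(q, k, v, anomaly_prob):
    """Attention where every key/value patch is scaled by (1 - anomaly
    probability) and the rows are renormalized, so unreliable cross-domain
    patch correspondences contribute less to the output."""
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    attn = attn * (1.0 - anomaly_prob)[None, :]   # suppress anomalous patches
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 8))
p = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])      # patch 2 flagged anomalous
out = reweighted_attention(q, k, v, p)

# a fully suppressed patch no longer influences the output at all
v_alt = v.copy(); v_alt[2] = 999.0
out_alt = reweighted_attention(q, k, v_alt, p)
```

A soft anomaly probability (between 0 and 1) gives a graded suppression rather than the hard masking shown in the toy example.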
Pub Date: 2026-01-01. DOI: 10.1109/TIP.2025.3647207
Yang Xu;Jian Zhu;Danfeng Hong;Zhihui Wei;Zebin Wu
Hyperspectral image (HSI) and multispectral image (MSI) fusion is a hot topic in the remote sensing community. A high-resolution HSI (HR-HSI) can be obtained by fusing a low-resolution HSI (LR-HSI) and a high-resolution MSI (HR-MSI) or RGB image. However, most deep learning-based methods require a large number of HR-HSIs for supervised training, which are rarely available in practice. In this paper, we propose a coupled diffusion posterior sampling (CDPS) method for HSI and MSI fusion in which HR-HSIs are no longer required in the training process. Because the LR-HSI contains the spectral information and the HR-MSI contains the spatial information of the captured scene, we design an unsupervised strategy that learns the required diffusion priors directly and solely from the input test image pair (the LR-HSI and HR-MSI themselves). Then, a coupled diffusion posterior sampling method is proposed to introduce the two priors into the diffusion posterior sampling, which leverages the observed LR-HSI and HR-MSI as fidelity terms. Experimental results demonstrate that the proposed method outperforms other state-of-the-art unsupervised HSI and MSI fusion methods. Additionally, this method utilizes smaller networks that are simpler and easier to train without other data.
Title: Coupled Diffusion Posterior Sampling for Unsupervised Hyperspectral and Multispectral Images Fusion. IEEE Transactions on Image Processing, vol. 35, pp. 69-84.
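The role of the fidelity terms in posterior sampling can be illustrated with a linear toy problem: after each denoising step, the estimate is nudged down the gradient of the observation-consistency term. The sketch below uses a random linear operator `A` as a stand-in for the spatial/spectral degradation and an identity "denoiser"; it shows the generic DPS-style correction, not the paper's coupled two-prior sampler:

```python
import numpy as np

rng = np.random.default_rng(0)

# linear observation model y = A x; A stands in for the degradation
# (e.g., spatial downsampling of the HR-HSI)
n, m = 16, 4
A = rng.standard_normal((m, n)) / np.sqrt(n)
x_true = rng.standard_normal(n)
y = A @ x_true

def guided_step(x, y, A, eta=0.05):
    """One fidelity-guided update: move the current estimate down the
    gradient of the data-fidelity term ||A x - y||^2."""
    return x - eta * 2.0 * A.T @ (A @ x - y)

x = rng.standard_normal(n)          # would be the denoiser output in CDPS
res_before = float(np.linalg.norm(A @ x - y))
for _ in range(300):
    x = guided_step(x, y, A)
res_after = float(np.linalg.norm(A @ x - y))
```

In CDPS two such fidelity terms, one from the LR-HSI and one from the HR-MSI, steer the sampling trajectory simultaneously while the learned diffusion priors keep the iterates on the image manifold.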
Text removal is an important task in processing both scene and document images. However, existing scene text removal (STR) methods focus primarily on scene text images. STR models trained on scene text images perform poorly on document images with dense, complex textured backgrounds. We discover that the limitations of existing methods can be attributed to the difficulties of background feature estimation in the regions to be erased, which is based on knowledge from neighboring regions in the input images and priors learned from the training data. Background feature estimation degrades under cross-domain scenarios, compromising the quality of STR results. To address these issues, we introduce DiffEraser, a novel text removal framework that leverages prior knowledge from the Latent Diffusion Model (LDM) for removing text in both scene and document images. Our DiffEraser incorporates two key innovations to fully exploit the prior knowledge of the LDM. First, we replace the conventional Variational Auto-Encoder (VAE) encoder with a Diffusion-Prior (DP) encoder, aiming to integrate the heterogeneous information from the LDM prior knowledge in latent space with the multi-level encoded features of the input image. Second, we introduce a Latent-Fusion (LF) decoder that integrates the heterogeneous features from both the LDM and DP encoders to generate high-quality text-erased results. To evaluate the generalization performance of our DiffEraser, we focus on cross-domain protocols and construct a document image dataset, NPID295, which contains 295 types of passports and identity cards. Notably, when trained on a scene text dataset, DiffEraser significantly outperforms existing STR methods on the challenging NPID295 dataset. The resources of this work will be available online upon acceptance.
Title: DiffEraser: Generalized Text Erasure Based on Latent Diffusion Prior. Authors: Zhihao Chen, Yongqi Chen, Changsheng Chen, Shunquan Tan, Jiwu Huang. DOI: 10.1109/TIP.2026.3659329. IEEE Transactions on Image Processing, vol. PP, pp. 2138-2151.
Pub Date: 2026-01-01. DOI: 10.1109/TIP.2026.3663926
Jintao Li, Xinming Wu, Xianwen Zhang, Xin Du, Xiaoming Sun, Bao Deng, Guangyu Wang
The limitations of seismic vertical resolution pose significant challenges for the identification of thin beds. Improving the vertical resolution of seismic data using deep learning methods often encounters challenges related to unrealistic outputs and limited generalization. To address these challenges, we propose a novel framework that improves the fidelity and generalization of seismic super-resolution. Our approach begins with the generation of realistic synthetic training data that aligns with the structural and amplitude characteristics of field surveys. We then introduce an enhanced 2D network with 3D awareness, which builds on the 2D Swin-Transformer and 3D convolution blocks to effectively capture 3D spatial features while maintaining computational efficiency. This network addresses the limitations of traditional 2D approaches by reducing stitching artifacts and improving spatial consistency. Finally, we develop a prior-informed fine-tuning strategy using field data without the need for labels, which incorporates a self-supervised data consistency loss and a spectral matching loss based on prior knowledge. This strategy ensures that the super-resolution results preserve the original low frequency information while yielding a spectral distribution as expected. Experiments on multiple field datasets demonstrate the robustness and generalization capability of our method, making it a practical solution for seismic resolution enhancement in diverse field datasets.
High-Fidelity Seismic Super-Resolution Using Prior-Informed Deep Learning With 3D Awareness. Jintao Li, Xinming Wu, Xianwen Zhang, Xin Du, Xiaoming Sun, Bao Deng, Guangyu Wang. IEEE Transactions on Image Processing, pp. 2152-2166. DOI: 10.1109/TIP.2026.3663926.
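The two label-free fine-tuning losses described in this abstract can be illustrated on a single 1-D trace. This is a minimal sketch under our own assumptions: the function name, trace shapes, and the `cutoff` parameter are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def fine_tune_losses(sr_trace, lr_trace, target_spectrum, cutoff):
    """Sketch of a self-supervised data-consistency loss and a
    prior-based spectral matching loss on a single 1-D trace.

    sr_trace: super-resolved trace; lr_trace: original trace (same
    length here for simplicity); target_spectrum: prior-informed
    amplitude spectrum; cutoff: number of low-frequency bins that
    the result must preserve.
    """
    sr_f = np.fft.rfft(sr_trace)
    lr_f = np.fft.rfft(lr_trace)
    # Data consistency: the low-frequency band of the super-resolved
    # trace must reproduce the original data.
    data_consistency = np.mean(np.abs(sr_f[:cutoff] - lr_f[:cutoff]) ** 2)
    # Spectral matching: the full amplitude spectrum should follow
    # the expected (prior-informed) distribution.
    spectral_match = np.mean((np.abs(sr_f) - target_spectrum) ** 2)
    return data_consistency, spectral_match
```

In a fine-tuning loop, the two terms would be weighted and minimized together; here the FFT form makes explicit that only frequencies above `cutoff` are free to change.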
Pub Date : 2026-01-01. DOI: 10.1109/TIP.2026.3652431
Shuai Gong, Chaoran Cui, Xiaolin Dong, Xiushan Nie, Lei Zhu, Xiaojun Chang
Federated Domain Generalization (FedDG) aims to train a globally generalizable model on data from decentralized, heterogeneous clients. While recent work has adapted vision-language models for FedDG using prompt learning, the prevailing "one-prompt-fits-all" paradigm struggles with sample diversity, causing a marked performance decline on personalized samples. The Mixture of Experts (MoE) architecture offers a promising solution for specialization. However, existing MoE-based prompt learning methods suffer from two key limitations: coarse image-level expert assignment and high communication costs from parameterized routers. To address these limitations, we propose TRIP, a Token-level pRompt mIxture with Parameter-free routing framework for FedDG. TRIP treats prompts as multiple experts and assigns individual tokens within an image to distinct experts, facilitating the capture of fine-grained visual patterns. To ensure communication efficiency, TRIP introduces a parameter-free routing mechanism based on capacity-aware clustering and Optimal Transport (OT). First, tokens are grouped into capacity-aware clusters to ensure balanced workloads. These clusters are then assigned to experts via OT, stabilized by mapping cluster centroids to static, non-learnable keys. The final instance-specific prompt is synthesized by aggregating experts, weighted by the number of tokens assigned to each. Extensive experiments across four benchmarks demonstrate that TRIP achieves state-of-the-art generalization while communicating as few as 1K parameters. Our code is available at https://github.com/GongShuai8210/TRIP.
Token-Level Prompt Mixture With Parameter-Free Routing for Federated Domain Generalization. IEEE Transactions on Image Processing, pp. 656-669.
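The routing pipeline described in this abstract (balanced clustering, then cluster-to-expert assignment, then token-count weighting) can be sketched as follows. This is a hypothetical simplification: the 1-D-projection split stands in for capacity-aware clustering, and a greedy nearest-key assignment stands in for the Optimal Transport step; all names and shapes are our own.

```python
import numpy as np

def route_tokens(tokens, expert_keys, n_clusters):
    """Parameter-free token routing sketch.

    tokens: (N, D) token features; expert_keys: (E, D) static,
    non-learnable keys. Returns per-expert mixture weights that
    sum to 1.
    """
    n, _ = tokens.shape
    # Capacity-aware clustering (crude stand-in): sort tokens by
    # their projection onto the leading principal direction, then
    # split into equal-size clusters so workloads stay balanced.
    _, _, vt = np.linalg.svd(tokens - tokens.mean(0), full_matrices=False)
    order = np.argsort(tokens @ vt[0])
    clusters = np.array_split(order, n_clusters)

    centroids = np.stack([tokens[c].mean(0) for c in clusters])
    # Assign each cluster to its nearest static key (a greedy
    # stand-in for the OT assignment in the paper).
    dists = np.linalg.norm(centroids[:, None] - expert_keys[None], axis=-1)
    assign = np.argmin(dists, axis=1)

    # Expert weights = fraction of tokens routed to each expert.
    weights = np.zeros(len(expert_keys))
    for cluster, expert in zip(clusters, assign):
        weights[expert] += len(cluster)
    return weights / n
```

The final instance-specific prompt would then be `weights @ expert_prompts`; nothing in the router is learned, which is what keeps the communicated parameter count small.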
Pub Date : 2026-01-01. DOI: 10.1109/TIP.2026.3657165
Huizhang Yang, Liyuan Chen, Shao-Shan Zuo, Zhong Liu, Jian Yang
Synthetic Aperture Radar (SAR) imaging relies on focusing algorithms to transform raw measurement data into radar images. These algorithms require knowledge of SAR system parameters such as wavelength, center slant range, fast-time sampling rate, pulse repetition interval, waveform, and platform speed. However, in non-cooperative scenarios, or when metadata is corrupted, these parameters are unavailable, rendering traditional algorithms ineffective. To address this challenge, this article presents a novel parameter-free method for recovering SAR images from raw data without requiring any SAR system parameters. Firstly, we introduce an approximated matched filtering model that leverages the shift-invariance properties of SAR echoes, enabling image formation by convolving the raw data with an unknown reference echo. Secondly, we develop a Principal Component Maximization (PCM) method that exploits the low-dimensional structure of SAR signals to estimate the reference echo. The PCM method employs a three-stage procedure: 1) segment the raw data into blocks; 2) normalize the energy of each block; and 3) maximize the principal component's energy across all blocks, enabling robust estimation of the reference echo under non-stationary clutter. Experimental results on various SAR datasets demonstrate that our method can effectively recover SAR images from raw data without any system parameters. To facilitate reproducibility, the MATLAB program is available at https://github.com/huizhangyang/pcm.
Principal Component Maximization: A Novel Method for SAR Image Recovery From Raw Data Without System Parameters. IEEE Transactions on Image Processing, pp. 1231-1245.
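The three-stage PCM procedure in this abstract maps naturally onto an SVD: after segmenting and energy-normalizing the blocks, the direction that maximizes the total projected energy across all blocks is the leading right singular vector of the block matrix. The sketch below is our own re-implementation of that idea on a 1-D signal (function name, shapes, and block length are illustrative assumptions, not the paper's MATLAB code).

```python
import numpy as np

def pcm_reference_echo(raw, block_len):
    """Estimate a reference echo from raw data via the three-stage
    PCM procedure (illustrative sketch).

    raw: 1-D array of raw samples; block_len: samples per block.
    """
    # 1) Segment the raw data into equal-length blocks.
    n_blocks = len(raw) // block_len
    blocks = raw[:n_blocks * block_len].reshape(n_blocks, block_len)

    # 2) Normalize the energy of each block so that strong returns
    #    or clutter do not dominate the principal component.
    energy = np.linalg.norm(blocks, axis=1, keepdims=True)
    blocks = blocks / np.maximum(energy, 1e-12)

    # 3) Maximize the principal component's energy across all blocks:
    #    the maximizer is the leading right singular vector.
    _, _, vt = np.linalg.svd(blocks, full_matrices=False)
    return vt[0]
```

With a repeated chirp buried in noise, the recovered vector aligns (up to sign) with the true pulse, which is what makes the subsequent convolution-based image formation possible without system parameters.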