Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning
Pub Date: 2026-01-16 | DOI: 10.1109/TIP.2026.3653189
Xiang Fang, Shihua Zhang, Hao Zhang, Xiaoguang Mei, Huabing Zhou, Jiayi Ma
Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing the distinct information underlying each. Previous methods either treat all correspondences equally or require explicitly storing the entire context, which becomes laborious in real-world scenarios. Inspired by Mamba's inherent selectivity, we propose CorrMamba, a Correspondence filter leveraging Mamba's ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent unordered keypoints from obscuring Mamba's ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and further analyses demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°, highlighting its practical superiority. Our code will be publicly available.
Boosting HDR Image Reconstruction via Semantic Knowledge Transfer
Pub Date: 2026-01-16 | DOI: 10.1109/TIP.2026.3652360
Tao Hu, Longyao Wu, Wei Dong, Peng Wu, Jinqiu Sun, Xiaogang Xu, Qingsen Yan, Yanning Zhang
Recovering High Dynamic Range (HDR) images from multiple Standard Dynamic Range (SDR) images becomes challenging when the SDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB SDR images, and the domain/format gap poses a significant challenge when applying them to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from the SDR domain via self-distillation to boost existing HDR reconstruction methods. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and the SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a Semantic Knowledge Alignment Module (SKAM) to fill in missing semantic content via complementary masks. Extensive experiments demonstrate that our framework significantly boosts the HDR imaging quality of existing methods without altering their network architectures.
Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Pub Date: 2026-01-16 | DOI: 10.1109/TIP.2025.3649356
Yunxin Li, Zhenyu Liu, Baotian Hu, Wei Wang, Yuxin Ding, Xiaochun Cao, Min Zhang
Recent advancements in multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into the language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. This paradigm could be termed LLMs for Vision, since it employs LLMs for visual understanding and reasoning; yet we observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, which could be regarded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs by empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on image-text understanding multimodal benchmarks. The code will be available at https://github.com/HITsz-TMG/MKS2-Multimodal-Knowledge-Storage-and-Sharing.
Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization
Pub Date: 2026-01-16 | DOI: 10.1109/TIP.2026.3652431
Shuai Gong, Chaoran Cui, Xiaolin Dong, Xiushan Nie, Lei Zhu, Xiaojun Chang
Federated Domain Generalization (FedDG) aims to train a globally generalizable model on data from decentralized, heterogeneous clients. While recent work has adapted vision-language models for FedDG using prompt learning, the prevailing "one-prompt-fits-all" paradigm struggles with sample diversity, causing a marked performance decline on personalized samples. The Mixture of Experts (MoE) architecture offers a promising solution for specialization. However, existing MoE-based prompt learning methods suffer from two key limitations: coarse image-level expert assignment and high communication costs from parameterized routers. To address these limitations, we propose TRIP, a Token-level pRompt mIxture with Parameter-free routing framework for FedDG. TRIP treats prompts as multiple experts and assigns individual tokens within an image to distinct experts, facilitating the capture of fine-grained visual patterns. To ensure communication efficiency, TRIP introduces a parameter-free routing mechanism based on capacity-aware clustering and Optimal Transport (OT). First, tokens are grouped into capacity-aware clusters to ensure balanced workloads. These clusters are then assigned to experts via OT, stabilized by mapping cluster centroids to static, non-learnable keys. The final instance-specific prompt is synthesized by aggregating experts, weighted by the number of tokens assigned to each. Extensive experiments across four benchmarks demonstrate that TRIP achieves the strongest generalization performance while communicating as few as 1K parameters. Our code is available at https://github.com/GongShuai8210/TRIP.
{"title":"Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization.","authors":"Shuai Gong, Chaoran Cui, Xiaolin Dong, Xiushan Nie, Lei Zhu, Xiaojun Chang","doi":"10.1109/TIP.2026.3652431","DOIUrl":"https://doi.org/10.1109/TIP.2026.3652431","url":null,"abstract":"<p><p>Domain Generalization (FedDG) aims to train a globally generalizable model on data from decentralized, heterogeneous clients. While recent work has adapted vision-language models for FedDG using prompt learning, the prevailing \"one-prompt-fits-all\" paradigm struggles with sample diversity, causing a marked performance decline on personalized samples. The Mixture of Experts (MoE) architecture offers a promising solution for specialization. However, existing MoE-based prompt learning methods suffer from two key limitations: coarse image-level expert assignment and high communication costs from parameterized routers. To address these limitations, we propose TRIP, a Token-level pRompt mIxture with Parameter-free routing framework for FedDG. TRIP treats prompts as multiple experts, and assigns individual tokens within an image to distinct experts, facilitating the capture of fine-grained visual patterns. To ensure communication efficiency, TRIP introduces a parameter-free routing mechanism based on capacity-aware clustering and Optimal Transport (OT). First, tokens are grouped into capacity-aware clusters to ensure balanced workloads. These clusters are then assigned to experts via OT, stabilized by mapping cluster centroids to static, non-learnable keys. The final instance-specific prompt is synthesized by aggregating experts, weighted by the number of tokens assigned to each. Extensive experiments across four benchmarks demonstrate that TRIP achieves optimal generalization results, with communicating as few as 1K parameters. Our code is available at https://github.com/GongShuai8210/TRIP.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145991945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Complex-valued SAR Foundation Model Based on Physically Inspired Representation Learning
Pub Date: 2026-01-16 | DOI: 10.1109/TIP.2026.3652417
Mengyu Wang, Hanbo Bi, Yingchao Feng, Linlin Xin, Shuo Gong, Tianqi Wang, Zhiyuan Yan, Peijin Wang, Wenhui Diao, Xian Sun
Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process during pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficients. To guide the pre-training process, a polarimetric decomposition loss and a power self-supervised loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the power of the input image. We validate the foundation model on nine typical downstream tasks, where it achieves state-of-the-art results. Notably, the foundation model extracts stable feature representations and exhibits strong generalization, even in data-scarce conditions.
Reviewer Summary for Transactions on Image Processing
Pub Date: 2026-01-12 | DOI: 10.1109/TIP.2025.3650664 | Vol. 34, pp. 8684-8708
Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11346802
Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds
Pub Date: 2026-01-02 | DOI: 10.1109/TIP.2025.3648203
Hao Jing, Anhong Wang, Yifan Zhang, Donghan Bu, Junhui Hou
In intelligent transportation systems, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among connected agents, such as vehicles and infrastructure, under restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which incurs a transmission burden from reflectance encoding and limits detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded, reflectance-free compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as the student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our cross-source distillation training strategy (CDTS) equips the student detector with robustness to low-quality compressed data while preserving the accuracy benefits of raw data through transferred distillation knowledge. Experimental results on the KITTI and DAIR-V2X-V datasets demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. We will release the code publicly at https://github.com/HaoJing-SX/RPKD.
{"title":"Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds.","authors":"Hao Jing, Anhong Wang, Yifan Zhang, Donghan Bu, Junhui Hou","doi":"10.1109/TIP.2025.3648203","DOIUrl":"https://doi.org/10.1109/TIP.2025.3648203","url":null,"abstract":"<p><p>Regarding intelligent transportation systems, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among connected agents, such as vehicles and infrastructures, under restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which faces transmission burdens from reflectance encoding and limited detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded non-reflectance compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as the student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our cross-source distillation training strategy (CDTS) equips the student detector with robustness to low-quality compressed data while preserving the accuracy benefits of raw data through transferred distillation knowledge. Experimental results on the KITTI and DAIR-V2X-V datasets demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. We will release the code publicly at https://github.com/HaoJing-SX/RPKD.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145893537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Token Calibration for Transformer-Based Domain Adaptation
Pub Date: 2026-01-01 | DOI: 10.1109/TIP.2025.3647367 | Vol. 35, pp. 57-68
Xiaowei Fu, Shiyu Ye, Chenxu Zhang, Fuxiang Huang, Xin Xu, Lei Zhang
Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain by learning domain-invariant representations. Motivated by the recent success of Vision Transformers (ViTs), several UDA approaches have adopted ViT architectures to exploit fine-grained patch-level representations; we collectively refer to these as Transformer-based Domain Adaptation (TransDA), as distinct from CNN-based methods. However, we make a key observation in TransDA: due to inherent domain shifts, patches (tokens) from different semantic categories across domains may exhibit abnormally high similarities, which can mislead the self-attention mechanism and degrade adaptation performance. To solve this, we propose a novel Patch-Adaptation Transformer (PATrans), which first identifies similarity-anomalous patches and then adaptively suppresses their negative impact on domain alignment, i.e., token calibration. Specifically, we introduce a Patch-Adaptation Attention (PAA) mechanism to replace the standard self-attention mechanism, which consists of a weight-shared triple-branch mixed attention mechanism and a patch-level domain discriminator. The mixed attention integrates self-attention and cross-attention to enhance intra-domain feature modeling and inter-domain similarity estimation. Meanwhile, the patch-level domain discriminator quantifies the anomaly probability of each patch, enabling dynamic reweighting to mitigate the impact of unreliable patch correspondences. Furthermore, we introduce a contrastive attention regularization strategy, which leverages category-level information in a contrastive learning framework to promote class-consistent attention distributions. Extensive experiments on four benchmark datasets demonstrate that PATrans attains significant improvements over existing state-of-the-art UDA methods (e.g., 89.2% on VisDA-2017). Code is available at: https://github.com/YSY145/PATrans
Coupled Diffusion Posterior Sampling for Unsupervised Hyperspectral and Multispectral Images Fusion
Pub Date: 2026-01-01 | DOI: 10.1109/TIP.2025.3647207 | Vol. 35, pp. 69-84
Yang Xu, Jian Zhu, Danfeng Hong, Zhihui Wei, Zebin Wu
The fusion of hyperspectral images (HSIs) and multispectral images (MSIs) is a hot topic in the remote sensing community. A high-resolution HSI (HR-HSI) can be obtained by fusing a low-resolution HSI (LR-HSI) with a high-resolution MSI (HR-MSI) or RGB image. However, most deep learning-based methods require a large number of HR-HSIs for supervised training, which are very rare in practice. In this paper, we propose a coupled diffusion posterior sampling (CDPS) method for HSI and MSI fusion in which HR-HSIs are not required during training. Because the LR-HSI contains the spectral information and the HR-MSI contains the spatial information of the captured scene, we design an unsupervised strategy that learns the required diffusion priors directly and solely from the input test image pair (the LR-HSI and HR-MSI themselves). Then, a coupled diffusion posterior sampling method is proposed to introduce the two priors into posterior sampling, which leverages the observed LR-HSI and HR-MSI as fidelity terms. Experimental results demonstrate that the proposed method outperforms other state-of-the-art unsupervised HSI and MSI fusion methods. Additionally, this method utilizes smaller networks that are simpler and easier to train without additional data.
{"title":"Coupled Diffusion Posterior Sampling for Unsupervised Hyperspectral and Multispectral Images Fusion","authors":"Yang Xu;Jian Zhu;Danfeng Hong;Zhihui Wei;Zebin Wu","doi":"10.1109/TIP.2025.3647207","DOIUrl":"10.1109/TIP.2025.3647207","url":null,"abstract":"Hyperspectral images (HSIs) and multispectral images (MSIs) fusion is a hot topic in the remote sensing society. A high-resolution HSI (HR-HSI) can be obtained by fusing a low-resolution HSI (LR-HSI) and a high-resolution MSI (HR-MSI) or RGB image. However, most deep learning-based methods require a large amount of HR-HSIs for supervised training, which is very rare in practice. In this paper, we propose a coupled diffusion posterior sampling (CDPS) method for HSI and MSI fusion in which the HR-HSIs are no longer required in the training process. Because the LR-HSI contains the spectral information and HR-MSI contains the spatial information of the captured scene, we design an unsupervised strategy that learns the required diffusion priors directly and solely from the input test image pair (the LR-HSI and HR-MSI themselves). Then, a coupled diffusion posterior sampling method is proposed to introduce the two priors in the diffusion posterior sampling which leverages the observed LR-HSI and HR-MSI as fidelity terms. Experimental results demonstrate that the proposed method outperforms other state-of-the-art unsupervised HSI and MSI fusion methods. Additionally, this method utilizes smaller networks that are simpler and easier to train without other data.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"69-84"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145890776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implicit Neural Compression of Point Clouds
Pub Date: 2026-01-01 | DOI: 10.1109/TIP.2025.3648141
Hongning Ruan, Yulin Shao, Qianqian Yang, Liang Zhao, Zhaoyang Zhang, Dusit Niyato
Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC3, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both the geometry and attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses the network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by inputting voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC3. Experimental results validate the effectiveness of our approach: for static point clouds, NeRC3 outperforms the octree-based G-PCC standard and existing INR-based methods; for dynamic point clouds, 4D-NeRC3 achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards, while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.