Pub Date : 2026-01-28DOI: 10.1109/TIP.2026.3657171
Hao Chen;Haoran Zhou;Yunshu Zhang;Zheng Lin;Yongjian Deng
In the RGB-D vision community, extensive research has been focused on designing multi-modal learning strategies and fusion structures. However, the complementary and fusion mechanisms in RGB-D models remain a opaque box. In this paper, we present an analytical framework and a novel score to dissect the RGB-D vision community. Our approach involves measuring proposed semantic variance and feature similarity across modalities and levels, conducting visual and quantitative analyzes on multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, evolution rules within each modality, and the collaboration logic used when optimizing a RGB-D model. Our studies reveal/verify several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and specialty simultaneously for complementary inference. We also showcase the versatility of the proposed RGB-D dissection method and introduce a straightforward fusion strategy based on our findings, which delivers significant enhancements across various tasks and even other multi-modal data.
{"title":"Dissecting RGB-D Learning for Improved Multi-Modal Fusion","authors":"Hao Chen;Haoran Zhou;Yunshu Zhang;Zheng Lin;Yongjian Deng","doi":"10.1109/TIP.2026.3657171","DOIUrl":"10.1109/TIP.2026.3657171","url":null,"abstract":"In the RGB-D vision community, extensive research has been focused on designing multi-modal learning strategies and fusion structures. However, the complementary and fusion mechanisms in RGB-D models remain a opaque box. In this paper, we present an analytical framework and a novel score to dissect the RGB-D vision community. Our approach involves measuring proposed semantic variance and feature similarity across modalities and levels, conducting visual and quantitative analyzes on multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, evolution rules within each modality, and the collaboration logic used when optimizing a RGB-D model. Our studies reveal/verify several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and specialty simultaneously for complementary inference. We also showcase the versatility of the proposed RGB-D dissection method and introduce a straightforward fusion strategy based on our findings, which delivers significant enhancements across various tasks and even other multi-modal data.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1846-1857"},"PeriodicalIF":13.7,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-28DOI: 10.1109/TIP.2026.3652003
Guangzhao Dai;Shuo Wang;Hao Zhao;Bin Zhu;Qianru Sun;Xiangbo Shu
Vision-and-Language Navigation in continuous environments (VLN-CE) requires an embodied robot to navigate the target destination following the natural language instruction. Most existing methods use panoramic RGB-D cameras for 360° observation of environments. However, these methods struggle in real-world applications because of the higher cost of panoramic RGB-D cameras. This paper studies a low-cost and practical VLN-CE setting, e.g., using monocular cameras of limited field of view, which means “Look Less” for visual observations and environment semantics. In this paper, we propose a ThinkMatter framework for monocular VLN-CE, where we motivate monocular robots to “Think More” by 1) generating novel views and 2) integrating instruction semantics. Specifically, we achieve the former by the proposed 3DGS-based panoramic generation to render novel views at each step, based on past observation collections. We achieve the latter by the proposed enhancement of the occupancy-instruction semantics, which integrates the spatial semantics of occupancy maps with the textual semantics of language instructions. These operations promote monocular robots with wider environment perceptions as well as transparent semantic connections with the instruction. Both extensive experiments in the simulators and real-world environments demonstrate the effectiveness of ThinkMatter, providing a promising practice for real-world navigation.
{"title":"ThinkMatter: Panoramic-Aware Instructional Semantics for Monocular Vision-and-Language Navigation","authors":"Guangzhao Dai;Shuo Wang;Hao Zhao;Bin Zhu;Qianru Sun;Xiangbo Shu","doi":"10.1109/TIP.2026.3652003","DOIUrl":"10.1109/TIP.2026.3652003","url":null,"abstract":"Vision-and-Language Navigation in continuous environments (VLN-CE) requires an embodied robot to navigate the target destination following the natural language instruction. Most existing methods use panoramic RGB-D cameras for 360° observation of environments. However, these methods struggle in real-world applications because of the higher cost of panoramic RGB-D cameras. This paper studies a low-cost and practical VLN-CE setting, e.g., using monocular cameras of limited field of view, which means “Look Less” for visual observations and environment semantics. In this paper, we propose a ThinkMatter framework for monocular VLN-CE, where we motivate monocular robots to “Think More” by 1) generating novel views and 2) integrating instruction semantics. Specifically, we achieve the former by the proposed 3DGS-based panoramic generation to render novel views at each step, based on past observation collections. We achieve the latter by the proposed enhancement of the occupancy-instruction semantics, which integrates the spatial semantics of occupancy maps with the textual semantics of language instructions. These operations promote monocular robots with wider environment perceptions as well as transparent semantic connections with the instruction. Both extensive experiments in the simulators and real-world environments demonstrate the effectiveness of ThinkMatter, providing a promising practice for real-world navigation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1937-1950"},"PeriodicalIF":13.7,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-21DOI: 10.1109/TIP.2026.3654367
Xiang Fang;Zizhuo Li;Jiayi Ma
Recent advancements have led the image matching community to increasingly focus on obtaining subpixel-level correspondences in a detector-free manner, i.e., semi-dense feature matching. Existing methods tend to overfocus on low-level local features while ignoring equally important high-level semantic information. To tackle these shortcomings, we propose SigMa, a semantic similarity-guided semi-dense feature matching method, which leverages the strengths of both local features and high-level semantic features. First, we design a dual-branch feature extractor, comprising a convolutional network and a vision foundation model, to extract low-level local features and high-level semantic features, respectively. To fully retain the advantages of these two features and effectively integrate them, we also introduce a cross-domain feature adapter, which could overcome their spatial resolution mismatches, channel dimensionality variations, and inter-domain gaps. Furthermore, we observe that performing the transformer on the whole feature map is unnecessary because of the similarity of local representations. We design a guided pooling method based on semantic similarity. This strategy performs attention computation by selecting highly semantically similar regions, aiming to minimize information loss while maintaining computational efficiency. Extensive experiments on multiple datasets demonstrate that our method achieves a competitive accuracy-efficiency trade-off across various tasks and exhibits strong generalization capabilities across different datasets. Additionally, we conduct a series of ablation studies and analysis experiments to validate the effectiveness and rationality of our method’s design. Our code is publicly available at https://github.com/ShineFox/SigMa
{"title":"SigMa: Semantic Similarity-Guided Semi-Dense Feature Matching","authors":"Xiang Fang;Zizhuo Li;Jiayi Ma","doi":"10.1109/TIP.2026.3654367","DOIUrl":"10.1109/TIP.2026.3654367","url":null,"abstract":"Recent advancements have led the image matching community to increasingly focus on obtaining subpixel-level correspondences in a detector-free manner, i.e., semi-dense feature matching. Existing methods tend to overfocus on low-level local features while ignoring equally important high-level semantic information. To tackle these shortcomings, we propose SigMa, a semantic similarity-guided semi-dense feature matching method, which leverages the strengths of both local features and high-level semantic features. First, we design a dual-branch feature extractor, comprising a convolutional network and a vision foundation model, to extract low-level local features and high-level semantic features, respectively. To fully retain the advantages of these two features and effectively integrate them, we also introduce a cross-domain feature adapter, which could overcome their spatial resolution mismatches, channel dimensionality variations, and inter-domain gaps. Furthermore, we observe that performing the transformer on the whole feature map is unnecessary because of the similarity of local representations. We design a guided pooling method based on semantic similarity. This strategy performs attention computation by selecting highly semantically similar regions, aiming to minimize information loss while maintaining computational efficiency. Extensive experiments on multiple datasets demonstrate that our method achieves a competitive accuracy-efficiency trade-off across various tasks and exhibits strong generalization capabilities across different datasets. Additionally, we conduct a series of ablation studies and analysis experiments to validate the effectiveness and rationality of our method’s design. Our code is publicly available at <uri>https://github.com/ShineFox/SigMa</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"872-887"},"PeriodicalIF":13.7,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146015333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised Domain Adaptation (UDA) person search aims to adapt models trained on labeled source data to unlabeled target domains. Existing approaches typically rely on clustering-based proxy learning, but their performance is often undermined by unreliable pseudo-supervision. This unreliability mainly stems from two challenges: (i) spectral shift bias, where low- and high-frequency components behave differently under domain shifts but are rarely considered, degrading feature stability; and (ii) static proxy updates, which make clustering proxies highly sensitive to noise and less adaptable to domain shifts. To address these challenges, we propose the Reliable Pseudo-supervision in UDA Person Search (RPPS) framework. At the feature level, a Dual-branch Wavelet Enhancement Module (DWEM) embedded in the backbone applies discrete wavelet transform (DWT) to decompose features into low- and high-frequency components, followed by differentiated enhancements that improve cross-domain robustness and discriminability. At the proxy level, a Dynamic Confidence-weighted Clustering Proxy (DCCP) employs confidence-guided initialization and a two-stage online–offline update strategy to stabilize proxy optimization and suppress proxy noise. Extensive experiments on the CUHK-SYSU and PRW benchmarks demonstrate that RPPS achieves state-of-the-art performance and strong robustness, underscoring the importance of enhancing pseudo-supervision reliability in UDA person search. Our code is accessible at https://github.com/zqx951102/RPPS
{"title":"Reliable Pseudo-Supervision for Unsupervised Domain Adaptive Person Search","authors":"Qixian Zhang;Duoqian Miao;Qi Zhang;Xuan Tan;Hongyun Zhang;Cairong Zhao","doi":"10.1109/TIP.2026.3654373","DOIUrl":"10.1109/TIP.2026.3654373","url":null,"abstract":"Unsupervised Domain Adaptation (UDA) person search aims to adapt models trained on labeled source data to unlabeled target domains. Existing approaches typically rely on clustering-based proxy learning, but their performance is often undermined by unreliable pseudo-supervision. This unreliability mainly stems from two challenges: (i) spectral shift bias, where low- and high-frequency components behave differently under domain shifts but are rarely considered, degrading feature stability; and (ii) static proxy updates, which make clustering proxies highly sensitive to noise and less adaptable to domain shifts. To address these challenges, we propose the Reliable Pseudo-supervision in UDA Person Search (RPPS) framework. At the feature level, a Dual-branch Wavelet Enhancement Module (DWEM) embedded in the backbone applies discrete wavelet transform (DWT) to decompose features into low- and high-frequency components, followed by differentiated enhancements that improve cross-domain robustness and discriminability. At the proxy level, a Dynamic Confidence-weighted Clustering Proxy (DCCP) employs confidence-guided initialization and a two-stage online–offline update strategy to stabilize proxy optimization and suppress proxy noise. Extensive experiments on the CUHK-SYSU and PRW benchmarks demonstrate that RPPS achieves state-of-the-art performance and strong robustness, underscoring the importance of enhancing pseudo-supervision reliability in UDA person search. Our code is accessible at <uri>https://github.com/zqx951102/RPPS</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"915-929"},"PeriodicalIF":13.7,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146015367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-21DOI: 10.1109/TIP.2026.3654473
Zhong Ji;Rongshuai Wei;Jingren Liu;Yanwei Pang;Jungong Han
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone’s feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%–8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
{"title":"Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts","authors":"Zhong Ji;Rongshuai Wei;Jingren Liu;Yanwei Pang;Jungong Han","doi":"10.1109/TIP.2026.3654473","DOIUrl":"10.1109/TIP.2026.3654473","url":null,"abstract":"Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone’s feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%–8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"930-942"},"PeriodicalIF":13.7,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146021122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The multi-classification of histopathological images under imbalanced sample conditions remains a long-standing unresolved challenge in computational pathology. In this paper, we propose for the first time a cross-patient pseudo-bag generation technique to address this challenge. Our key innovation lies in a cross-patient pseudo-bag generation framework that extracts complementary pathological features to construct distributionally consistent pseudo-bags. To resolve the critical challenge of distributional alignment in pseudo-bag generation, we propose an affinity-driven curriculum contrastive learning strategy, integrating sample affinity metrics with progressive training to stabilize representation learning. Unlike prior methods focused on bag-level embeddings, our framework pioneers a paradigm shift toward multi-instance feature distribution mining, explicitly modeling inter-bag heterogeneity to address class imbalance. Our method demonstrates significant performance improvements on three datasets with multiple classification difficulties, outperforming the second-best method by an average of 1.95 percentage points in F1 score and 2.07 percentage points in ACC.
{"title":"Imbalanced Multiclassification Challenges in Whole Slide Image: Cross-Patient Pseudo Bags Generation and Curriculum Contrastive Learning With Dynamic Rebalancing","authors":"Yonghuang Wu;Xuan Xie;Chengqian Zhao;Pengfei Song;Feiyu Yin;Guoqing Wu;Jinhua Yu","doi":"10.1109/TIP.2026.3654402","DOIUrl":"10.1109/TIP.2026.3654402","url":null,"abstract":"The multi-classification of histopathological images under imbalanced sample conditions remains a long-standing unresolved challenge in computational pathology. In this paper, we propose for the first time a cross-patient pseudo-bag generation technique to address this challenge. Our key innovation lies in a cross-patient pseudo-bag generation framework that extracts complementary pathological features to construct distributionally consistent pseudo-bags. To resolve the critical challenge of distributional alignment in pseudo-bag generation, we propose an affinity-driven curriculum contrastive learning strategy, integrating sample affinity metrics with progressive training to stabilize representation learning. Unlike prior methods focused on bag-level embeddings, our framework pioneers a paradigm shift toward multi-instance feature distribution mining, explicitly modeling inter-bag heterogeneity to address class imbalance. Our method demonstrates significant performance improvements on three datasets with multiple classification difficulties, outperforming the second-best method by an average of 1.95 percentage points in F1 score and 2.07 percentage points in ACC.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"904-914"},"PeriodicalIF":13.7,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146015338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-19DOI: 10.1109/TIP.2025.3650052
Yuming Yang;Wei Wang
Multi-exposure image fusion (MEF) is the main method to obtain High Dynamic Range (HDR) images by fusing multiple images taken under various exposure values. In this paper, we propose and develop a novel variational model based on detail-base decomposition for MEF. The main idea is to incorporate the decomposition procedure and the reconstruction procedure into a unified framework, and to interact the detail information and the base information at the same time. Specifically, we make use of Tikhonov regularization to model the base layer, and we present an efficient design to obtain the detail layer, which is able to capture more detailed information effectively. Meanwhile, we incorporate multi-scale techniques to remove halo artifacts. Numerically, we apply alternating direction method of multipliers (ADMM) to solve the proposed minimization problem. Theoretically, we study the existence of the solution of the proposed model and the convergence of the proposed ADMM algorithm. Experimental examples are presented to demonstrate that the performance of the proposed model is better than that by using other testing methods in terms of visual quality and some criteria, e. g., the proposed model gives the best Natural image quality evaluator (NIQE) values with 1% - 10% improvement for real image fusion experiments and gives the best PSNR values with 13% - 20% improvement for the synthetic image fusion experiment.
{"title":"A Variational Multi-Scale Model for Multi-Exposure Image Fusion","authors":"Yuming Yang;Wei Wang","doi":"10.1109/TIP.2025.3650052","DOIUrl":"10.1109/TIP.2025.3650052","url":null,"abstract":"Multi-exposure image fusion (MEF) is the main method to obtain High Dynamic Range (HDR) images by fusing multiple images taken under various exposure values. In this paper, we propose and develop a novel variational model based on detail-base decomposition for MEF. The main idea is to incorporate the decomposition procedure and the reconstruction procedure into a unified framework, and to interact the detail information and the base information at the same time. Specifically, we make use of Tikhonov regularization to model the base layer, and we present an efficient design to obtain the detail layer, which is able to capture more detailed information effectively. Meanwhile, we incorporate multi-scale techniques to remove halo artifacts. Numerically, we apply alternating direction method of multipliers (ADMM) to solve the proposed minimization problem. Theoretically, we study the existence of the solution of the proposed model and the convergence of the proposed ADMM algorithm. Experimental examples are presented to demonstrate that the performance of the proposed model is better than that by using other testing methods in terms of visual quality and some criteria, e. g., the proposed model gives the best Natural image quality evaluator (NIQE) values with 1% - 10% improvement for real image fusion experiments and gives the best PSNR values with 13% - 20% improvement for the synthetic image fusion experiment.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"701-716"},"PeriodicalIF":13.7,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146000606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-16DOI: 10.1109/TIP.2026.3653189
Xiang Fang;Shihua Zhang;Hao Zhang;Xiaoguang Mei;Huabing Zhou;Jiayi Ma
Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing their underlying different information. Previous methods either treat the information equally or require the explicit storage of the entire context, tending to be laborious in real-world scenarios. Inspired by Mamba’s inherent selectivity, we propose CorrMamba, a Correspondence filter leveraging Mamba’s ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent Mamba from being potentially impacted by unordered keypoints that obscured its ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and analysis demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°, highlighting its practical superiority. Our code is publicly available at https://github.com/ShineFox/CorrMamba
{"title":"Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning","authors":"Xiang Fang;Shihua Zhang;Hao Zhang;Xiaoguang Mei;Huabing Zhou;Jiayi Ma","doi":"10.1109/TIP.2026.3653189","DOIUrl":"10.1109/TIP.2026.3653189","url":null,"abstract":"Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing their underlying different information. Previous methods either treat the information equally or require the explicit storage of the entire context, tending to be laborious in real-world scenarios. Inspired by Mamba’s inherent selectivity, we propose CorrMamba, a Correspondence filter leveraging Mamba’s ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent Mamba from being potentially impacted by unordered keypoints that obscured its ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and analysis demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°, highlighting its practical superiority. Our code is publicly available at <uri>https://github.com/ShineFox/CorrMamba</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"816-829"},"PeriodicalIF":13.7,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145991971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advancements in multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We could term this method as LLMs for Vision because of its employing LLMs for visual understanding and reasoning, yet observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, which could be regarded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs through empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on image-text understanding multimodal benchmarks. The codes will be available at: https://github.com/HITsz-TMG/MKS2-Multimodal-Knowledge-Storage-and-Sharing
{"title":"Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs","authors":"Yunxin Li;Zhenyu Liu;Baotian Hu;Wei Wang;Yuxin Ding;Xiaochun Cao;Min Zhang","doi":"10.1109/TIP.2025.3649356","DOIUrl":"10.1109/TIP.2025.3649356","url":null,"abstract":"Recent advancements in multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We could term this method as LLMs for Vision because of its employing LLMs for visual understanding and reasoning, yet observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, which could be regarded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs through empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on image-text understanding multimodal benchmarks. The codes will be available at: <uri>https://github.com/HITsz-TMG/MKS2-Multimodal-Knowledge-Storage-and-Sharing</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"858-871"},"PeriodicalIF":13.7,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145992007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-16DOI: 10.1109/TIP.2026.3652360
Tao Hu;Longyao Wu;Wei Dong;Peng Wu;Jinqiu Sun;Xiaogang Xu;Qingsen Yan;Yanning Zhang
Recovering High Dynamic Range (HDR) images from multiple Standard Dynamic Range (SDR) images becomes challenging when the SDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB SDR images, the domain/format gap poses a significant challenge when applying it to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from SDR domain via self-distillation to boost existing HDR reconstruction. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a Semantic Knowledge Alignment Module (SKAM) to fill the missing semantic contents with the complementary masks. Extensive experiments demonstrate that our framework significantly boosts HDR imaging quality for existing methods without altering the network architecture.
{"title":"Boosting HDR Image Reconstruction via Semantic Knowledge Transfer","authors":"Tao Hu;Longyao Wu;Wei Dong;Peng Wu;Jinqiu Sun;Xiaogang Xu;Qingsen Yan;Yanning Zhang","doi":"10.1109/TIP.2026.3652360","DOIUrl":"10.1109/TIP.2026.3652360","url":null,"abstract":"Recovering High Dynamic Range (HDR) images from multiple Standard Dynamic Range (SDR) images becomes challenging when the SDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB SDR images, the domain/format gap poses a significant challenge when applying it to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from SDR domain via self-distillation to boost existing HDR reconstruction. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a Semantic Knowledge Alignment Module (SKAM) to fill the missing semantic contents with the complementary masks. Extensive experiments demonstrate that our framework significantly boosts HDR imaging quality for existing methods without altering the network architecture.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1910-1922"},"PeriodicalIF":13.7,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145991925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}