Double Nonconvex Tensor Robust Kernel Principal Component Analysis and Its Visual Applications
Pub Date : 2026-02-06 DOI: 10.1109/TIP.2026.3659302
Liang Wu, Jianjun Wang, Wei-Shi Zheng, Guangming Shi
Tensor robust principal component analysis (TRPCA), as a popular linear low-rank method, has been widely applied to various visual tasks. Its low-rank prior is mathematically derived from a linear latent variable model. However, for nonlinear tensor data with rich information, the nonlinear structures may violate the low-rankness assumption and lead to large approximation errors for TRPCA. Motivated by the latent low-dimensionality of nonlinear tensors, this paper first establishes the general paradigm of the nonlinear-tensor-plus-sparse-tensor decomposition problem, called tensor robust kernel principal component analysis (TRKPCA). To efficiently tackle the TRKPCA problem, two novel nonconvex regularizers are designed: the kernelized tensor Schatten-p norm (KTSPN) and a generalized nonconvex regularization, where the former, with tighter theoretical support, adequately captures nonlinear features (i.e., implicit low-rankness), and the latter ensures sparser structural coding, guaranteeing more robust separation results. By integrating their strengths, we propose a double nonconvex TRKPCA (DNTRKPCA) method. Finally, we develop an efficient optimization framework via the alternating direction method of multipliers (ADMM) to implement the proposed nonconvex kernel method. Experimental results on synthetic data and several real databases show that our method is more competitive than other state-of-the-art regularization methods. The code has been released on our ResearchGate homepage: https://www.researchgate.net/publication/397181729 (DNTRKPCA code).
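For readers unfamiliar with this family of models, the decomposition paradigm described above can be illustrated in a generic, simplified form (an assumption for illustration, not the paper's exact objective):

\min_{\mathcal{L},\,\mathcal{S}} \; \big\|\Phi(\mathcal{L})\big\|_{S_p}^{p} + \lambda\, g(\mathcal{S}) \quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{S},

where \mathcal{X} is the observed tensor, \Phi(\cdot) an implicit kernel-induced feature map, \|\cdot\|_{S_p}^{p} a tensor Schatten-p quasi-norm (0 < p < 1) encouraging implicit low-rankness, g(\cdot) a generalized nonconvex sparsity penalty, and \lambda > 0 a balance parameter; the constrained problem is then split into subproblems and solved with ADMM.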
{"title":"Double Nonconvex Tensor Robust Kernel Principal Component Analysis and Its Visual Applications.","authors":"Liang Wu, Jianjun Wang, Wei-Shi Zheng, Guangming Shi","doi":"10.1109/TIP.2026.3659302","DOIUrl":"https://doi.org/10.1109/TIP.2026.3659302","url":null,"abstract":"<p><p>Tensor robust principal component analysis (TRPCA), as a popular linear low-rank method, has been widely applied to various visual tasks. The mathematical process of the low-rank prior is derived from the linear latent variable model. However, for nonlinear tensor data with rich information, their nonlinear structures may break through the assumption of low-rankness and lead to the large approximation error for TRPCA. Motivated by the latent low-dimensionality of nonlinear tensors, the general paradigm of the nonlinear tensor plus sparse tensor decomposition problem, called tensor robust kernel principal component analysis (TRKPCA), is first established in this paper. To efficiently tackle TRKPCA problem, two novel nonconvex regularizers the kernelized tensor Schatten-p norm (KTSPN) and generalized nonconvex regularization are designed, where the former KTSPN with tighter theoretical support adequately captures nonlinear features (i.e., implicit low-rankness) and the latter ensures the sparser structural coding, guaranteeing more robust separation results. Then by integrating their strengths, we propose a double nonconvex TRKPCA (DNTRKPCA) method to achieve our expectation. Finally, we develop an efficient optimization framework via the alternating direction multiplier method (ADMM) to implement the proposed nonconvex kernel method. Experimental results on synthetic data and several real databases show the higher competitiveness of our method compared with other state-of-the-art regularization methods. The code has been released in our ResearchGate homepage: https://www.researchgate.net/publication/397181729 DNTRKPCA code.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DrivingEditor: 4D Composite Gaussian Splatting for Reconstruction and Edition of Dynamic Autonomous Driving Scenes
Pub Date : 2026-02-06 DOI: 10.1109/TIP.2026.3659733
Wang Xu, Yeqiang Qian, Yun-Fu Liu, Lei Tuo, Huiyong Chen, Ming Yang
In recent years, with the development of autonomous driving, 3D reconstruction of unbounded large-scale scenes has attracted researchers' attention. Existing methods achieve outstanding reconstruction accuracy in autonomous driving scenes, but most of them lack the ability to edit scenes. Methods that do support scene editing are highly dependent on manually annotated 3D bounding boxes, which leads to poor scalability. To address these issues, we introduce a new Gaussian representation, called DrivingEditor, which decouples the scene into two parts and handles them with separate branches that individually model the dynamic foreground objects and the static background during training. This decoupled modeling framework enables accurate editing of any dynamic target, such as removing or adding dynamic objects, while improving the reconstruction quality of autonomous driving scenes, especially for dynamic foreground objects, without resorting to 3D bounding boxes. Extensive experiments on the Waymo Open Dataset and KITTI benchmarks demonstrate its performance in 3D reconstruction for both dynamic and static scenes. In addition, we conduct extra experiments on unstructured large-scale scenarios, which further demonstrate the performance and robustness of the proposed model when rendering unstructured scenes. Our code is available at https://github.com/WangXu-xxx/DrivingEditor.
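As a rough illustration of such decoupled scene modeling (a minimal sketch under assumed data structures, not the DrivingEditor implementation), a composite representation can keep the static background and each dynamic object as separate Gaussian sets, so that editing reduces to adding or removing an object's set:

# Minimal sketch: one static background Gaussian set plus one Gaussian set per dynamic
# object; per-frame poses place each object set into the world, and scene editing amounts
# to manipulating the object dictionary. Names and fields are illustrative assumptions.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class GaussianSet:
    means: np.ndarray       # (N, 3) Gaussian centers in the set's local frame
    features: np.ndarray    # (N, C) opacity/color/covariance parameters (opaque here)

@dataclass
class CompositeScene:
    background: GaussianSet
    objects: dict = field(default_factory=dict)   # object_id -> GaussianSet
    poses: dict = field(default_factory=dict)     # (object_id, frame) -> (R 3x3, t 3)

    def gather(self, frame):
        """Collect all Gaussians visible at `frame` in world coordinates."""
        means, feats = [self.background.means], [self.background.features]
        for oid, obj in self.objects.items():
            R, t = self.poses[(oid, frame)]
            means.append(obj.means @ R.T + t)     # rigidly place the dynamic object
            feats.append(obj.features)
        return np.concatenate(means), np.concatenate(feats)

    def remove_object(self, oid):                 # editing = dropping an object's Gaussians
        self.objects.pop(oid, None)

scene = CompositeScene(background=GaussianSet(np.zeros((100, 3)), np.ones((100, 8))))
scene.objects["car_0"] = GaussianSet(np.random.rand(50, 3), np.random.rand(50, 8))
scene.poses[("car_0", 0)] = (np.eye(3), np.array([5.0, 0.0, 0.0]))
means, feats = scene.gather(frame=0)
scene.remove_object("car_0")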
{"title":"DrivingEditor: 4D Composite Gaussian Splatting for Reconstruction and Edition of Dynamic Autonomous Driving Scenes.","authors":"Wang Xu, Yeqiang Qian, Yun-Fu Liu, Lei Tuo, Huiyong Chen, Ming Yang","doi":"10.1109/TIP.2026.3659733","DOIUrl":"https://doi.org/10.1109/TIP.2026.3659733","url":null,"abstract":"<p><p>In recent years, with the development of autonomous driving, 3D reconstruction for unbounded large-scale scenes has attracted researchers' attention. Existing methods have achieved outstanding reconstruction accuracy in autonomous driving scenes, but most of them lack the ability to edit scenes. Although some methods have the capability to edit scenarios, they are highly dependent on manually annotated 3D bounding boxes, leading to their poor scalability. To address the issues, we introduce a new Gaussian representation, called DrivingEditor, which decouples the scene into two parts and handles them by separate branches to individually model the dynamic foreground objects and the static background during the training process. By proposing a framework for decoupled modeling of scenarios, we can achieve accurate editing of any dynamic target, such as dynamic objects removal, adding and etc, meanwhile improving the reconstruction quality of autonomous driving scenes especially the dynamic foreground objects, without resorting to 3D bounding boxes. Extensive experiments on Waymo Open Dataset and KITTI benchmarks demonstrate the performance in 3D reconstruction for both dynamic and static scenes. Besides, we conduct extra experiments on unstructured large-scale scenarios, which can more convincingly demonstrate the performance and robustness of our proposed model when rendering the unstructured scenes. Our code is available at https://github.com/WangXu-xxx/DrivingEditor.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Positional Encoding Image Prior
Pub Date : 2026-02-06 DOI: 10.1109/TIP.2026.3653206
Nimrod Shabtay, Eli Schwartz, Raja Giryes
In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent input to a degraded (e.g., noisy) image, yet in the process it learns to reconstruct the clean image. This phenomenon is attributed to the CNN's internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random latent input with Fourier features (positional encoding). We empirically demonstrate that the convolution layers in DIP can be replaced with simple pixel-level MLPs thanks to the properties of Fourier features. We also prove that the two are equivalent in the case of linear networks. We name our scheme "Positional Encoding Image Prior" (PIP) and show that it performs very similarly to DIP on various image-reconstruction tasks with far fewer parameters. Furthermore, we demonstrate that PIP can be easily extended to videos, an area where methods based on image priors and certain INR approaches face challenges with stability. Code and additional examples for all tasks, including videos, are available on the project page nimrodshabtay.github.io/PIP.
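A minimal sketch of the underlying idea (assuming a Gaussian Fourier-feature mapping and a small per-pixel MLP; layer sizes, frequency scale, and iteration count are illustrative, not the paper's settings):

# Sketch: encode pixel coordinates with Fourier features and fit a pixel-wise MLP to a
# degraded target image; early stopping of the fit plays the role of the image prior.
import math
import torch
import torch.nn as nn

def fourier_features(coords, B):
    # coords: (H*W, 2) in [0, 1]; B: (2, num_freqs) random frequency matrix
    proj = 2 * math.pi * coords @ B
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

H, W, num_freqs = 64, 64, 128
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)
B = torch.randn(2, num_freqs) * 10.0            # frequency scale is a free hyperparameter

enc = fourier_features(coords, B)               # (H*W, 2*num_freqs)
mlp = nn.Sequential(nn.Linear(2 * num_freqs, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 3))           # per-pixel RGB prediction

noisy = torch.rand(H * W, 3)                     # stand-in for the degraded target image
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
for step in range(1000):
    opt.zero_grad()
    loss = ((mlp(enc) - noisy) ** 2).mean()
    loss.backward()
    opt.step()
restored = mlp(enc).detach().reshape(H, W, 3)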
{"title":"Positional Encoding Image Prior.","authors":"Nimrod Shabtay, Eli Schwartz, Raja Giryes","doi":"10.1109/TIP.2026.3653206","DOIUrl":"https://doi.org/10.1109/TIP.2026.3653206","url":null,"abstract":"<p><p>In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent space to a degraded (e.g. noisy) image but in the process learns to reconstruct the clean image. This phenomenon is attributed to CNN's internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random latent with Fourier-Features (Positional Encoding). We empirically demonstrate that the convolution layers in DIP can be replaced with simple pixel-level MLPs thanks to the Fourier features properties. We also prove that they are equivalent in the case of linear networks. We name our scheme \"Positional Encoding Image Prior\" (PIP) and exhibit that it performs very similar to DIP on various image-reconstruction tasks with much fewer parameters. Furthermore, we demonstrate that PIP can be easily extended to videos, an area where methods based on image-priors and certain INR approaches face challenges with stability. Code and additional examples for all tasks, including videos, are available on the project page nimrodshabtay.github.io/PIP.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SigMa: Semantic Similarity-Guided Semi-Dense Feature Matching
Pub Date : 2026-01-21 DOI: 10.1109/TIP.2026.3654367
Xiang Fang;Zizhuo Li;Jiayi Ma
Recent advancements have led the image matching community to increasingly focus on obtaining subpixel-level correspondences in a detector-free manner, i.e., semi-dense feature matching. Existing methods tend to overfocus on low-level local features while ignoring equally important high-level semantic information. To tackle these shortcomings, we propose SigMa, a semantic similarity-guided semi-dense feature matching method, which leverages the strengths of both local features and high-level semantic features. First, we design a dual-branch feature extractor, comprising a convolutional network and a vision foundation model, to extract low-level local features and high-level semantic features, respectively. To fully retain the advantages of these two kinds of features and effectively integrate them, we also introduce a cross-domain feature adapter, which overcomes their spatial-resolution mismatches, channel-dimensionality variations, and inter-domain gaps. Furthermore, we observe that applying the transformer to the whole feature map is unnecessary because of the similarity of local representations. We therefore design a guided pooling method based on semantic similarity, which computes attention only over highly semantically similar regions, aiming to minimize information loss while maintaining computational efficiency. Extensive experiments on multiple datasets demonstrate that our method achieves a competitive accuracy-efficiency trade-off across various tasks and exhibits strong generalization capabilities across different datasets. Additionally, we conduct a series of ablation studies and analysis experiments to validate the effectiveness and rationality of our method's design. Our code is publicly available at https://github.com/ShineFox/SigMa.
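As a sketch of what semantic similarity-guided pooling can look like (an illustrative top-k variant under assumed shapes, not the exact module in SigMa):

# Sketch: restrict attention for each query token to its top-k most semantically similar
# tokens, instead of attending over the full feature map.
import torch
import torch.nn.functional as F

def similarity_guided_attention(q, k, v, sem, topk=64):
    # q, k, v: (N, d) projected local features; sem: (N, ds) high-level semantic features
    sem = F.normalize(sem, dim=-1)
    sim = sem @ sem.t()                              # (N, N) cosine similarities
    idx = sim.topk(topk, dim=-1).indices             # top-k semantically similar tokens per query
    k_sel, v_sel = k[idx], v[idx]                    # (N, topk, d)
    attn = torch.softmax((q.unsqueeze(1) * k_sel).sum(-1) / k.shape[-1] ** 0.5, dim=-1)
    return (attn.unsqueeze(-1) * v_sel).sum(1)       # (N, d) aggregated messages

N, d, ds = 1024, 64, 128
out = similarity_guided_attention(torch.randn(N, d), torch.randn(N, d),
                                  torch.randn(N, d), torch.randn(N, ds))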
{"title":"SigMa: Semantic Similarity-Guided Semi-Dense Feature Matching","authors":"Xiang Fang;Zizhuo Li;Jiayi Ma","doi":"10.1109/TIP.2026.3654367","DOIUrl":"10.1109/TIP.2026.3654367","url":null,"abstract":"Recent advancements have led the image matching community to increasingly focus on obtaining subpixel-level correspondences in a detector-free manner, i.e., semi-dense feature matching. Existing methods tend to overfocus on low-level local features while ignoring equally important high-level semantic information. To tackle these shortcomings, we propose SigMa, a semantic similarity-guided semi-dense feature matching method, which leverages the strengths of both local features and high-level semantic features. First, we design a dual-branch feature extractor, comprising a convolutional network and a vision foundation model, to extract low-level local features and high-level semantic features, respectively. To fully retain the advantages of these two features and effectively integrate them, we also introduce a cross-domain feature adapter, which could overcome their spatial resolution mismatches, channel dimensionality variations, and inter-domain gaps. Furthermore, we observe that performing the transformer on the whole feature map is unnecessary because of the similarity of local representations. We design a guided pooling method based on semantic similarity. This strategy performs attention computation by selecting highly semantically similar regions, aiming to minimize information loss while maintaining computational efficiency. Extensive experiments on multiple datasets demonstrate that our method achieves a competitive accuracy-efficiency trade-off across various tasks and exhibits strong generalization capabilities across different datasets. Additionally, we conduct a series of ablation studies and analysis experiments to validate the effectiveness and rationality of our method’s design. Our code is publicly available at <uri>https://github.com/ShineFox/SigMa</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"872-887"},"PeriodicalIF":13.7,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146015333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliable Pseudo-Supervision for Unsupervised Domain Adaptive Person Search
Pub Date : 2026-01-21 DOI: 10.1109/TIP.2026.3654373
Qixian Zhang;Duoqian Miao;Qi Zhang;Xuan Tan;Hongyun Zhang;Cairong Zhao
Unsupervised Domain Adaptation (UDA) person search aims to adapt models trained on labeled source data to unlabeled target domains. Existing approaches typically rely on clustering-based proxy learning, but their performance is often undermined by unreliable pseudo-supervision. This unreliability mainly stems from two challenges: (i) spectral shift bias, where low- and high-frequency components behave differently under domain shifts but are rarely considered, degrading feature stability; and (ii) static proxy updates, which make clustering proxies highly sensitive to noise and less adaptable to domain shifts. To address these challenges, we propose the Reliable Pseudo-supervision in UDA Person Search (RPPS) framework. At the feature level, a Dual-branch Wavelet Enhancement Module (DWEM) embedded in the backbone applies the discrete wavelet transform (DWT) to decompose features into low- and high-frequency components, followed by differentiated enhancements that improve cross-domain robustness and discriminability. At the proxy level, a Dynamic Confidence-weighted Clustering Proxy (DCCP) employs confidence-guided initialization and a two-stage online-offline update strategy to stabilize proxy optimization and suppress proxy noise. Extensive experiments on the CUHK-SYSU and PRW benchmarks demonstrate that RPPS achieves state-of-the-art performance and strong robustness, underscoring the importance of enhancing pseudo-supervision reliability in UDA person search. Our code is accessible at https://github.com/zqx951102/RPPS.
Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts
Pub Date : 2026-01-21 DOI: 10.1109/TIP.2026.3654473
Zhong Ji;Rongshuai Wei;Jingren Liu;Yanwei Pang;Jungong Han
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone's feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%–8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
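The geometry-aware concept discrimination loss is described as enforcing orthogonality among concepts; a minimal sketch of one such penalty (an assumed form that pushes the Gram matrix of normalized concept vectors toward the identity, not necessarily the paper's exact loss):

# Sketch: penalize the mean squared off-diagonal cosine similarity between concept vectors.
import torch
import torch.nn.functional as F

def concept_orthogonality_loss(concepts):
    # concepts: (K, d) prototypical concept vectors
    c = F.normalize(concepts, dim=-1)
    gram = c @ c.t()                                  # (K, K) pairwise cosine similarities
    eye = torch.eye(c.shape[0], device=c.device)
    return ((gram - eye) ** 2).sum() / (c.shape[0] * (c.shape[0] - 1))

loss = concept_orthogonality_loss(torch.randn(16, 128, requires_grad=True))
loss.backward()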
{"title":"Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts","authors":"Zhong Ji;Rongshuai Wei;Jingren Liu;Yanwei Pang;Jungong Han","doi":"10.1109/TIP.2026.3654473","DOIUrl":"10.1109/TIP.2026.3654473","url":null,"abstract":"Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone’s feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%–8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"930-942"},"PeriodicalIF":13.7,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146021122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Imbalanced Multiclassification Challenges in Whole Slide Image: Cross-Patient Pseudo Bags Generation and Curriculum Contrastive Learning With Dynamic Rebalancing
Pub Date : 2026-01-21 DOI: 10.1109/TIP.2026.3654402
Yonghuang Wu;Xuan Xie;Chengqian Zhao;Pengfei Song;Feiyu Yin;Guoqing Wu;Jinhua Yu
The multi-classification of histopathological images under imbalanced sample conditions remains a long-standing unresolved challenge in computational pathology. In this paper, we propose for the first time a cross-patient pseudo-bag generation technique to address this challenge. Our key innovation lies in a cross-patient pseudo-bag generation framework that extracts complementary pathological features to construct distributionally consistent pseudo-bags. To resolve the critical challenge of distributional alignment in pseudo-bag generation, we propose an affinity-driven curriculum contrastive learning strategy, integrating sample affinity metrics with progressive training to stabilize representation learning. Unlike prior methods focused on bag-level embeddings, our framework pioneers a paradigm shift toward multi-instance feature distribution mining, explicitly modeling inter-bag heterogeneity to address class imbalance. Our method demonstrates significant performance improvements on three datasets of differing classification difficulty, outperforming the second-best method by an average of 1.95 percentage points in F1 score and 2.07 percentage points in accuracy (ACC).
A Variational Multi-Scale Model for Multi-Exposure Image Fusion
Pub Date : 2026-01-19 DOI: 10.1109/TIP.2025.3650052
Yuming Yang;Wei Wang
Multi-exposure image fusion (MEF) is the main approach to obtaining High Dynamic Range (HDR) images by fusing multiple images taken at various exposure values. In this paper, we propose and develop a novel variational model based on detail-base decomposition for MEF. The main idea is to incorporate the decomposition procedure and the reconstruction procedure into a unified framework, and to let the detail information and the base information interact at the same time. Specifically, we make use of Tikhonov regularization to model the base layer, and we present an efficient design to obtain the detail layer, which is able to capture detailed information more effectively. Meanwhile, we incorporate multi-scale techniques to remove halo artifacts. Numerically, we apply the alternating direction method of multipliers (ADMM) to solve the proposed minimization problem. Theoretically, we study the existence of the solution of the proposed model and the convergence of the proposed ADMM algorithm. Experimental examples demonstrate that the performance of the proposed model is better than that of the other tested methods in terms of visual quality and several quantitative criteria; e.g., the proposed model gives the best Natural Image Quality Evaluator (NIQE) values, with 1%-10% improvement in the real-image fusion experiments, and the best PSNR values, with 13%-20% improvement in the synthetic-image fusion experiment.
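To make the detail-base decomposition concrete, a generic Tikhonov-regularized splitting of each exposure (an illustrative single-scale form; the paper's model couples decomposition, reconstruction, and multi-scale terms in one functional) is

B_k = \arg\min_{B}\ \|B - I_k\|_2^2 + \mu \|\nabla B\|_2^2, \qquad D_k = I_k - B_k,

where I_k is the k-th exposure, B_k its smooth base layer, D_k its detail layer, and \mu > 0 a smoothness weight; the base and detail layers of all exposures are then fused separately and recombined, with the coupled problem solved by ADMM.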
{"title":"A Variational Multi-Scale Model for Multi-Exposure Image Fusion","authors":"Yuming Yang;Wei Wang","doi":"10.1109/TIP.2025.3650052","DOIUrl":"10.1109/TIP.2025.3650052","url":null,"abstract":"Multi-exposure image fusion (MEF) is the main method to obtain High Dynamic Range (HDR) images by fusing multiple images taken under various exposure values. In this paper, we propose and develop a novel variational model based on detail-base decomposition for MEF. The main idea is to incorporate the decomposition procedure and the reconstruction procedure into a unified framework, and to interact the detail information and the base information at the same time. Specifically, we make use of Tikhonov regularization to model the base layer, and we present an efficient design to obtain the detail layer, which is able to capture more detailed information effectively. Meanwhile, we incorporate multi-scale techniques to remove halo artifacts. Numerically, we apply alternating direction method of multipliers (ADMM) to solve the proposed minimization problem. Theoretically, we study the existence of the solution of the proposed model and the convergence of the proposed ADMM algorithm. Experimental examples are presented to demonstrate that the performance of the proposed model is better than that by using other testing methods in terms of visual quality and some criteria, e. g., the proposed model gives the best Natural image quality evaluator (NIQE) values with 1% - 10% improvement for real image fusion experiments and gives the best PSNR values with 13% - 20% improvement for the synthetic image fusion experiment.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"701-716"},"PeriodicalIF":13.7,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146000606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning
Pub Date : 2026-01-16 DOI: 10.1109/TIP.2026.3653189
Xiang Fang;Shihua Zhang;Hao Zhang;Xiaoguang Mei;Huabing Zhou;Jiayi Ma
Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing the different information underlying them. Previous methods either treat this information equally or require explicit storage of the entire context, tending to be laborious in real-world scenarios. Inspired by Mamba's inherent selectivity, we propose CorrMamba, a correspondence filter leveraging Mamba's ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent Mamba from being impacted by unordered keypoints, which would obscure its ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and analysis demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°, highlighting its practical superiority. Our code is publicly available at https://github.com/ShineFox/CorrMamba.
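A minimal sketch of how the Gumbel-Softmax trick can impose a learnable ordering on unordered keypoint features before a sequential model (an assumed construction for illustration; CorrMamba's actual causal sequential learning module is more involved):

# Sketch: at each step, softly select the next keypoint feature with a straight-through
# Gumbel-Softmax sample, building a causally ordered sequence whose selection scores
# receive gradients.
import torch
import torch.nn.functional as F

def gumbel_ordered_sequence(feats, scorer, tau=0.5):
    # feats: (N, d) unordered keypoint features; scorer: module mapping (M, d) -> (M, 1) logits
    remaining = feats
    ordered = []
    for _ in range(feats.shape[0]):
        logits = scorer(remaining).squeeze(-1)                # (M,)
        w = F.gumbel_softmax(logits, tau=tau, hard=True)      # one-hot with straight-through grads
        ordered.append(w @ remaining)                         # softly "select" one feature
        remaining = remaining[w < 0.5]                        # drop the selected feature
    return torch.stack(ordered)                               # (N, d) ordered sequence

feats = torch.randn(32, 64)
scorer = torch.nn.Linear(64, 1)
seq = gumbel_ordered_sequence(feats, scorer)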
{"title":"Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning","authors":"Xiang Fang;Shihua Zhang;Hao Zhang;Xiaoguang Mei;Huabing Zhou;Jiayi Ma","doi":"10.1109/TIP.2026.3653189","DOIUrl":"10.1109/TIP.2026.3653189","url":null,"abstract":"Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing their underlying different information. Previous methods either treat the information equally or require the explicit storage of the entire context, tending to be laborious in real-world scenarios. Inspired by Mamba’s inherent selectivity, we propose CorrMamba, a Correspondence filter leveraging Mamba’s ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent Mamba from being potentially impacted by unordered keypoints that obscured its ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and analysis demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°, highlighting its practical superiority. Our code is publicly available at <uri>https://github.com/ShineFox/CorrMamba</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"816-829"},"PeriodicalIF":13.7,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145991971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Pub Date : 2026-01-16 DOI: 10.1109/TIP.2025.3649356
Yunxin Li;Zhenyu Liu;Baotian Hu;Wei Wang;Yuxin Ding;Xiaochun Cao;Min Zhang
Recent advancements in multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into the language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We term this approach LLMs for Vision because it employs LLMs for visual understanding and reasoning, yet we observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, which could be regarded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs by empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on image-text understanding multimodal benchmarks. The code will be available at: https://github.com/HITsz-TMG/MKS2-Multimodal-Knowledge-Storage-and-Sharing.