Pub Date: 2026-02-02; DOI: 10.1109/TIP.2026.3658223
Ming Jin;Richang Hong
In video-text cross-domain retrieval, the generalization ability of retrieval models is key to their performance and practical applicability. However, existing retrieval models exhibit significant deficiencies in cross-domain generalization. On one hand, models tend to overfit the specific training domain, resulting in poor cross-domain matching and significantly reduced retrieval accuracy on data from different, new, or mixed domains. On the other hand, although data augmentation is a vital strategy for enhancing generalization, most existing methods focus on unimodal augmentation and fail to fully exploit the multimodal correlations between video and text. As a result, the augmented data lack semantic diversity, which further limits the model's ability to understand and perform in complex cross-domain scenarios. To address these challenges, this paper proposes a collaborative augmentation approach named MDA-MAA, which comprises two core modules: the Masked Attention Augmentation (MAA) module and the Multimodal Diffusion Augmentation (MDA) module. The MAA module masks the original video frame features and uses an attention mechanism to predict the masked features, effectively reducing overfitting to the training data and enhancing generalization. The MDA module generates subtitles from video frames and uses the LLaMA model to infer comprehensive video captions. These captions, combined with the original video frames, are fed into a diffusion model for joint learning, ultimately generating semantically enriched augmented video frames. This process leverages the multimodal relationship between video and text to increase the diversity of the training data distribution.
Experimental results demonstrate that this collaborative augmentation method significantly improves the performance of video-text cross-domain retrieval models, validating its effectiveness in enhancing model generalization.
Title: "MDA-MAA: A Collaborative Augmentation Approach for Generalizing Cross-Domain Retrieval" (IEEE Transactions on Image Processing, vol. 35, pp. 1595-1606).
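The masked-attention idea in the MAA module can be pictured in a few lines: mask a subset of frame features and reconstruct each masked one as an attention-weighted combination of the visible features. The sketch below is a toy plain-Python version; the mean-of-visible query, the 25% masking ratio, and the function names are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def attend_reconstruct(feats, mask_idx):
    """Predict one masked frame feature as an attention-weighted
    combination of the remaining (visible) frame features.
    The query is taken as the mean of the visible features,
    a deliberate simplification of a learned query."""
    visible = [f for i, f in enumerate(feats) if i != mask_idx]
    dim = len(visible[0])
    query = [sum(f[d] for f in visible) / len(visible) for d in range(dim)]
    # Scaled dot-product scores, then a numerically stable softmax.
    scores = [sum(q * v for q, v in zip(query, f)) / math.sqrt(dim)
              for f in visible]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    return [sum(w * f[d] for w, f in zip(weights, visible))
            for d in range(dim)]

def masked_augment(feats, mask_ratio=0.25, seed=0):
    """Replace a random subset of frame features with attention
    predictions, yielding a regularized, augmented sequence."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(feats) * mask_ratio))
    masked = set(rng.sample(range(len(feats)), n_mask))
    return [attend_reconstruct(feats, i) if i in masked else f
            for i, f in enumerate(feats)]
```

Because each reconstructed feature is a convex combination of the visible features, the augmented sequence stays inside the original feature range, which is what makes the perturbation gentle enough to act as regularization.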
Pub Date: 2026-01-30; DOI: 10.1109/TIP.2026.3653202
Onur Keleş;A. Murat Tekalp
Neural networks commonly employ the McCulloch-Pitts neuron model, which is a linear model followed by a point-wise non-linear activation. Various researchers have already advanced inherently non-linear neuron models, such as quadratic neurons, generalized operational neurons, generative neurons, and super neurons, which offer stronger non-linearity than point-wise activation functions. In this paper, we introduce a novel non-linear neuron model called Padé neurons (Paons), inspired by Padé approximants. Paons offer several advantages, such as diversity of non-linearity, since each Paon learns a different non-linear function of its inputs, and layer efficiency, since Paons provide stronger non-linearity in far fewer layers than piecewise linear approximation. Furthermore, Paons include all previously proposed neuron models as special cases, so any neuron model in any network can be replaced by Paons. We note that there has been a proposal to employ the Padé approximation as a generalized point-wise activation function, which is fundamentally different from our model. To validate the efficacy of Paons, in our experiments, we replace classic neurons in some well-known neural image super-resolution, compression, and classification models based on the ResNet architecture with Paons. Our comprehensive experimental results and analyses demonstrate that neural models built with Paons provide better or equal performance than their classic counterparts with fewer layers. The PyTorch implementation code for Paon is open-sourced at https://github.com/onur-keles/Paon
Title: "Padé Neurons for Efficient Neural Models" (IEEE Transactions on Image Processing, vol. 35, pp. 1508-1520).
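For intuition, a Padé-approximant-style neuron outputs a ratio of two learned responses of its input rather than a linear response passed through a fixed activation. The minimal sketch below assumes first-order numerator and denominator terms with an absolute-value stabilizer in the denominator (a common trick to keep it away from zero); the paper's actual parameterization, including its convolutional form, will differ.

```python
def paon(x, num_w, num_b, den_w, den_b, eps=1e-6):
    """Toy Padé neuron: output = affine(x) / (1 + |affine(x)|).
    The ratio makes the neuron inherently non-linear, with no
    point-wise activation function applied afterwards.
    Parameter names and the stabilization are assumptions."""
    num = sum(w * xi for w, xi in zip(num_w, x)) + num_b
    den = 1.0 + abs(sum(w * xi for w, xi in zip(den_w, x)) + den_b)
    return num / (den + eps)
```

Doubling the input does not double the output when the denominator weights are non-zero, which is exactly the non-linearity a McCulloch-Pitts neuron lacks before its activation.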
Pub Date: 2026-01-30; DOI: 10.1109/TIP.2026.3657183
Xinran Qin;Yuning Cui;Shangquan Sun;Ruoyu Chen;Wenqi Ren;Alois Knoll;Xiaochun Cao
Multi-modal image fusion (MMIF) aims to integrate complementary information from heterogeneous sensor modalities. However, substantial cross-modality discrepancies hinder joint scene representation and lead to semantic degradation in the fused output. To address this limitation, we propose C2MFuse, a novel framework designed to preserve content while ensuring cross-modality consistency. To the best of our knowledge, this is the first MMIF approach to explicitly disentangle style and content representations across modalities for image fusion. C2MFuse introduces a content-preserving style normalization mechanism that suppresses modality-specific variations while maintaining the underlying scene structure. The normalized features are then progressively aggregated to enhance fine-grained details and improve content completeness. In light of the lack of ground truth and the inherent ambiguity of the fused distribution, we further align the fused representation with a well-defined source modality, thereby enhancing semantic consistency and reducing distributional uncertainty. Additionally, we introduce an adaptive consistency loss with learnable transformation, which provides dynamic, modality-aware supervision by enforcing global consistency across heterogeneous inputs. Extensive experiments on five datasets across three representative MMIF tasks demonstrate that C2MFuse achieves efficient and high-quality fusion, surpasses existing methods, and generalizes effectively to downstream visual applications.
Title: "Disentangle to Fuse: Toward Content Preservation and Cross-Modality Consistency for Multi-Modality Image Fusion" (IEEE Transactions on Image Processing, vol. 35, pp. 1756-1770).
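One way to picture content-preserving style normalization is as removing the first- and second-order statistics (a proxy for modality "style") from each channel while keeping the relative spatial structure (the "content"). The sketch below is an instance-norm-style simplification over a flat channel, not C2MFuse's learned mechanism.

```python
import math

def style_normalize(feat, eps=1e-5):
    """Normalize one channel of a feature map to zero mean and
    unit variance, suppressing modality-specific statistics
    while preserving the ordering and relative structure of
    activations. A simplified stand-in for the paper's
    content-preserving style normalization."""
    mean = sum(feat) / len(feat)
    var = sum((v - mean) ** 2 for v in feat) / len(feat)
    return [(v - mean) / math.sqrt(var + eps) for v in feat]
```

Two modalities with very different brightness or contrast map to the same statistical range after this step, which is the precondition for aggregating their features without one modality dominating.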
Pub Date: 2026-01-29; DOI: 10.1109/TIP.2026.3657636
Yun Liu;Tao Li;Chunping Tan;Wenqi Ren;Cosmin Ancuti;Weisi Lin
Image dehazing, a crucial task in low-level vision, supports numerous practical applications, such as autonomous driving, remote sensing, and surveillance. This paper proposes IHDCP, a novel Inverted Haze Density Correction Prior for efficient single image dehazing. It is observed that the medium transmission can be effectively modeled from the inverted haze density map using correction functions with various gamma coefficients. Based on this observation, a pixel-wise gamma correction coefficient is introduced to formulate the transmission as a function of the inverted haze density map. To estimate the transmission, IHDCP is first incorporated into the classic atmospheric scattering model (ASM), leading to a transcendental equation that is subsequently simplified to a quadratic form with a single unknown parameter via a Taylor expansion. Then, boundary constraints are designed to estimate this model parameter, and the gamma correction coefficient map is derived via Vieta's formulas. Finally, the haze-free result is recovered through ASM inversion. Experimental results on diverse synthetic and real-world datasets verify that our algorithm not only provides visually appealing dehazing performance with high computational efficiency, but also outperforms several state-of-the-art dehazing approaches in both subjective and objective evaluations. Moreover, our IHDCP generalizes well to various types of degraded scenes. Our code is available at https://github.com/TaoLi-TL/IHDCP.
Title: "IHDCP: Single Image Dehazing Using Inverted Haze Density Correction Prior" (IEEE Transactions on Image Processing, vol. 35, pp. 1448-1461).
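The prior can be illustrated numerically: treat the transmission as a gamma correction of the inverted haze density, t = (1 - density)^gamma, then invert the atmospheric scattering model I = J*t + A*(1 - t). In this toy single-pixel sketch the per-pixel gamma is given directly rather than estimated via the boundary constraints and Vieta-based derivation described above.

```python
def dehaze_pixel(I, A, density, gamma, t_min=0.05):
    """Recover scene radiance J for one pixel under the
    atmospheric scattering model, with transmission modeled as
    a gamma correction of the inverted haze density (the IHDCP
    idea in miniature). t_min guards against division by a
    near-zero transmission in dense haze."""
    t = max((1.0 - density) ** gamma, t_min)
    return (I - A) / t + A
```

As a sanity check, synthesizing a hazy pixel from J = 0.2, A = 0.9, t = 0.5 and dehazing it with the matching density and gamma recovers the original radiance exactly.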
Pub Date: 2026-01-28; DOI: 10.1109/TIP.2026.3655121
Tianfang Zhang;Lei Li;Yang Zhou;Wentao Liu;Chen Qian;Jenq-Neng Hwang;Xiangyang Ji
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability. However, pairwise token affinities and complex matrix operations limit deployment in resource-constrained scenarios and real-time applications, such as mobile devices, despite considerable effort in previous works. In this paper, we introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to balance efficiency and performance in mobile applications. First, we argue that the ability of token mixers to obtain global contextual information hinges on multiple information interactions, such as the spatial and channel domains. We then propose the Convolutional Additive Token Mixer (CATM), which employs underlying spatial and channel attention as novel interaction forms and eliminates costly operations such as matrix multiplication and Softmax. We introduce a hybrid architecture of Convolutional Additive Self-attention (CAS) blocks, using CATM in each block, and further build a family of lightweight networks that can be easily extended to various downstream tasks. Finally, we evaluate CAS-ViT across a variety of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. Our M and T models achieve 83.0%/84.1% top-1 accuracy with only 12M/21M parameters on ImageNet-1K. Meanwhile, throughput evaluations on GPUs, ONNX, and iPhones also demonstrate superior results compared with other state-of-the-art backbones. Extensive experiments demonstrate that our approach achieves a better balance of performance, inference efficiency, and ease of deployment. Our code and model are available at: https://github.com/Tianfang-Zhang/CAS-ViT
Title: "CAS-ViT: Convolutional Additive Self-Attention Vision Transformers for Efficient Mobile Applications" (IEEE Transactions on Image Processing, vol. 35, pp. 1899-1909).
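The additive-mixing idea, replacing pairwise token affinities and Softmax with cheap per-channel and per-token gates that are summed, can be sketched as below. The sigmoid gates over simple mean statistics are simplified stand-ins for CATM's learned convolutions; no token-token matrix multiplication occurs anywhere.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def additive_token_mixer(tokens):
    """Toy additive token mixing: a channel gate (one value per
    channel, from a global per-channel mean) and a spatial gate
    (one value per token, from a per-token mean) are combined
    additively and applied to each token. Cost is linear in the
    number of tokens, unlike pairwise self-attention."""
    n, d = len(tokens), len(tokens[0])
    chan_gate = [sigmoid(sum(t[c] for t in tokens) / n) for c in range(d)]
    spat_gate = [sigmoid(sum(t) / d) for t in tokens]
    return [[t[c] * (chan_gate[c] + spat_gate[i]) for c in range(d)]
            for i, t in enumerate(tokens)]
```

The output keeps the input's token-by-channel shape, so such a mixer can drop into a block wherever a self-attention layer would sit.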
Pub Date: 2026-01-28; DOI: 10.1109/TIP.2026.3657171
Hao Chen;Haoran Zhou;Yunshu Zhang;Zheng Lin;Yongjian Deng
In the RGB-D vision community, extensive research has focused on designing multi-modal learning strategies and fusion structures. However, the complementarity and fusion mechanisms in RGB-D models remain an opaque box. In this paper, we present an analytical framework and a novel score for dissecting RGB-D learning. Our approach measures the proposed semantic variance and feature similarity across modalities and levels, conducting visual and quantitative analyses of multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, the evolution rules within each modality, and the collaboration logic used when optimizing an RGB-D model. Our studies reveal and verify several important findings, such as the discrepancy in cross-modal features and a hybrid multi-modal cooperation rule that leverages consistency and specialty simultaneously for complementary inference. We also showcase the versatility of the proposed RGB-D dissection method and introduce a straightforward fusion strategy based on our findings, which delivers significant enhancements across various tasks and even other multi-modal data.
Title: "Dissecting RGB-D Learning for Improved Multi-Modal Fusion" (IEEE Transactions on Image Processing, vol. 35, pp. 1846-1857).
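A minimal instance of the per-level cross-modal measurement such an analysis relies on is a cosine similarity between RGB and depth features taken at the same layer; the paper's proposed score and semantic-variance measure are more elaborate, so treat this as the simplest possible probe.

```python
import math

def cross_modal_similarity(rgb_feat, depth_feat):
    """Cosine similarity between flattened RGB and depth feature
    vectors from the same network level. Values near 1 indicate
    consistent (redundant) cross-modal features; values near 0
    indicate modality-specific (complementary) features."""
    dot = sum(a * b for a, b in zip(rgb_feat, depth_feat))
    na = math.sqrt(sum(a * a for a in rgb_feat))
    nb = math.sqrt(sum(b * b for b in depth_feat))
    return dot / (na * nb)
```

Tracking this value layer by layer is one way to observe where a model's modalities agree and where they specialize, which is the kind of evidence the consistency-versus-specialty findings rest on.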
Pub Date: 2026-01-28; DOI: 10.1109/TIP.2026.3652003
Guangzhao Dai;Shuo Wang;Hao Zhao;Bin Zhu;Qianru Sun;Xiangbo Shu
Vision-and-Language Navigation in continuous environments (VLN-CE) requires an embodied robot to navigate to a target destination by following a natural language instruction. Most existing methods use panoramic RGB-D cameras for 360° observation of environments. However, these methods struggle in real-world applications because of the higher cost of panoramic RGB-D cameras. This paper studies a low-cost and practical VLN-CE setting, i.e., using monocular cameras with a limited field of view, which means "Look Less" for visual observations and environment semantics. We propose the ThinkMatter framework for monocular VLN-CE, which motivates monocular robots to "Think More" by 1) generating novel views and 2) integrating instruction semantics. Specifically, we achieve the former with the proposed 3DGS-based panoramic generation, which renders novel views at each step from past observation collections. We achieve the latter with the proposed occupancy-instruction semantic enhancement, which integrates the spatial semantics of occupancy maps with the textual semantics of language instructions. These operations give monocular robots wider environment perception as well as transparent semantic connections to the instruction. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of ThinkMatter, providing a promising practice for real-world navigation.
Title: "ThinkMatter: Panoramic-Aware Instructional Semantics for Monocular Vision-and-Language Navigation" (IEEE Transactions on Image Processing, vol. 35, pp. 1937-1950).
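As a loose illustration of occupancy-instruction matching, one could score each candidate direction's occupancy/semantic embedding against the instruction embedding and pick the best match. `pick_direction` and its plain dot-product criterion are hypothetical simplifications; ThinkMatter fuses these semantics inside its policy rather than with a hand-written argmax.

```python
def pick_direction(instruction_vec, direction_vecs):
    """Score each candidate direction's semantic embedding
    against the instruction embedding by dot product and return
    the index of the best-matching direction. A toy stand-in
    for learned occupancy-instruction semantic fusion."""
    scores = [sum(a * b for a, b in zip(instruction_vec, d))
              for d in direction_vecs]
    return max(range(len(scores)), key=scores.__getitem__)
```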
Pub Date: 2026-01-21; DOI: 10.1109/TIP.2026.3654367
Xiang Fang;Zizhuo Li;Jiayi Ma
Recent advancements have led the image matching community to increasingly focus on obtaining subpixel-level correspondences in a detector-free manner, i.e., semi-dense feature matching. Existing methods tend to overfocus on low-level local features while ignoring equally important high-level semantic information. To tackle these shortcomings, we propose SigMa, a semantic similarity-guided semi-dense feature matching method, which leverages the strengths of both local features and high-level semantic features. First, we design a dual-branch feature extractor, comprising a convolutional network and a vision foundation model, to extract low-level local features and high-level semantic features, respectively. To fully retain the advantages of these two features and effectively integrate them, we also introduce a cross-domain feature adapter, which overcomes their spatial-resolution mismatches, channel dimensionality variations, and inter-domain gaps. Furthermore, we observe that applying the transformer to the whole feature map is unnecessary because of the similarity of local representations. We therefore design a guided pooling method based on semantic similarity. This strategy performs attention computation over selected, highly semantically similar regions, minimizing information loss while maintaining computational efficiency. Extensive experiments on multiple datasets demonstrate that our method achieves a competitive accuracy-efficiency trade-off across various tasks and exhibits strong generalization capabilities across different datasets. Additionally, we conduct a series of ablation studies and analysis experiments to validate the effectiveness and rationality of our method's design. Our code is publicly available at https://github.com/ShineFox/SigMa
"SigMa: Semantic Similarity-Guided Semi-Dense Feature Matching," by Xiang Fang, Zizhuo Li, and Jiayi Ma. IEEE Transactions on Image Processing, vol. 35, pp. 872–887.
Unsupervised Domain Adaptation (UDA) person search aims to adapt models trained on labeled source data to unlabeled target domains. Existing approaches typically rely on clustering-based proxy learning, but their performance is often undermined by unreliable pseudo-supervision. This unreliability mainly stems from two challenges: (i) spectral shift bias, where low- and high-frequency components behave differently under domain shifts but are rarely considered, degrading feature stability; and (ii) static proxy updates, which make clustering proxies highly sensitive to noise and less adaptable to domain shifts. To address these challenges, we propose the Reliable Pseudo-supervision in UDA Person Search (RPPS) framework. At the feature level, a Dual-branch Wavelet Enhancement Module (DWEM) embedded in the backbone applies discrete wavelet transform (DWT) to decompose features into low- and high-frequency components, followed by differentiated enhancements that improve cross-domain robustness and discriminability. At the proxy level, a Dynamic Confidence-weighted Clustering Proxy (DCCP) employs confidence-guided initialization and a two-stage online–offline update strategy to stabilize proxy optimization and suppress proxy noise. Extensive experiments on the CUHK-SYSU and PRW benchmarks demonstrate that RPPS achieves state-of-the-art performance and strong robustness, underscoring the importance of enhancing pseudo-supervision reliability in UDA person search. Our code is accessible at https://github.com/zqx951102/RPPS
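The DWEM decomposition described above can be sketched in NumPy under a simplifying assumption: a one-level Haar wavelet and fixed per-band gains stand in for the paper's learned, differentiated enhancements. The function names are illustrative, not from the RPPS code:

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar DWT of a feature map x (H, W), H and W even.
    Returns the low-frequency LL band and high-frequency (LH, HL, HH) bands."""
    a, b = x[0::2, :], x[1::2, :]
    lo_r, hi_r = (a + b) / 2.0, (a - b) / 2.0  # row-wise average / detail
    def split_cols(y):
        c, d = y[:, 0::2], y[:, 1::2]
        return (c + d) / 2.0, (c - d) / 2.0
    ll, lh = split_cols(lo_r)
    hl, hh = split_cols(hi_r)
    return ll, (lh, hl, hh)

def haar_idwt2(ll, bands):
    """Exact inverse of haar_dwt2."""
    lh, hl, hh = bands
    def merge_cols(lo, hi):
        y = np.empty((lo.shape[0], lo.shape[1] * 2))
        y[:, 0::2], y[:, 1::2] = lo + hi, lo - hi
        return y
    lo_r, hi_r = merge_cols(ll, lh), merge_cols(hl, hh)
    x = np.empty((lo_r.shape[0] * 2, lo_r.shape[1]))
    x[0::2], x[1::2] = lo_r + hi_r, lo_r - hi_r
    return x

def differentiated_enhance(x, low_gain=1.0, high_gain=1.5):
    """Scale the two frequency regimes separately, then reconstruct:
    low frequencies carry domain-stable structure, high frequencies
    carry discriminative detail."""
    ll, (lh, hl, hh) = haar_dwt2(x)
    return haar_idwt2(low_gain * ll, tuple(high_gain * s for s in (lh, hl, hh)))
```

Because the forward and inverse transforms are exact inverses, setting both gains to 1.0 reproduces the input, so any change in the output is attributable to the band-wise treatment alone.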
"Reliable Pseudo-Supervision for Unsupervised Domain Adaptive Person Search," by Qixian Zhang, Duoqian Miao, Qi Zhang, Xuan Tan, Hongyun Zhang, and Cairong Zhao. IEEE Transactions on Image Processing, vol. 35, pp. 915–929.
Pub Date: 2026-01-21 DOI: 10.1109/TIP.2026.3654473
Zhong Ji;Rongshuai Wei;Jingren Liu;Yanwei Pang;Jungong Han
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone’s feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%–8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
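One common way to realize an orthogonality-enforcing concept loss like the one described above is to drive the Gram matrix of normalized concept prototypes toward the identity. The sketch below is a minimal NumPy version of that idea, not the paper's geometry-aware formulation; the function name is illustrative:

```python
import numpy as np

def concept_orthogonality_loss(prototypes):
    """Penalize overlap between concept prototypes.

    prototypes: (C, D) array of C concept vectors.
    After L2 normalization, the Gram matrix P @ P.T holds pairwise cosine
    similarities; for fully disentangled concepts it equals the identity.
    Returns the squared Frobenius norm of (P @ P.T - I).
    """
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    gram = p @ p.T                      # (C, C) cosine-similarity matrix
    return float(np.sum((gram - np.eye(len(p))) ** 2))
```

The loss is zero exactly when all concept directions are mutually orthogonal, and it grows as concepts collapse onto one another, which is the failure mode the discrimination loss is meant to prevent.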
"Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts," by Zhong Ji, Rongshuai Wei, Jingren Liu, Yanwei Pang, and Jungong Han. IEEE Transactions on Image Processing, vol. 35, pp. 930–942.