
Latest publications in IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society

MDA-MAA: A Collaborative Augmentation Approach for Generalizing Cross-Domain Retrieval
IF 13.7 Pub Date: 2026-02-02 DOI: 10.1109/TIP.2026.3658223
Ming Jin;Richang Hong
In video-text cross-domain retrieval tasks, the generalization ability of the retrieval models is key to improving their performance and is crucial for enhancing their practical applicability. However, existing retrieval models exhibit significant deficiencies in cross-domain generalization. On one hand, models tend to overfit specific training domain data, resulting in poor cross-domain matching and significantly reduced retrieval accuracy when dealing with data from different, new, or mixed domains. On the other hand, although data augmentation is a vital strategy for enhancing model generalization, most existing methods focus on unimodal augmentation and fail to fully exploit the multimodal correlations between video and text. As a result, the augmented data lack semantic diversity, which further limits the model’s ability to understand and perform in complex cross-domain scenarios. To address these challenges, this paper proposes an innovative collaborative augmentation approach named MDA-MAA, which includes two core modules: the Masked Attention Augmentation (MAA) module and the Multimodal Diffusion Augmentation (MDA) module. The MAA module applies masking to the original video frame features and uses an attention mechanism to predict the masked features, effectively reducing overfitting to training data and enhancing model generalization. The MDA module generates subtitles from video frames and uses the LLaMA model to infer comprehensive video captions. These captions, combined with the original video frames, are integrated into a diffusion model for joint learning, ultimately generating semantically enriched augmented video frames. This process leverages the multimodal relationship between video and text to increase the diversity of the training data distribution. Experimental results demonstrate that this collaborative augmentation method significantly improves the performance of video-text cross-domain retrieval models, validating its effectiveness in enhancing model generalization.
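To make the masking-and-prediction idea concrete, below is a minimal sketch of a masked-attention augmentation step in the spirit of the MAA module. Everything here (the MaskedAttentionAugment class, the mask_ratio, embedding size, and head count) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskedAttentionAugment(nn.Module):
    """Hypothetical MAA-style augmentation: mask frame features, then let
    self-attention predict them back from the visible context."""
    def __init__(self, embed_dim=512, num_heads=8, mask_ratio=0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, frame_feats):            # (batch, num_frames, embed_dim)
        b, t, d = frame_feats.shape
        keep = torch.rand(b, t, device=frame_feats.device) >= self.mask_ratio
        # Replace the dropped frames with a shared learnable mask token.
        masked = torch.where(keep.unsqueeze(-1),
                             frame_feats, self.mask_token.expand(b, t, d))
        # Self-attention reconstructs the masked slots from the visible frames.
        augmented, _ = self.attn(masked, masked, masked)
        return augmented

feats = torch.randn(2, 16, 512)               # 2 clips, 16 frames each
aug_feats = MaskedAttentionAugment()(feats)   # perturbed training view
```

Reconstructing masked frame features from the visible context yields perturbed training views, which is what discourages the encoder from memorizing domain-specific frame statistics.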
Citations: 0
Padé Neurons for Efficient Neural Models
IF 13.7 Pub Date: 2026-01-30 DOI: 10.1109/TIP.2026.3653202
Onur Keleş;A. Murat Tekalp
Neural networks commonly employ the McCulloch-Pitts neuron model, which is a linear model followed by a point-wise non-linear activation. Various researchers have already advanced inherently non-linear neuron models, such as quadratic neurons, generalized operational neurons, generative neurons, and super neurons, which offer stronger non-linearity compared to point-wise activation functions. In this paper, we introduce a novel and better non-linear neuron model called Padé neurons (Paons), inspired by Padé approximants. Paons offer several advantages, such as diversity of non-linearity, since each Paon learns a different non-linear function of its inputs, and layer efficiency, since Paons provide stronger non-linearity in far fewer layers compared to piecewise linear approximation. Furthermore, Paons include all previously proposed neuron models as special cases; thus, any neuron model in any network can be replaced by Paons. We note that there has been a proposal to employ the Padé approximation as a generalized point-wise activation function, which is fundamentally different from our model. To validate the efficacy of Paons, in our experiments, we replace classic neurons in some well-known neural image super-resolution, compression, and classification models based on the ResNet architecture with Paons. Our comprehensive experimental results and analyses demonstrate that neural models built with Paons provide performance better than or equal to that of their classic counterparts with a smaller number of layers. The PyTorch implementation code for Paon is open-sourced at https://github.com/onur-keles/Paon
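For intuition, here is a minimal sketch of a Padé-style rational neuron whose response is a learned ratio P(x)/Q(x). The convolutional form of P and Q, the pole-free absolute-value denominator, and the smoothing epsilon are assumptions made for illustration; the actual Paon implementation is in the repository linked above.

```python
import torch
import torch.nn as nn

class PadeStyleNeuron2d(nn.Module):
    """Illustrative rational neuron: y = P(x) / (1 + |Q(x)| + eps)."""
    def __init__(self, channels, eps=1e-4):
        super().__init__()
        self.p = nn.Conv2d(channels, channels, 3, padding=1)  # numerator map
        self.q = nn.Conv2d(channels, channels, 3, padding=1)  # denominator map
        self.eps = eps

    def forward(self, x):
        # The ratio is non-linear in the inputs themselves, giving stronger
        # non-linearity than a fixed point-wise activation after a linear map.
        return self.p(x) / (1.0 + self.q(x).abs() + self.eps)

x = torch.randn(1, 16, 32, 32)
y = PadeStyleNeuron2d(16)(x)   # same shape as the input
```

Keeping the denominator bounded away from zero is the standard trick for avoiding the poles a raw Padé ratio would introduce during training.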
Citations: 0
Disentangle to Fuse: Toward Content Preservation and Cross-Modality Consistency for Multi-Modality Image Fusion
IF 13.7 Pub Date: 2026-01-30 DOI: 10.1109/TIP.2026.3657183
Xinran Qin;Yuning Cui;Shangquan Sun;Ruoyu Chen;Wenqi Ren;Alois Knoll;Xiaochun Cao
Multi-modal image fusion (MMIF) aims to integrate complementary information from heterogeneous sensor modalities. However, substantial cross-modality discrepancies hinder joint scene representation and lead to semantic degradation in the fused output. To address this limitation, we propose C2MFuse, a novel framework designed to preserve content while ensuring cross-modality consistency. To the best of our knowledge, this is the first MMIF approach to explicitly disentangle style and content representations across modalities for image fusion. C2MFuse introduces a content-preserving style normalization mechanism that suppresses modality-specific variations while maintaining the underlying scene structure. The normalized features are then progressively aggregated to enhance fine-grained details and improve content completeness. In light of the lack of ground truth and the inherent ambiguity of the fused distribution, we further align the fused representation with a well-defined source modality, thereby enhancing semantic consistency and reducing distributional uncertainty. Additionally, we introduce an adaptive consistency loss with learnable transformation, which provides dynamic, modality-aware supervision by enforcing global consistency across heterogeneous inputs. Extensive experiments on five datasets across three representative MMIF tasks demonstrate that C2MFuse achieves efficient and high-quality fusion, surpasses existing methods, and generalizes effectively to downstream visual applications.
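As a rough illustration of content-preserving style normalization, the sketch below adopts the common assumption that modality-specific "style" resides in per-channel feature statistics, so whitening those statistics suppresses modality variation while the spatial layout (content) survives. This is a generic stand-in, not the C2MFuse module.

```python
import torch

def style_normalize(feat, eps=1e-5):
    """Whiten per-channel statistics (assumed 'style') of one modality's
    features while keeping their spatial structure (assumed 'content')."""
    mu = feat.mean(dim=(2, 3), keepdim=True)       # (B, C, 1, 1)
    sigma = feat.std(dim=(2, 3), keepdim=True)
    return (feat - mu) / (sigma + eps)

ir = torch.randn(1, 64, 128, 128)    # infrared-branch features (illustrative)
vis = torch.randn(1, 64, 128, 128)   # visible-branch features (illustrative)
fused = 0.5 * (style_normalize(ir) + style_normalize(vis))  # naive aggregation
```

C2MFuse replaces the naive averaging here with progressive aggregation, but the normalization step conveys why modality-specific variation can be removed without discarding scene structure.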
Citations: 0
IHDCP: Single Image Dehazing Using Inverted Haze Density Correction Prior
IF 13.7 Pub Date: 2026-01-29 DOI: 10.1109/TIP.2026.3657636
Yun Liu;Tao Li;Chunping Tan;Wenqi Ren;Cosmin Ancuti;Weisi Lin
Image dehazing, a crucial task in low-level vision, supports numerous practical applications, such as autonomous driving, remote sensing, and surveillance. This paper proposes IHDCP, a novel Inverted Haze Density Correction Prior for efficient single image dehazing. It is observed that the medium transmission can be effectively modeled from the inverted haze density map using correction functions with various gamma coefficients. Based on this observation, a pixel-wise gamma correction coefficient is introduced to formulate the transmission as a function of the inverted haze density map. To estimate the transmission, IHDCP is first incorporated into the classic atmospheric scattering model (ASM), leading to a transcendental equation that is subsequently simplified to a quadratic form with a single unknown parameter using the Taylor expansion. Then, boundary constraints are designed to estimate this model parameter, and the gamma correction coefficient map is derived via the Vieta theorem. Finally, the haze-free result is recovered through ASM inversion. Experimental results on diverse synthetic and real-world datasets verify that our algorithm not only provides visually appealing dehazing performance with high computational efficiency, but also outperforms several state-of-the-art dehazing approaches in both subjective and objective evaluations. Moreover, our IHDCP generalizes well to various types of degraded scenes. Our code is available at https://github.com/TaoLi-TL/IHDCP.
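The recovery step rests on the classic atmospheric scattering model I = J*t + A*(1 - t), inverted as J = (I - A)/t + A. The sketch below assumes a given haze density map and, for simplicity, a single global gamma; IHDCP instead estimates a per-pixel gamma correction coefficient map, so treat this as a worked example of the prior's form rather than the method itself.

```python
import numpy as np

def dehaze_asm(I, density, A, gamma=1.5, t_min=0.1):
    """ASM inversion with an IHDCP-style transmission prior.
    I: (H, W, 3) hazy image in [0, 1]; density: (H, W) haze density in [0, 1];
    A: scalar atmospheric light; gamma: global correction coefficient (the
    paper learns this per pixel)."""
    t = np.clip((1.0 - density) ** gamma, t_min, 1.0)  # transmission from prior
    return np.clip((I - A) / t[..., None] + A, 0.0, 1.0)

I = np.random.rand(64, 64, 3)        # placeholder hazy input
density = np.random.rand(64, 64)     # placeholder haze density estimate
J = dehaze_asm(I, density, A=0.9)    # recovered scene radiance
```

Clamping t at a lower bound (t_min) is the usual safeguard against amplifying noise in densely hazed pixels where the transmission approaches zero.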
Citations: 0
CAS-ViT: Convolutional Additive Self-Attention Vision Transformers for Efficient Mobile Applications
IF 13.7 Pub Date: 2026-01-28 DOI: 10.1109/TIP.2026.3655121
Tianfang Zhang;Lei Li;Yang Zhou;Wentao Liu;Chen Qian;Jenq-Neng Hwang;Xiangyang Ji
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer’s powerful global context capability. However, pairwise token affinity and complex matrix operations limit their deployment in resource-constrained scenarios and real-time applications, such as mobile devices, although considerable efforts have been made in previous works. In this paper, we introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Firstly, we argue that the capability of token mixers to obtain global contextual information hinges on multiple information interactions, such as spatial and channel domains. Subsequently, we propose the Convolutional Additive Token Mixer (CATM), which employs underlying spatial and channel attention as novel interaction forms. This module eliminates troublesome complex operations such as matrix multiplication and Softmax. We introduce a hybrid Convolutional Additive Self-attention (CAS) block architecture and utilize CATM in each block. Further, we build a family of lightweight networks that can be easily extended to various downstream tasks. Finally, we evaluate CAS-ViT across a variety of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. Our M and T models achieve 83.0%/84.1% top-1 accuracy with only 12M/21M parameters on ImageNet-1K. Meanwhile, throughput evaluations on GPUs, ONNX, and iPhones also demonstrate superior results compared to other state-of-the-art backbones. Extensive experiments demonstrate that our approach achieves a better balance of performance, efficient inference, and ease of deployment. Our code and model are available at: https://github.com/Tianfang-Zhang/CAS-ViT
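A minimal sketch of an additive token mixer in the spirit of CATM is shown below: spatial and channel contexts are computed with convolutions, gated with sigmoids, and combined by addition, so no pairwise token matrix multiplication or Softmax is involved. The kernel sizes and the exact combination rule are assumptions; the linked repository contains the real block.

```python
import torch
import torch.nn as nn

class AdditiveTokenMixer(nn.Module):
    """Illustrative CATM-like mixer: additive spatial + channel attention,
    linear in the number of tokens (no N x N affinity matrix)."""
    def __init__(self, dim):
        super().__init__()
        self.spatial = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # spatial context
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1))          # channel context
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):  # (batch, dim, H, W)
        attn = torch.sigmoid(self.spatial(x)) + torch.sigmoid(self.channel(x))
        return self.proj(attn * x)

x = torch.randn(1, 64, 56, 56)
y = AdditiveTokenMixer(64)(x)   # O(N) cost in the token count
```

Because the gating maps are produced by convolutions and broadcast addition, the cost grows linearly with resolution, which is the property that matters on mobile hardware.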
Citations: 0
Dissecting RGB-D Learning for Improved Multi-Modal Fusion
IF 13.7 Pub Date: 2026-01-28 DOI: 10.1109/TIP.2026.3657171
Hao Chen;Haoran Zhou;Yunshu Zhang;Zheng Lin;Yongjian Deng
In the RGB-D vision community, extensive research has focused on designing multi-modal learning strategies and fusion structures. However, the complementary and fusion mechanisms in RGB-D models remain an opaque box. In this paper, we present an analytical framework and a novel score for dissecting RGB-D learning. Our approach involves measuring the proposed semantic variance and feature similarity across modalities and levels, and conducting visual and quantitative analyses of multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, the evolution rules within each modality, and the collaboration logic used when optimizing an RGB-D model. Our studies reveal/verify several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and specialty simultaneously for complementary inference. We also showcase the versatility of the proposed RGB-D dissection method and introduce a straightforward fusion strategy based on our findings, which delivers significant enhancements across various tasks and even other multi-modal data.
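One way to make the cross-modal similarity measurement concrete: the sketch below probes how similar RGB and depth activations are at a given layer, using linear CKA as a stand-in score. The paper defines its own semantic-variance and similarity measures, so treat this purely as an illustrative probe.

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between two sets of layer activations.
    X, Y: (num_samples, feature_dim) pooled features from the two branches."""
    X = X - X.mean(0, keepdim=True)   # center each feature dimension
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2      # Frobenius norm of the cross-covariance
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

rgb_feats = torch.randn(256, 512)     # layer-k features, RGB branch (dummy)
depth_feats = torch.randn(256, 512)   # layer-k features, depth branch (dummy)
print(linear_cka(rgb_feats, depth_feats))  # near 0 for random features
```

Plotting such a score layer by layer is the kind of evidence that reveals where the two modalities converge toward consistent representations and where they stay specialized.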
Citations: 0
ThinkMatter: Panoramic-Aware Instructional Semantics for Monocular Vision-and-Language Navigation
IF 13.7 Pub Date: 2026-01-28 DOI: 10.1109/TIP.2026.3652003
Guangzhao Dai;Shuo Wang;Hao Zhao;Bin Zhu;Qianru Sun;Xiangbo Shu
Vision-and-Language Navigation in continuous environments (VLN-CE) requires an embodied robot to navigate to the target destination following a natural language instruction. Most existing methods use panoramic RGB-D cameras for 360° observation of environments. However, these methods struggle in real-world applications because of the higher cost of panoramic RGB-D cameras. This paper studies a low-cost and practical VLN-CE setting, e.g., using monocular cameras with a limited field of view, which means “Look Less” in terms of visual observations and environment semantics. In this paper, we propose the ThinkMatter framework for monocular VLN-CE, where we motivate monocular robots to “Think More” by 1) generating novel views and 2) integrating instruction semantics. Specifically, we achieve the former through the proposed 3DGS-based panoramic generation, which renders novel views at each step based on past observation collections. We achieve the latter through the proposed enhancement of occupancy-instruction semantics, which integrates the spatial semantics of occupancy maps with the textual semantics of language instructions. These operations equip monocular robots with wider environment perception as well as transparent semantic connections to the instruction. Extensive experiments in both simulators and real-world environments demonstrate the effectiveness of ThinkMatter, providing a promising practice for real-world navigation.
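As a rough illustration of the occupancy-instruction integration, the sketch below fuses occupancy-map tokens with instruction-word embeddings via cross-attention. The shapes, the flattening scheme, and the use of a single attention layer are all assumptions made for this example, not the ThinkMatter architecture.

```python
import torch
import torch.nn as nn

occ_grid = torch.randn(1, 64, 32, 32)   # (batch, dim, H, W) occupancy features
instr_tokens = torch.randn(1, 20, 64)   # (batch, num_words, dim) text features

# Flatten the map cells into a token sequence so each cell can act as a query.
occ_tokens = occ_grid.flatten(2).transpose(1, 2)          # (batch, H*W, dim)
cross_attn = nn.MultiheadAttention(64, 4, batch_first=True)

# Each map cell attends to the instruction words relevant to it.
fused, _ = cross_attn(query=occ_tokens, key=instr_tokens, value=instr_tokens)
fused_grid = fused.transpose(1, 2).reshape(1, 64, 32, 32)  # back to map layout
```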
Citations: 0
SigMa: Semantic Similarity-Guided Semi-Dense Feature Matching
IF 13.7 Pub Date: 2026-01-21 DOI: 10.1109/TIP.2026.3654367
Xiang Fang;Zizhuo Li;Jiayi Ma
Recent advancements have led the image matching community to focus increasingly on obtaining subpixel-level correspondences in a detector-free manner, i.e., semi-dense feature matching. Existing methods tend to overfocus on low-level local features while ignoring equally important high-level semantic information. To tackle these shortcomings, we propose SigMa, a semantic similarity-guided semi-dense feature matching method that leverages the strengths of both local features and high-level semantic features. First, we design a dual-branch feature extractor, comprising a convolutional network and a vision foundation model, to extract low-level local features and high-level semantic features, respectively. To fully retain the advantages of these two feature types and integrate them effectively, we also introduce a cross-domain feature adapter that overcomes their spatial resolution mismatches, channel dimensionality variations, and inter-domain gaps. Furthermore, we observe that performing transformer attention over the whole feature map is unnecessary because of the similarity of local representations. We therefore design a guided pooling method based on semantic similarity. This strategy performs attention computation by selecting highly semantically similar regions, aiming to minimize information loss while maintaining computational efficiency. Extensive experiments on multiple datasets demonstrate that our method achieves a competitive accuracy-efficiency trade-off across various tasks and exhibits strong generalization across different datasets. Additionally, we conduct a series of ablation studies and analysis experiments to validate the effectiveness and rationality of our method’s design. Our code is publicly available at https://github.com/ShineFox/SigMa
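The guided pooling idea can be illustrated as follows: coarse semantic descriptors of the two images are compared, and full attention is spent only on the top-k most similar region pairs rather than on the whole feature map. The cosine scoring and the top-k budget below are assumptions made for this illustration, not SigMa's selection rule.

```python
import torch
import torch.nn.functional as F

def select_similar_regions(sem_a, sem_b, k=64):
    """Pick the k region pairs with the highest cosine similarity.
    sem_a: (Na, d), sem_b: (Nb, d) pooled region descriptors of images A and B."""
    sim = F.normalize(sem_a, dim=-1) @ F.normalize(sem_b, dim=-1).T  # (Na, Nb)
    scores, flat_idx = sim.flatten().topk(k)
    rows = flat_idx // sim.size(1)   # region index in image A
    cols = flat_idx % sim.size(1)    # region index in image B
    return rows, cols, scores        # pairs worth full attention computation

a, b = torch.randn(400, 256), torch.randn(400, 256)
rows, cols, scores = select_similar_regions(a, b)  # 64 of 160,000 candidates
```

Restricting attention to this shortlist is what keeps the cost manageable while the semantically dissimilar (and hence unmatched) regions are skipped.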
Citations: 0
Reliable Pseudo-Supervision for Unsupervised Domain Adaptive Person Search
IF 13.7 Pub Date: 2026-01-21 DOI: 10.1109/TIP.2026.3654373
Qixian Zhang;Duoqian Miao;Qi Zhang;Xuan Tan;Hongyun Zhang;Cairong Zhao
Unsupervised Domain Adaptation (UDA) person search aims to adapt models trained on labeled source data to unlabeled target domains. Existing approaches typically rely on clustering-based proxy learning, but their performance is often undermined by unreliable pseudo-supervision. This unreliability mainly stems from two challenges: (i) spectral shift bias, where low- and high-frequency components behave differently under domain shifts but are rarely considered, degrading feature stability; and (ii) static proxy updates, which make clustering proxies highly sensitive to noise and less adaptable to domain shifts. To address these challenges, we propose the Reliable Pseudo-supervision in UDA Person Search (RPPS) framework. At the feature level, a Dual-branch Wavelet Enhancement Module (DWEM) embedded in the backbone applies discrete wavelet transform (DWT) to decompose features into low- and high-frequency components, followed by differentiated enhancements that improve cross-domain robustness and discriminability. At the proxy level, a Dynamic Confidence-weighted Clustering Proxy (DCCP) employs confidence-guided initialization and a two-stage online–offline update strategy to stabilize proxy optimization and suppress proxy noise. Extensive experiments on the CUHK-SYSU and PRW benchmarks demonstrate that RPPS achieves state-of-the-art performance and strong robustness, underscoring the importance of enhancing pseudo-supervision reliability in UDA person search. Our code is accessible at https://github.com/zqx951102/RPPS
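To make the frequency split concrete, the sketch below performs the single-level low-/high-frequency decomposition a DWEM-style branch relies on, using Haar average/difference filters. The subsequent "differentiated enhancement" of each band is omitted, and none of this is the RPPS code.

```python
import torch

def haar_dwt(x):
    """Single-level Haar split of (B, C, H, W) features, H and W even:
    returns an averaged low band and three stacked difference (high) bands."""
    a = x[:, :, 0::2, 0::2]; b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]; d = x[:, :, 1::2, 1::2]
    low = (a + b + c + d) / 4                       # LL: smooth structure
    high = torch.cat([(a - b + c - d) / 4,          # LH: column differences
                      (a + b - c - d) / 4,          # HL: row differences
                      (a - b - c + d) / 4], dim=1)  # HH: texture and noise
    return low, high

feat = torch.randn(2, 64, 32, 32)
low, high = haar_dwt(feat)   # (2, 64, 16, 16) and (2, 192, 16, 16)
```

Treating the two bands separately is what lets a model compensate for the spectral shift bias described above, since low- and high-frequency components drift differently across domains.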
Citations: 0
Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts
IF 13.7 Pub Date: 2026-01-21 DOI: 10.1109/TIP.2026.3654473
Zhong Ji;Rongshuai Wei;Jingren Liu;Yanwei Pang;Jungong Han
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone’s feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%–8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
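As an illustration of the orthogonality constraint, the sketch below implements a generic concept-orthogonality penalty that pushes the Gram matrix of unit-normalized concept prototypes toward the identity. The paper's geometry-aware formulation may differ in detail; this is the textbook form of such a term.

```python
import torch
import torch.nn.functional as F

def concept_orthogonality_loss(prototypes):
    """Penalize overlap between concepts.
    prototypes: (num_concepts, dim) learnable concept vectors."""
    P = F.normalize(prototypes, dim=-1)       # unit-length concept directions
    gram = P @ P.T                            # pairwise cosine similarities
    eye = torch.eye(P.size(0), device=P.device)
    return ((gram - eye) ** 2).sum()          # zero iff concepts are orthonormal

protos = torch.randn(32, 256, requires_grad=True)
loss = concept_orthogonality_loss(protos)
loss.backward()   # gradients push the concept directions apart
```

Added to the classification objective with a small weight, this term discourages two prototypes from explaining the same visual evidence, which is what keeps the resulting explanations disentangled.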
Citations: 0