RGB-infrared object detection aims to improve detection performance in complex environments by integrating complementary information from RGB and infrared images. Transformer-based methods have advanced this field by directly modeling dense relationships between modality tokens to enable long-range cross-modality interactions, but they neglect the inherent discrepancies in feature distributions across modalities. Such discrepancies reduce the reliability of the modeled relationships and thereby restrict the effective exploitation of complementary information between the modalities. To alleviate this problem, we propose a framework that learns modality knowledge through proxies. The core innovation is a proxy-guided cross-modality feature fusion module, which realizes dual-modality interaction by using lightweight proxy tokens as intermediate representations. Specifically, self-attention is first used to let the proxy tokens capture the global information of each modality; the relationship between the dual-modality proxy tokens is then constructed to capture complementary information while mitigating the interference of modality discrepancies; finally, the knowledge in the updated proxy tokens is fed back to each modality through cross-attention to enhance its features. Additionally, a mixture of knowledge-decoupled experts module is designed to effectively fuse the enhanced features of the two modalities. This module leverages multiple gating networks to assign modality-specific and modality-shared knowledge to separate expert groups, thereby highlighting the advantageous features of each modality. Extensive experiments on four RGB-infrared datasets demonstrate that our method outperforms existing state-of-the-art methods.
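As a concrete illustration of the three-step proxy interaction described above, a minimal PyTorch sketch follows. The class name ProxyFusion, the proxy count, and the use of nn.MultiheadAttention for each step are assumptions of this sketch, not details taken from the paper.

```python
# Hypothetical sketch of proxy-guided cross-modality feature fusion.
# All names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class ProxyFusion(nn.Module):
    def __init__(self, dim=256, num_proxies=8, num_heads=8):
        super().__init__()
        self.num_proxies = num_proxies
        # Lightweight learnable proxy tokens, one small set per modality.
        self.proxy_rgb = nn.Parameter(torch.randn(num_proxies, dim))
        self.proxy_ir = nn.Parameter(torch.randn(num_proxies, dim))
        self.sa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.exchange = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.feedback = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def _gather(self, proxies, tokens):
        # Step 1: self-attention over [proxies; tokens]; keeping the proxy
        # slice lets the proxies summarize the modality's global information.
        cat = torch.cat([proxies, tokens], dim=1)
        cat, _ = self.sa(cat, cat, cat)
        return cat[:, : self.num_proxies]

    def forward(self, rgb, ir):
        # rgb, ir: (B, N, dim) flattened feature maps from each backbone.
        B = rgb.size(0)
        p_rgb = self._gather(self.proxy_rgb.expand(B, -1, -1), rgb)
        p_ir = self._gather(self.proxy_ir.expand(B, -1, -1), ir)
        # Step 2: relate only the two small proxy sets, so dense RGB-IR token
        # relations (and their distribution mismatch) are never modeled directly.
        p_rgb2, _ = self.exchange(p_rgb, p_ir, p_ir)
        p_ir2, _ = self.exchange(p_ir, p_rgb, p_rgb)
        # Step 3: cross-attention feeds the updated proxy knowledge back,
        # enhancing each modality's features via a residual connection.
        rgb = rgb + self.feedback(rgb, p_rgb2, p_rgb2)[0]
        ir = ir + self.feedback(ir, p_ir2, p_ir2)[0]
        return rgb, ir
```

Because the cross-modality exchange operates on a handful of proxy tokens rather than on all token pairs, it is both cheaper than dense cross-attention and less exposed to the distribution gap between modalities.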
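The gating-based expert routing can likewise be sketched as below. The class name KnowledgeDecoupledMoE, the number of experts per group, and the soft-gating scheme are again illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch: disjoint expert groups for modality-specific and
# modality-shared knowledge, each routed by its own gating network.
import torch
import torch.nn as nn


class KnowledgeDecoupledMoE(nn.Module):
    def __init__(self, dim=256, experts_per_group=2):
        super().__init__()

        def make_expert():
            return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        # Separate expert groups so specific and shared knowledge do not
        # compete inside the same experts.
        self.rgb_experts = nn.ModuleList([make_expert() for _ in range(experts_per_group)])
        self.ir_experts = nn.ModuleList([make_expert() for _ in range(experts_per_group)])
        self.shared_experts = nn.ModuleList([make_expert() for _ in range(experts_per_group)])
        # One gating network per expert group.
        self.rgb_gate = nn.Linear(dim, experts_per_group)
        self.ir_gate = nn.Linear(dim, experts_per_group)
        self.shared_gate = nn.Linear(2 * dim, experts_per_group)

    @staticmethod
    def _mix(x, experts, gate_logits):
        # Softly combine expert outputs with the gate's weights.
        w = gate_logits.softmax(dim=-1)                      # (B, N, E)
        outs = torch.stack([e(x) for e in experts], dim=-1)  # (B, N, dim, E)
        return (outs * w.unsqueeze(-2)).sum(dim=-1)          # (B, N, dim)

    def forward(self, rgb, ir):
        # rgb, ir: (B, N, dim) modality features enhanced by the fusion module.
        shared = self._mix(0.5 * (rgb + ir), self.shared_experts,
                           self.shared_gate(torch.cat([rgb, ir], dim=-1)))
        return (self._mix(rgb, self.rgb_experts, self.rgb_gate(rgb))
                + self._mix(ir, self.ir_experts, self.ir_gate(ir))
                + shared)
```

The decoupling lets the gates emphasize whichever modality carries the advantageous signal for a given region, while the shared group models knowledge common to both.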