Person re-identification (ReID) under real-world multi-modal settings remains constrained by the lack of unified, diverse datasets and modality-aware learning strategies. To bridge this gap, we propose Multi-modal ReID (MM-ReID), a large-scale dataset encompassing 8,537 unique identities, 0.85 million images spanning three aligned image modalities (RGB, Infrared (IR), and Thermal Infrared (TI)), and 0.83 million natural language descriptions. MM-ReID captures diverse scenarios, including indoor/outdoor scenes, cross-camera views, day and night conditions, and clothing changes, offering a comprehensive foundation for multi-modal ReID research. To build a unified multi-modal person ReID model, we introduce Cross-Modal Semantic Anchoring (CMSA): CMSA injects fixed vision-language embeddings as parameter-free semantic anchors that steer a ViT towards a modality-agnostic, language-aware space, enabling rich semantic transfer through text-vision alignment. Our training incorporates two synergistic loss functions. The Caption-Adaptive Triplet loss dynamically adjusts the triplet margin according to caption similarity, enforcing harder negatives when textual descriptions overlap and yielding stronger discrimination. The Caption-Aware CIM-T loss (Cross-Identity Inter-modal Margin with Text) simultaneously enlarges inter-identity gaps and contracts intra-identity distances across RGB-IR-TI views, guided by caption context to resolve ambiguous appearances. Our method attains 79.4 mAP and 97.5 Rank-5 accuracy (R-5) on the Market1501-MM dataset, improving on prior state-of-the-art approaches by +1.4 mAP and +0.7 R-5. Extensive experiments on MM-ReID demonstrate superior generalization and adaptability across unseen modalities and domains. Our approach establishes a new paradigm for modality-extensible and interpretable multi-modal ReID research.
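
To make the caption-adaptive margin idea concrete, the sketch below shows one plausible PyTorch formulation: the triplet margin grows with the cosine similarity of the anchor and negative caption embeddings, so identities with overlapping textual descriptions are pushed further apart. The function name, the parameters `base_margin` and `scale`, and the linear margin schedule are our own illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def caption_adaptive_triplet_loss(anchor, positive, negative,
                                  cap_anchor, cap_negative,
                                  base_margin=0.3, scale=0.2):
    """Illustrative caption-adaptive triplet loss (assumed design, not the paper's).

    anchor/positive/negative: image feature embeddings, shape (B, D).
    cap_anchor/cap_negative:  caption embeddings for the anchor and negative IDs.
    """
    # Caption similarity in [0, 1]; higher means textual descriptions overlap more.
    cap_sim = F.cosine_similarity(cap_anchor, cap_negative, dim=-1).clamp(min=0.0)
    # Enlarge the margin for textually similar (harder) negatives.
    margin = base_margin + scale * cap_sim
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Example usage with random embeddings (batch of 4, 256-dim features, 512-dim captions).
a, p, n = (torch.randn(4, 256) for _ in range(3))
ca, cn = torch.randn(4, 512), torch.randn(4, 512)
loss = caption_adaptive_triplet_loss(a, p, n, ca, cn)
```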
