Advancing Visible-Infrared Person Re-Identification: Synergizing Visual-Textual Reasoning and Cross-Modal Feature Alignment

Impact Factor: 8.0 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Theory & Methods) · IEEE Transactions on Information Forensics and Security · Pub Date: 2025-02-11 · DOI: 10.1109/TIFS.2025.3539946
Yuxuan Qiu;Liyang Wang;Wei Song;Jiawei Liu;Zhiping Shi;Na Jiang
Volume 20, pp. 2184-2196 · Journal Article · Full text: https://ieeexplore.ieee.org/document/10879282/
Citations: 0

Abstract

Visible-infrared person re-identification (VI-ReID) is a critical cross-modality fine-grained classification task with significant implications for public safety and security applications. Existing VI-ReID methods primarily focus on extracting modality-invariant features for person retrieval. However, because infrared images inherently lack texture information, these modality-invariant features tend to emphasize global contexts. Consequently, individuals with similar silhouettes are often misidentified, posing potential risks to security systems and forensic investigations. To address this problem, this paper introduces natural language descriptions to learn global-local contexts for VI-ReID. Specifically, we design a framework that jointly optimizes visible-infrared alignment plus (VIAP) and visual-textual reasoning (VTR), introduces a local-global joint measure (LJM) to strengthen the distance metric, and proposes a human-LLM collaborative approach to incorporate textual descriptions into existing cross-modal person re-identification datasets. VIAP achieves cross-modal alignment between RGB and IR: it explicitly uses the designed frequency-aware modality alignment and relationship-reinforced fusion to exploit local cues within global features and modality-invariant information. VTR proposes pooling selection and dual-level reasoning mechanisms that force the image encoder to attend to salient regions indicated by the textual descriptions. LJM introduces local feature distances into the matching-stage metric, using fine-grained information to improve match relevance. Extensive experimental results on the popular SYSU-MM01 and RegDB datasets show that the proposed method significantly outperforms state-of-the-art approaches. The dataset is publicly available at https://github.com/qyx596/vireid-caption.
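The local-global joint measure (LJM) described above augments a global distance with part-level feature distances at the matching stage. As an illustration only (the paper's exact formulation is not reproduced here), a weighted fusion of global and local cosine distances might look like the following sketch; the function name `joint_distance`, the weight `alpha`, and the equal-weight averaging over parts are all hypothetical choices, not details from the paper:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two feature vectors (0 = identical direction)."""
    a = a / (np.linalg.norm(a) + 1e-12)
    b = b / (np.linalg.norm(b) + 1e-12)
    return 1.0 - float(a @ b)

def joint_distance(q_global, g_global, q_parts, g_parts, alpha=0.5):
    """Fuse a global distance with the mean of part-level (local) distances.

    q_parts / g_parts: lists of part feature vectors for query and gallery.
    alpha weighs the local term; alpha=0 recovers a purely global metric.
    """
    d_global = cosine_distance(q_global, g_global)
    d_local = float(np.mean([cosine_distance(qp, gp)
                             for qp, gp in zip(q_parts, g_parts)]))
    return (1.0 - alpha) * d_global + alpha * d_local
```

The intuition matches the abstract: silhouette-alike identities may have near-identical global distances, so the local term lets fine-grained part differences re-rank them.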
Source Journal
IEEE Transactions on Information Forensics and Security
Engineering Technology · Engineering: Electrical & Electronic
CiteScore: 14.40
Self-citation rate: 7.40%
Articles per year: 234
Review time: 6.5 months
Journal description: The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.