文本增强型多模态对比学习用于无监督可见红外人员再识别

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Image and Vision Computing Pub Date : 2024-11-05 DOI:10.1016/j.imavis.2024.105310

Rui Sun , Guoxi Huang , Xuebin Wang , Yun Du , Xudong Zhang

{"title":"文本增强型多模态对比学习用于无监督可见红外人员再识别","authors":"Rui Sun , Guoxi Huang , Xuebin Wang , Yun Du , Xudong Zhang","doi":"10.1016/j.imavis.2024.105310","DOIUrl":null,"url":null,"abstract":"<div><div>Visible-infrared person re-identification holds significant implications for intelligent security. Unsupervised methods can reduce the gap of different modalities without labels. Most previous unsupervised methods only train their models with image information, so that the model cannot obtain powerful deep semantic information. In this paper, we leverage CLIP to extract deep text information. We propose a Text–Image Alignment (TIA) module to align the image and text information and effectively bridge the gap between visible and infrared modality. We produce a Local–Global Image Match (LGIM) module to find homogeneous information. Specifically, we employ the Hungarian algorithm and Simulated Annealing (SA) algorithm to attain original information from image features while mitigating the interference of heterogeneous information. Additionally, we design a Changeable Cross-modality Alignment Loss (CCAL) to enable the model to learn modality-specific features during different training stages. Our method performs well and attains powerful robustness by targeted learning. Extensive experiments demonstrate the effectiveness of our approach, our method achieves a rank-1 accuracy that exceeds state-of-the-art approaches by approximately 10% on the RegDB.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105310"},"PeriodicalIF":4.2000,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Text-augmented Multi-Modality contrastive learning for unsupervised visible-infrared person re-identification\",\"authors\":\"Rui Sun , Guoxi Huang , Xuebin Wang , Yun Du , Xudong Zhang\",\"doi\":\"10.1016/j.imavis.2024.105310\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Visible-infrared person re-identification holds significant implications for intelligent security. Unsupervised methods can reduce the gap of different modalities without labels. Most previous unsupervised methods only train their models with image information, so that the model cannot obtain powerful deep semantic information. In this paper, we leverage CLIP to extract deep text information. We propose a Text–Image Alignment (TIA) module to align the image and text information and effectively bridge the gap between visible and infrared modality. We produce a Local–Global Image Match (LGIM) module to find homogeneous information. Specifically, we employ the Hungarian algorithm and Simulated Annealing (SA) algorithm to attain original information from image features while mitigating the interference of heterogeneous information. Additionally, we design a Changeable Cross-modality Alignment Loss (CCAL) to enable the model to learn modality-specific features during different training stages. Our method performs well and attains powerful robustness by targeted learning. Extensive experiments demonstrate the effectiveness of our approach, our method achieves a rank-1 accuracy that exceeds state-of-the-art approaches by approximately 10% on the RegDB.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"152 \",\"pages\":\"Article 105310\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2024-11-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885624004153\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624004153","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

可见光-红外人员再识别技术对智能安防具有重要意义。无监督方法可以缩小不同模态无标签的差距。以往的大多数无监督方法仅使用图像信息训练模型，因此模型无法获得强大的深层语义信息。在本文中，我们利用 CLIP 来提取深度文本信息。我们提出了文本-图像对齐（TIA）模块，以对齐图像和文本信息，有效弥合可见光和红外模式之间的差距。我们制作了一个局部-全局图像匹配（LGIM）模块来查找同质信息。具体来说，我们采用匈牙利算法和模拟退火（SA）算法，从图像特征中获取原始信息，同时减少异构信息的干扰。此外，我们还设计了一种可变跨模态对齐损失（CCAL），使模型能够在不同的训练阶段学习特定模态的特征。我们的方法性能良好，并通过有针对性的学习获得了强大的鲁棒性。广泛的实验证明了我们方法的有效性，我们的方法在 RegDB 上的排名-1 准确率比最先进的方法高出约 10%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Text-augmented Multi-Modality contrastive learning for unsupervised visible-infrared person re-identification

Visible-infrared person re-identification holds significant implications for intelligent security. Unsupervised methods can reduce the gap of different modalities without labels. Most previous unsupervised methods only train their models with image information, so that the model cannot obtain powerful deep semantic information. In this paper, we leverage CLIP to extract deep text information. We propose a Text–Image Alignment (TIA) module to align the image and text information and effectively bridge the gap between visible and infrared modality. We produce a Local–Global Image Match (LGIM) module to find homogeneous information. Specifically, we employ the Hungarian algorithm and Simulated Annealing (SA) algorithm to attain original information from image features while mitigating the interference of heterogeneous information. Additionally, we design a Changeable Cross-modality Alignment Loss (CCAL) to enable the model to learn modality-specific features during different training stages. Our method performs well and attains powerful robustness by targeted learning. Extensive experiments demonstrate the effectiveness of our approach, our method achieves a rank-1 accuracy that exceeds state-of-the-art approaches by approximately 10% on the RegDB.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Image and Vision Computing 工程技术-工程：电子与电气

CiteScore

8.50

自引率

8.50%

发文量

143

审稿时长

7.8 months

期刊介绍： Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.