Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval

ArXiv. Publication date: 2024-03-08. DOI: 10.1609/aaai.v38i16.29789

Hailang Huang, Zhijie Nie, Ziqiao Wang, Ziyu Shang



Abstract

Image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method consistently improves the performance of image-text retrieval and achieves new state-of-the-art results. Furthermore, our method also boosts the uni-modal retrieval performance of image-text retrieval models, enabling them to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
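As a rough illustration of the soft-label idea in the abstract, the sketch below turns a "teacher" similarity matrix (standing in for similarities from a uni-modal pre-trained model) into soft labels and measures how far a "student" similarity matrix (standing in for the retrieval model being trained) is from them. This is not the authors' implementation, which is in the repository linked above. The function names, the KL-divergence objective, the `temperature` value, and the toy numbers are all assumptions made for illustration.

```python
import math


def softmax(row, temperature=1.0):
    """Turn one row of similarity scores into a probability distribution."""
    scaled = [x / temperature for x in row]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def soft_label_alignment_loss(teacher_sim, student_sim, temperature=0.1):
    """Mean row-wise KL(teacher || student) over two similarity matrices.

    teacher_sim: similarities assumed to come from a uni-modal model.
    student_sim: similarities assumed to come from the retrieval model.
    temperature: hypothetical sharpening factor, not taken from the paper.
    """
    losses = []
    for t_row, s_row in zip(teacher_sim, student_sim):
        soft_labels = softmax(t_row, temperature)
        predictions = softmax(s_row, temperature)
        losses.append(kl_divergence(soft_labels, predictions))
    return sum(losses) / len(losses)


# Toy 2x2 example with made-up scores: rows are queries, columns are candidates.
teacher = [[1.0, 0.8], [0.8, 1.0]]  # teacher sees the two items as similar
student = [[1.0, 0.1], [0.1, 1.0]]  # student treats the off-diagonal as a hard negative
loss = soft_label_alignment_loss(teacher, student)
```

In this toy setup the loss is positive because the student scores the off-diagonal pair much lower than the teacher does, which is the kind of false negative that soft labels are meant to soften. Under the same assumptions, a cross-modal variant would use image-to-text similarities for the student matrix, and a uni-modal variant would use image-to-image or text-to-text similarities.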


