Cross-modal independent matching network for image-text retrieval

Pattern Recognition | IF 7.6 | Zone 1, Computer Science | JCR Q1, Computer Science, Artificial Intelligence | Pub Date: 2024-10-29 | DOI: 10.1016/j.patcog.2024.111096

Xiao Ke, Baitao Chen, Xiong Yang, Yuhang Cai, Hao Liu, Wenzhong Guo

Pattern Recognition, Volume 159, Article 111096. Journal Article. https://www.sciencedirect.com/science/article/pii/S0031320324008471

Citations: 0

Abstract

Image-text retrieval serves as a bridge between vision and language. Mainstream cross-matching methods model cross-modal interactions effectively and achieve high theoretical performance, but they are inefficient. Independent matching methods are far more efficient but less accurate. Balancing matching efficiency and performance is therefore a challenge in image-text retrieval. In this paper, we propose a new Cross-modal Independent Matching Network (CIMN) for image-text retrieval. Specifically, we first use the proposed Feature Relationship Reasoning (FRR) to infer neighborhood and potential relations among modal features. Then, we introduce Graph Pooling (GP), based on graph convolutional networks, to aggregate the global semantics of each modality. Finally, we introduce the Gravitation Loss (GL), which incorporates sample mass into the learning process. This loss corrects the matching relationships between and within modalities and avoids the equal treatment of all samples found in the traditional triplet loss. Extensive experiments on the Flickr30K and MSCOCO datasets demonstrate the superiority of the proposed method: it achieves a good balance between matching efficiency and performance, outperforms other independent matching methods, and obtains retrieval accuracy comparable to some mainstream cross-matching methods with an order of magnitude lower inference time.
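To make the abstract's two central ideas concrete, the following is a minimal sketch, not the authors' implementation. It assumes that independent matching means each image and each text is embedded once and then compared by cosine similarity, and it uses the common hinge-based, hardest-negative form of the triplet loss. The abstract does not define the Gravitation Loss, so the optional `weights` argument is only a hypothetical stand-in for a per-sample "mass"; all function names and the margin value are illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_matrix(image_embs, text_embs):
    """Return sim[i][j], the cosine similarity of image i and text j.

    Each embedding is normalized once and reused for every pair, which is
    where independent matching saves time compared with running a joint
    model on every image-text pair.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def rank_texts(sim, image_index):
    """Text indices sorted from most to least similar to the given image."""
    row = sim[image_index]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)


def triplet_loss(sim, margin=0.2, weights=None):
    """Bidirectional hinge triplet loss with hardest negatives.

    Assumes a square batch of at least two pairs where sim[i][i] is the
    matched pair. With weights=None every sample counts equally, which is
    the behavior the abstract criticizes. A non-uniform `weights` list is a
    hypothetical illustration of per-sample weighting, not the paper's
    Gravitation Loss.
    """
    n = len(sim)
    if weights is None:
        weights = [1.0] * n
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        image_to_text = max(0.0, margin + hardest_text - pos)
        text_to_image = max(0.0, margin + hardest_image - pos)
        total += weights[i] * (image_to_text + text_to_image)
    return total / n
```

In this sketch, retrieval over a gallery reduces to one call to `cosine_matrix` followed by `rank_texts`, so the cost of the encoders is paid once per item rather than once per pair.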




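Source journal

Pattern Recognition (Engineering Technology; Engineering: Electrical & Electronic)

CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months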

Journal description: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.

Latest articles from this journal

Editorial Board
Contrastive calibration on consensus and complementary multi-view representations
Adversarial supervised contrastive feature learning for cross-modal retrieval
A visual-textual mutual guidance fusion network for remote sensing visual question answering
Generalizable face forgery detection via mining single-step reconstruction difference