Context-aware relation enhancement and similarity reasoning for image-text retrieval

IF 1.3 · Zone 4, Computer Science · Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · IET Computer Vision · Pub Date: 2024-01-30 · DOI: 10.1049/cvi2.12270

Zheng Cui, Yongli Hu, Yanfeng Sun, Baocai Yin

IET Computer Vision, volume 18, issue 5, pages 652-665. Full text: https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cvi2.12270

Citations: 0

Abstract

Image-text retrieval is a fundamental yet challenging task that aims to bridge the semantic gap between heterogeneous data and to measure semantic similarity precisely. Fine-grained alignment between cross-modal features plays a key role in many successful methods. Nevertheless, existing methods do not effectively utilise intra-modal information to enhance feature representation, and they lack the powerful similarity reasoning needed to obtain a precise similarity score. To tackle these issues, a context-aware Relation Enhancement and Similarity Reasoning model, called RESR, is proposed, which conducts both intra-modal relation enhancement and inter-modal similarity reasoning while taking global-context information into account. For intra-modal relation enhancement, a novel context-aware graph convolutional network is introduced to enhance local feature representations by utilising relation and global-context information. For inter-modal similarity reasoning, local and global similarity features are extracted through the bidirectional alignment of image and text, and similarity reasoning is performed among the multi-granularity similarity features. Finally, the refined local and global similarity features are adaptively fused to produce a precise similarity score. Experimental results show that the model outperforms several state-of-the-art approaches, achieving average improvements of 2.5% and 6.3% in R@sum on the Flickr30K and MS-COCO datasets, respectively.
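
The abstract reports gains in R@sum without defining it. In the image-text retrieval literature, R@sum is conventionally the sum of Recall@1, Recall@5 and Recall@10 over both retrieval directions (image-to-text and text-to-image), so the maximum is 600. The sketch below assumes that convention; the function names `recall_at_k` and `r_at_sum` and the data layout are hypothetical and are not taken from the paper.

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries with at least one correct candidate in the top k.

    sim[q][c] is the similarity between query q and candidate c.
    gt[q] is the set of correct candidate indices for query q.
    """
    hits = 0
    for q, row in enumerate(sim):
        # Rank candidates from most to least similar.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if any(c in gt[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


def r_at_sum(sim_i2t, gt_i2t, ks=(1, 5, 10)):
    """Sum Recall@K over image-to-text and text-to-image retrieval.

    sim_i2t[i][t] is the similarity between image i and text t.
    gt_i2t[i] is the set of text indices that describe image i.
    """
    n_img, n_txt = len(sim_i2t), len(sim_i2t[0])
    # Text-to-image uses the transposed similarity matrix and inverted labels.
    sim_t2i = [[sim_i2t[i][t] for i in range(n_img)] for t in range(n_txt)]
    gt_t2i = [{i for i in range(n_img) if t in gt_i2t[i]} for t in range(n_txt)]
    total = 0.0
    for k in ks:
        total += recall_at_k(sim_i2t, gt_i2t, k)
        total += recall_at_k(sim_t2i, gt_t2i, k)
    return total
```

Calling `r_at_sum(sim, gt)` on a matrix of image-text similarity scores gives the single aggregate figure; how the paper averages its reported improvements across test splits is not stated in the abstract.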

Source journal

IET Computer Vision (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 3.30

Self-citation rate: 11.80%

Review time: 3.4 months

Journal description

IET Computer Vision seeks original research papers in a wide range of areas of computer vision. The journal aims to publish the highest quality research that is relevant and topical to the field, without neglecting work that introduces new horizons and sets the agenda for future avenues of research in computer vision.

IET Computer Vision welcomes submissions on the following topics: biologically and perceptually motivated approaches to low-level vision (feature detection, etc.); perceptual grouping and organisation; representation, analysis and matching of 2D and 3D shape; shape-from-X; object recognition; image understanding; learning with visual inputs; motion analysis and object tracking; multiview scene analysis; cognitive approaches in low-, mid- and high-level vision; control in visual systems; colour, reflectance and light; statistical and probabilistic models; face and gesture; surveillance; biometrics and security; robotics; vehicle guidance; automatic model acquisition; medical image analysis and understanding; aerial scene analysis and remote sensing; deep learning models in computer vision. Both methodological and application-oriented papers are welcome.

Manuscripts are expected to include a detailed and analytical review of the literature, a state-of-the-art exposition of the proposed research and its methodology, a thorough experimental evaluation and a comparative evaluation against relevant state-of-the-art methods. Submissions that do not meet these minimum requirements may be returned to authors without being sent for review.

Special Issues, Current Call for Papers:

Computer Vision for Smart Cameras and Camera Networks - https://digital-library.theiet.org/files/IET_CVI_SC.pdf

Computer Vision for the Creative Industries - https://digital-library.theiet.org/files/IET_CVI_CVCI.pdf