Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback

IEEE Transactions on Multimedia | IF 8.4 | Region 1, Computer Science | JCR Q1, Computer Science, Information Systems | Pub Date: 2024-06-21 | DOI: 10.1109/TMM.2024.3417694

Yahui Xu; Yi Bin; Jiwei Wei; Yang Yang; Guoqing Wang; Heng Tao Shen

IEEE Transactions on Multimedia, vol. 26, pp. 9936-9948. Journal Article. https://ieeexplore.ieee.org/document/10568424/

Citations: 0

Abstract

We study image retrieval with text feedback, in which a reference image and a modification text are composed into a query to retrieve the desired target image. Existing methods typically extract multimodal representations with separate feature encoders and then apply various strategies to model the correlation between the composed inputs and the target image. The multimodal query, however, poses two challenges: it requires a synergistic understanding of the semantics of heterogeneous multimodal inputs, and it requires accurately modeling the semantic correlation underlying each input-target triplet, i.e., reference image, modification text, and target image. In this paper, we tackle these issues with a novel Align and Retrieve (AlRet) framework. First, the proposed method applies a contrastive loss in the feature encoders to learn meaningful multimodal representations, so that the subsequent correlation modeling operates in a better-aligned space. Second, we propose to learn an accurate correlation between the composed inputs and the target image through a novel composition-and-decomposition paradigm. Specifically, the composition network couples the reference image and the modification text into a joint representation and learns the correlation between that joint representation and the target image. Conversely, the decomposition network decouples the target image into visual and text subspaces to exploit the underlying correlation between the target image and each query element. Composition and decomposition form a closed loop, and the two networks can be optimized simultaneously so that each improves the other. Extensive comparative experiments on three real-world datasets confirm the effectiveness of the proposed method.
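The closed loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss formula, the fusion operator, or the form of the decomposition, so the batch contrastive loss (`info_nce`), the additive `compose`, the linear `decompose`, and the temperature value are all assumptions chosen for illustration. The sketch only shows how a composition term (joint query vs. target) and two decomposition terms (target projections vs. each query element) might be summed into one objective.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in matrix]


def info_nce(queries, targets, temperature=0.07):
    """Batch contrastive loss: queries[i] should match targets[i];
    every other target in the batch acts as a negative.
    The temperature default is an assumed value, not taken from the paper."""
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, tj) / temperature for tj in t]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(q)


def compose(ref_img, mod_text):
    """Hypothetical composition: element-wise sum of the two query features.
    The paper's composition network is a learned module."""
    return [a + b for a, b in zip(ref_img, mod_text)]


def decompose(target, w_visual, w_text):
    """Hypothetical decomposition: two linear projections of the target
    feature into a visual subspace and a text subspace."""
    return matvec(w_visual, target), matvec(w_text, target)


def closed_loop_loss(ref_imgs, mod_texts, targets, w_visual, w_text):
    """Sum of one composition term and two decomposition terms."""
    composed = [compose(r, m) for r, m in zip(ref_imgs, mod_texts)]
    parts = [decompose(t, w_visual, w_text) for t in targets]
    visual_parts = [p[0] for p in parts]
    text_parts = [p[1] for p in parts]
    return (info_nce(composed, targets)          # joint query vs. target
            + info_nce(visual_parts, ref_imgs)   # target visual part vs. reference image
            + info_nce(text_parts, mod_texts))   # target text part vs. modification text


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    texts = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
    tgts = [compose(r, m) for r, m in zip(refs, texts)]
    print(closed_loop_loss(refs, texts, tgts, identity, identity))
```

In a trainable version the three terms would share gradients through the encoders, which is one way to read the abstract's claim that the two directions can be optimized simultaneously.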

Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months

About the journal: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of its sponsors.