Improving What Cross-Modal Retrieval Models Learn through Object-Oriented Inter- and Intra-Modal Attention Networks

Po-Yao (Bernie) Huang, Vaibhav, Xiaojun Chang, Alexander Hauptmann
{"title":"Improving What Cross-Modal Retrieval Models Learn through Object-Oriented Inter- and Intra-Modal Attention Networks","authors":"Po-Yao (Bernie) Huang, Vaibhav, Xiaojun Chang, Alexander Hauptmann","doi":"10.1145/3323873.3325043","DOIUrl":null,"url":null,"abstract":"Although significant progress has been made for cross-modal retrieval models in recent years, few have explored what those models truly learn and what makes one model superior to another. Start by training two state-of-the-art text-to-image retrieval models with adversarial text inputs, we investigate and quantify the importance of syntactic structure and lexical information in learning the joint visual-semantic embedding space for cross-modal retrieval. The results show that the retrieval power mainly comes from localizing and connecting the visual objects and their cross-modal counter-parts, the textual phrases. Inspired by this observation, we propose a novel model which employs object-oriented encoders along with inter- and intra-modal attention networks to improve inter-modal dependencies for cross-modal retrieval. In addition, we develop a new multimodal structure-preserving objective which additionally emphasizes intra-modal hard negative examples to promote intra-modal discrepancies. Extensive experiments show that the proposed approach outperforms the existing best method by a large margin (16.4% and 6.7% relatively with Recall@1 in the text-to-image retrieval task on the Flickr30K dataset and the MS-COCO dataset respectively).","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3323873.3325043","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

Although significant progress has been made on cross-modal retrieval models in recent years, few studies have explored what those models truly learn and what makes one model superior to another. Starting by training two state-of-the-art text-to-image retrieval models with adversarial text inputs, we investigate and quantify the importance of syntactic structure and lexical information in learning the joint visual-semantic embedding space for cross-modal retrieval. The results show that the retrieval power mainly comes from localizing and connecting the visual objects and their cross-modal counterparts, the textual phrases. Inspired by this observation, we propose a novel model which employs object-oriented encoders along with inter- and intra-modal attention networks to improve inter-modal dependencies for cross-modal retrieval. In addition, we develop a new multimodal structure-preserving objective which additionally emphasizes intra-modal hard negative examples to promote intra-modal discrepancies. Extensive experiments show that the proposed approach outperforms the existing best method by a large margin (16.4% and 6.7% relative improvement in Recall@1 on the text-to-image retrieval task on the Flickr30K and MS-COCO datasets, respectively).
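To make the "multimodal structure-preserving objective with intra-modal hard negatives" concrete, below is a minimal PyTorch sketch of one plausible instantiation: a max-of-hinges (hardest-negative) ranking loss over inter-modal image-text similarities, augmented with intra-modal terms that push each embedding away from its hardest same-modality negative. The abstract does not give the exact formulation, so the margins, weighting, choice of positive score for the intra-modal terms, and all function names here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def hard_negative_hinge(scores: torch.Tensor, margin: float) -> torch.Tensor:
    """Max-of-hinges ranking loss over a batch similarity matrix.

    scores[i, j] is the similarity between anchor i and candidate j;
    the diagonal holds the matching (positive) pairs.
    """
    batch_size = scores.size(0)
    positives = scores.diag().view(batch_size, 1)
    # Hinge against every candidate, then zero out the positives themselves.
    eye = torch.eye(batch_size, dtype=torch.bool, device=scores.device)
    cost = (margin + scores - positives).clamp(min=0).masked_fill(eye, 0)
    # Keep only the hardest negative per anchor (VSE++-style).
    return cost.max(dim=1)[0].mean()


def structure_preserving_loss(img_emb, txt_emb, margin=0.2, intra_weight=0.5):
    """Inter-modal hard-negative loss plus intra-modal hard-negative terms.

    img_emb, txt_emb: (batch, dim) embeddings of matching image-text pairs.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    inter = img_emb @ txt_emb.t()                       # image-to-text similarities
    loss = hard_negative_hinge(inter, margin)           # image anchors vs. text negatives
    loss = loss + hard_negative_hinge(inter.t(), margin)  # text anchors vs. image negatives

    # Intra-modal terms: require the matched cross-modal pair to score higher
    # than the hardest same-modality negative (other images for an image anchor,
    # other texts for a text anchor). Using the paired cross-modal similarity as
    # the "positive" score is a simplifying assumption for this sketch.
    pos = (img_emb * txt_emb).sum(dim=-1, keepdim=True)   # (batch, 1)
    eye = torch.eye(img_emb.size(0), dtype=torch.bool, device=img_emb.device)
    intra = img_emb.new_zeros(())
    for sims in (img_emb @ img_emb.t(), txt_emb @ txt_emb.t()):
        cost = (margin + sims - pos).clamp(min=0).masked_fill(eye, 0)
        intra = intra + cost.max(dim=1)[0].mean()
    return loss + intra_weight * intra
```

Usage is the same as a standard ranking loss: compute batch embeddings from the image and text encoders and call `structure_preserving_loss(img_emb, txt_emb)` in the training step; the `intra_weight` knob (a hypothetical hyperparameter) trades off inter- versus intra-modal discrepancy.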