Forward and Backward Multimodal NMT for Improved Monolingual and Multilingual Cross-Modal Retrieval
Po-Yao (Bernie) Huang, Xiaojun Chang, Alexander Hauptmann, E. Hovy
Proceedings of the 2020 International Conference on Multimedia Retrieval, 2020-06-08
DOI: 10.1145/3372278.3390674 (https://doi.org/10.1145/3372278.3390674)
Citations: 7
Abstract
We explore methods to enrich the diversity of captions associated with pictures for learning improved visual-semantic embeddings (VSE) in cross-modal retrieval. In the spirit of "A picture is worth a thousand words", it would take dozens of sentences to describe each picture's content adequately. In practice, however, real-world multimodal datasets tend to provide only a few (typically five) descriptions per image. For cross-modal retrieval, the resulting lack of diversity and coverage prevents systems from capturing the fine-grained inter-modal dependencies and intra-modal diversities in the shared VSE space. Exploiting the capacity of encoder-decoder architectures in neural machine translation (NMT) to enrich both monolingual and multilingual textual diversity, we propose a novel framework that leverages multimodal neural machine translation (MMT) to perform forward and backward translations based on salient visual objects, generating additional text-image pairs that enable training improved monolingual cross-modal retrieval (English-Image) and multilingual cross-modal retrieval (English-Image and German-Image) models. Experimental results show that the proposed framework substantially and consistently improves the performance of state-of-the-art models on multiple datasets. The results also suggest that models with multilingual VSE outperform models with monolingual VSE.
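To make the data-augmentation idea in the abstract concrete, the sketch below illustrates one plausible reading of it: each English caption is forward-translated into German with a multimodal MT model and then back-translated into English, and both synthetic captions are added as extra text-image pairs for VSE training. This is a minimal illustration, not the authors' implementation; the `TranslateFn` callables, `CaptionImagePair` class, and `augment_pairs` helper are hypothetical placeholders, and the toy lambdas at the bottom exist only to make the example runnable.

```python
"""Minimal sketch of caption augmentation via forward/backward multimodal
translation (assumed reading of the abstract, not the authors' code)."""

from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CaptionImagePair:
    image_id: str   # identifier of the paired image (e.g., a file name)
    caption: str    # caption text
    language: str   # "en" or "de"


# Hypothetical signature: an MMT system that translates a caption while
# conditioning on (salient objects of) the paired image.
TranslateFn = Callable[[str, str], str]  # (caption, image_id) -> translation


def augment_pairs(
    pairs: Sequence[CaptionImagePair],
    forward_en_de: TranslateFn,
    backward_de_en: TranslateFn,
) -> List[CaptionImagePair]:
    """Return the original pairs plus synthetic German captions (for
    multilingual VSE) and back-translated English paraphrases (for
    monolingual VSE), enlarging caption diversity per image."""
    augmented: List[CaptionImagePair] = list(pairs)
    for p in pairs:
        if p.language != "en":
            continue
        # Forward translation: English caption -> German caption.
        de_caption = forward_en_de(p.caption, p.image_id)
        augmented.append(CaptionImagePair(p.image_id, de_caption, "de"))
        # Backward translation: German caption -> English paraphrase.
        en_paraphrase = backward_de_en(de_caption, p.image_id)
        augmented.append(CaptionImagePair(p.image_id, en_paraphrase, "en"))
    return augmented


if __name__ == "__main__":
    # Toy stand-ins for the MMT model, used only to make the sketch runnable.
    fake_en_de = lambda caption, image_id: f"[de] {caption}"
    fake_de_en = lambda caption, image_id: caption.replace("[de] ", "[en paraphrase] ")

    data = [CaptionImagePair("img_001.jpg", "a dog runs on the beach", "en")]
    for pair in augment_pairs(data, fake_en_de, fake_de_en):
        print(pair)
```

In this reading, the augmented list of text-image pairs would simply be fed to whatever VSE training procedure is used downstream; the retrieval model itself is unchanged, which is what lets the framework plug into existing state-of-the-art models.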