{"title":"HybridVocab:通过多角度对齐实现多模态机器翻译","authors":"Ru Peng, Yawen Zeng, J. Zhao","doi":"10.1145/3512527.3531386","DOIUrl":null,"url":null,"abstract":"Multi-modal machine translation (MMT) aims to augment the linguistic machine translation frameworks by incorporating aligned vision information. As the core research challenge for MMT, how to fuse the image information and further align it with the bilingual data remains critical. Existing works have either focused on a methodological alignment in the space of bilingual text or emphasized the combination of the one-sided text and given image. In this work, we entertain the possibility of a triplet alignment, among the source and target text together with the image instance. In particular, we propose Multi-aspect AlignmenT (MAT) model that augments the MMT tasks to three sub-tasks --- namely cross-language translation alignment, cross-modal captioning alignment and multi-modal hybrid alignment tasks. Core to this model consists of a hybrid vocabulary which compiles the visually depictable entity (nouns) occurrence on both sides of the text as well as the detected object labels appearing in the images. Through this sub-task, we postulate that MAT manages to further align the modalities by casting three instances into a shared domain, as compared against previously proposed methods. Extensive experiments and analyses demonstrate the superiority of our approaches, which achieve several state-of-the-art results on two benchmark datasets of the MMT task.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"HybridVocab: Towards Multi-Modal Machine Translation via Multi-Aspect Alignment\",\"authors\":\"Ru Peng, Yawen Zeng, J. Zhao\",\"doi\":\"10.1145/3512527.3531386\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multi-modal machine translation (MMT) aims to augment the linguistic machine translation frameworks by incorporating aligned vision information. As the core research challenge for MMT, how to fuse the image information and further align it with the bilingual data remains critical. Existing works have either focused on a methodological alignment in the space of bilingual text or emphasized the combination of the one-sided text and given image. In this work, we entertain the possibility of a triplet alignment, among the source and target text together with the image instance. In particular, we propose Multi-aspect AlignmenT (MAT) model that augments the MMT tasks to three sub-tasks --- namely cross-language translation alignment, cross-modal captioning alignment and multi-modal hybrid alignment tasks. Core to this model consists of a hybrid vocabulary which compiles the visually depictable entity (nouns) occurrence on both sides of the text as well as the detected object labels appearing in the images. Through this sub-task, we postulate that MAT manages to further align the modalities by casting three instances into a shared domain, as compared against previously proposed methods. 
Extensive experiments and analyses demonstrate the superiority of our approaches, which achieve several state-of-the-art results on two benchmark datasets of the MMT task.\",\"PeriodicalId\":179895,\"journal\":{\"name\":\"Proceedings of the 2022 International Conference on Multimedia Retrieval\",\"volume\":\"34 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-06-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 International Conference on Multimedia Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3512527.3531386\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3512527.3531386","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
HybridVocab: Towards Multi-Modal Machine Translation via Multi-Aspect Alignment
Multi-modal machine translation (MMT) aims to augment text-only machine translation frameworks by incorporating aligned visual information. The core research challenge for MMT remains how to fuse image information and align it with the bilingual data. Existing works have either focused on alignment within the bilingual text space or emphasized combining one side of the text with the given image. In this work, we explore a triplet alignment among the source text, the target text, and the image instance. In particular, we propose the Multi-aspect AlignmenT (MAT) model, which decomposes the MMT task into three sub-tasks: cross-language translation alignment, cross-modal captioning alignment, and multi-modal hybrid alignment. At the core of this model is a hybrid vocabulary that compiles the visually depictable entities (nouns) occurring on both sides of the text, together with the object labels detected in the images. Through this hybrid alignment sub-task, we postulate that MAT further aligns the modalities by casting the three instances into a shared domain, compared with previously proposed methods. Extensive experiments and analyses demonstrate the superiority of our approach, which achieves several state-of-the-art results on two benchmark datasets for the MMT task.
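To make the hybrid-vocabulary idea concrete, the following is a minimal sketch of how such a vocabulary could be compiled from a POS-tagged parallel corpus plus per-image object-detector labels. The tagged-sentence format, the extract_nouns helper, the min_count threshold, and the toy En-De example are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

# Universal POS tags treated as candidate "visually depictable" entities
# (an assumption; the paper does not specify the tag set).
NOUN_TAGS = {"NOUN", "PROPN"}

def extract_nouns(tagged_sentence):
    """Keep the noun tokens of a POS-tagged sentence given as [(word, tag), ...]."""
    return [word.lower() for word, tag in tagged_sentence if tag in NOUN_TAGS]

def build_hybrid_vocab(src_tagged, tgt_tagged, detected_labels, min_count=1):
    """Compile nouns from both text sides plus object labels detected in images.

    src_tagged / tgt_tagged: lists of POS-tagged sentences (one per example).
    detected_labels: per-image lists of object-detector class names.
    min_count: frequency cutoff for admitting a token (hypothetical knob).
    """
    counts = Counter()
    for sentence in src_tagged:          # source-side visually depictable nouns
        counts.update(extract_nouns(sentence))
    for sentence in tgt_tagged:          # target-side visually depictable nouns
        counts.update(extract_nouns(sentence))
    for labels in detected_labels:       # detected object labels from the images
        counts.update(label.lower() for label in labels)
    return {token for token, count in counts.items() if count >= min_count}

# Toy example: one Multi30K-style En-De pair with hypothetical tags and labels.
src = [[("a", "DET"), ("dog", "NOUN"), ("chases", "VERB"), ("a", "DET"), ("ball", "NOUN")]]
tgt = [[("ein", "DET"), ("Hund", "NOUN"), ("jagt", "VERB"), ("einen", "DET"), ("Ball", "NOUN")]]
labels = [["dog", "sports ball"]]

print(sorted(build_hybrid_vocab(src, tgt, labels)))
# -> ['ball', 'dog', 'hund', 'sports ball']
```

In this sketch the shared vocabulary simply pools the three sources into one symbol set; in the model it would serve as the prediction space for the multi-modal hybrid alignment sub-task, so that source text, target text, and image are supervised against a common domain.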