Shike Wang, Wen Zhang, Wenyu Guo, Dong Yu, Pengyuan Liu
2022 International Joint Conference on Neural Networks (IJCNN). Published 2022-07-18. DOI: 10.1109/IJCNN55064.2022.9892312 (https://doi.org/10.1109/IJCNN55064.2022.9892312)
Contrastive Learning Based Visual Representation Enhancement for Multimodal Machine Translation
Multimodal machine translation (MMT) is a translation task that incorporates an additional image modality alongside the source text. Previous work has studied the interaction between the two modalities and investigated whether the visual modality is needed. However, few works focus on providing models with better and more effective visual representations as input. We argue that the performance of MMT systems improves when better visual representations are fed into them. To investigate this idea, we introduce mT-ICL, a multimodal Transformer model with image contrastive learning. The contrastive objective is optimized to enhance the representation ability of the image encoder so that it generates better and more adaptive visual representations. Experiments show that mT-ICL significantly outperforms a strong baseline and achieves new state-of-the-art results on most test sets for English-to-German and English-to-French. Further analysis reveals that, under the contrastive learning framework, the visual modality acts as more than a regularization method.
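The abstract does not state the exact form of the image contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss, and should not be read as the mT-ICL loss itself. It assumes each image has two feature vectors (for example, from two augmented views), treats the matching pair as the positive, and uses the other images in the batch as negatives. The names `cosine` and `info_nce` and the `temperature` value are hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss over a batch (assumed form, not from the paper).

    anchors[i] and positives[i] are two feature vectors of the same image;
    positives[j] with j != i act as in-batch negatives for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / len(anchors)


# Toy features: aligned pairs should give a much lower loss than mismatched ones.
views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
mismatched = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(views_a, aligned)
loss_mismatched = info_nce(views_a, mismatched)
```

Minimizing a loss of this kind pulls the two representations of the same image together and pushes representations of different images apart, which is one way an image encoder could be trained to produce more discriminative visual features before they are passed to the translation model.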