基于数据增强策略的低资源神经机器翻译改进

IF 3.3 4区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Informatica Pub Date : 2023-08-29 DOI:10.31449/inf.v47i3.4761

Thai Nguyen Quoc, Huong Le Thanh, Hanh Pham Van

{"title":"基于数据增强策略的低资源神经机器翻译改进","authors":"Thai Nguyen Quoc, Huong Le Thanh, Hanh Pham Van","doi":"10.31449/inf.v47i3.4761","DOIUrl":null,"url":null,"abstract":"The development of neural models has greatly improved the performance of machine translation, but these methods require large-scale parallel data, which can be difficult to obtain for low-resource language pairs. To address this issue, this research employs a pre-trained multilingual model and fine-tunes it by using a small bilingual dataset. Additionally, two data-augmentation strategies are proposed to generate new training data: (i) back-translation with the dataset from the source language; (ii) data augmentation via the English pivot language. The proposed approach is applied to the Khmer-Vietnamese machine translation. Experimental results show that our proposed approach outperforms the Google Translator model by 5.3% in terms of BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs.","PeriodicalId":56292,"journal":{"name":"Informatica","volume":"22 1","pages":"0"},"PeriodicalIF":3.3000,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Low-Resource Neural Machine Translation Improvement Using Data Augmentation Strategies\",\"authors\":\"Thai Nguyen Quoc, Huong Le Thanh, Hanh Pham Van\",\"doi\":\"10.31449/inf.v47i3.4761\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The development of neural models has greatly improved the performance of machine translation, but these methods require large-scale parallel data, which can be difficult to obtain for low-resource language pairs. To address this issue, this research employs a pre-trained multilingual model and fine-tunes it by using a small bilingual dataset. Additionally, two data-augmentation strategies are proposed to generate new training data: (i) back-translation with the dataset from the source language; (ii) data augmentation via the English pivot language. The proposed approach is applied to the Khmer-Vietnamese machine translation. Experimental results show that our proposed approach outperforms the Google Translator model by 5.3% in terms of BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs.\",\"PeriodicalId\":56292,\"journal\":{\"name\":\"Informatica\",\"volume\":\"22 1\",\"pages\":\"0\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2023-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Informatica\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.31449/inf.v47i3.4761\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Informatica","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31449/inf.v47i3.4761","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

神经模型的发展极大地提高了机器翻译的性能，但这些方法需要大规模的并行数据，而对于低资源的语言对，这些数据很难获得。为了解决这个问题，本研究采用了一个预训练的多语言模型，并通过使用一个小型双语数据集对其进行微调。此外，提出了两种数据增强策略来生成新的训练数据:(i)从源语言反翻译数据集;(ii)通过英语支点语言进行数据增强。将该方法应用于高棉-越南语的机器翻译。实验结果表明，在2000个高棉-越南语句子对的测试集上，我们提出的方法在BLEU得分方面比Google Translator模型高出5.3%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Low-Resource Neural Machine Translation Improvement Using Data Augmentation Strategies

The development of neural models has greatly improved the performance of machine translation, but these methods require large-scale parallel data, which can be difficult to obtain for low-resource language pairs. To address this issue, this research employs a pre-trained multilingual model and fine-tunes it by using a small bilingual dataset. Additionally, two data-augmentation strategies are proposed to generate new training data: (i) back-translation with the dataset from the source language; (ii) data augmentation via the English pivot language. The proposed approach is applied to the Khmer-Vietnamese machine translation. Experimental results show that our proposed approach outperforms the Google Translator model by 5.3% in terms of BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Informatica 工程技术-计算机：信息系统

CiteScore

5.90

自引率

6.90%

发文量

审稿时长

12 months

期刊介绍： The quarterly journal Informatica provides an international forum for high-quality original research and publishes papers on mathematical simulation and optimization, recognition and control, programming theory and systems, automation systems and elements. Informatica provides a multidisciplinary forum for scientists and engineers involved in research and design including experts who implement and manage information systems applications.