Low-Resource Neural Machine Translation Improvement Using Data Augmentation Strategies

IF 2.8 | Zone 4, Computer Science | Q2 Computer Science, Information Systems | Informatica | Pub Date: 2023-08-29 | DOI: 10.31449/inf.v47i3.4761

Thai Nguyen Quoc, Huong Le Thanh, Hanh Pham Van

Citations: 0

Abstract

Neural models have greatly improved machine translation performance, but they require large-scale parallel data, which is difficult to obtain for low-resource language pairs. To address this issue, this research takes a pre-trained multilingual model and fine-tunes it on a small bilingual dataset. Additionally, two data-augmentation strategies are proposed to generate new training data: (i) back-translation with the dataset from the source language; (ii) data augmentation via English as a pivot language. The proposed approach is applied to Khmer-Vietnamese machine translation. Experimental results show that the proposed approach outperforms the Google Translator model by 5.3% in terms of BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs.
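
The two augmentation strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the translation direction or the pivot corpus, so the sketch assumes that strategy (i) pairs monolingual Khmer sentences with machine translations into Vietnamese, and that strategy (ii) converts Khmer-English pairs into Khmer-Vietnamese pairs by translating the English side. The names `Translator`, `back_translate`, `pivot_augment`, `merge`, and `toy_translator` are hypothetical, and the toy sentences are placeholder tokens rather than real data.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical types: any function that maps a sentence to its translation,
# and a (Khmer sentence, Vietnamese sentence) training pair.
Translator = Callable[[str], str]
Pair = Tuple[str, str]


def back_translate(km_sentences: Iterable[str], km_to_vi: Translator) -> List[Pair]:
    """Strategy (i), as assumed here: pair monolingual Khmer text with a machine translation."""
    return [(km, km_to_vi(km)) for km in km_sentences if km.strip()]


def pivot_augment(km_en_pairs: Iterable[Pair], en_to_vi: Translator) -> List[Pair]:
    """Strategy (ii), as assumed here: translate the English side of Khmer-English pairs."""
    return [(km, en_to_vi(en)) for km, en in km_en_pairs if km.strip() and en.strip()]


def merge(*datasets: Iterable[Pair]) -> List[Pair]:
    """Concatenate datasets, dropping exact duplicate pairs and keeping first-seen order."""
    seen = set()
    merged: List[Pair] = []
    for dataset in datasets:
        for pair in dataset:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


def toy_translator(table: Dict[str, str]) -> Translator:
    """Stand-in for a real translation model: a plain lookup table."""
    return lambda sentence: table[sentence]


# Placeholder tokens only; a real pipeline would use actual corpora and models.
seed = [("km_hello", "vi_hello")]
synthetic_bt = back_translate(
    ["km_thanks", "  "],  # the blank sentence is skipped
    toy_translator({"km_thanks": "vi_thanks"}),
)
synthetic_pivot = pivot_augment(
    [("km_hello", "en_hello"), ("km_water", "en_water")],
    toy_translator({"en_hello": "vi_hello", "en_water": "vi_water"}),
)
train_set = merge(seed, synthetic_bt, synthetic_pivot)
print(train_set)
```

In this sketch the merged set is what would be passed to fine-tuning. Synthetic pairs that duplicate a seed pair are dropped so they are not over-weighted; whether the original work deduplicates or filters synthetic data is not stated in the abstract.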




Source journal: Informatica (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 5.90

Self-citation rate: 6.90%

Review time: 12 months

Journal description: The quarterly journal Informatica provides an international forum for high-quality original research and publishes papers on mathematical simulation and optimization, recognition and control, programming theory and systems, and automation systems and elements. Informatica provides a multidisciplinary forum for scientists and engineers involved in research and design, including experts who implement and manage information systems applications.