Low-Resource Neural Machine Translation Improvement Using Data Augmentation Strategies

IF 3.3 4区 计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Informatica Pub Date : 2023-08-29 DOI:10.31449/inf.v47i3.4761
Thai Nguyen Quoc, Huong Le Thanh, Hanh Pham Van
{"title":"Low-Resource Neural Machine Translation Improvement Using Data Augmentation Strategies","authors":"Thai Nguyen Quoc, Huong Le Thanh, Hanh Pham Van","doi":"10.31449/inf.v47i3.4761","DOIUrl":null,"url":null,"abstract":"The development of neural models has greatly improved the performance of machine translation, but these methods require large-scale parallel data, which can be difficult to obtain for low-resource language pairs. To address this issue, this research employs a pre-trained multilingual model and fine-tunes it by using a small bilingual dataset. Additionally, two data-augmentation strategies are proposed to generate new training data: (i) back-translation with the dataset from the source language; (ii) data augmentation via the English pivot language. The proposed approach is applied to the Khmer-Vietnamese machine translation. Experimental results show that our proposed approach outperforms the Google Translator model by 5.3% in terms of BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs.","PeriodicalId":56292,"journal":{"name":"Informatica","volume":"22 1","pages":"0"},"PeriodicalIF":3.3000,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Informatica","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31449/inf.v47i3.4761","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

The development of neural models has greatly improved the performance of machine translation, but these methods require large-scale parallel data, which can be difficult to obtain for low-resource language pairs. To address this issue, this research employs a pre-trained multilingual model and fine-tunes it by using a small bilingual dataset. Additionally, two data-augmentation strategies are proposed to generate new training data: (i) back-translation with the dataset from the source language; (ii) data augmentation via the English pivot language. The proposed approach is applied to the Khmer-Vietnamese machine translation. Experimental results show that our proposed approach outperforms the Google Translator model by 5.3% in terms of BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于数据增强策略的低资源神经机器翻译改进
神经模型的发展极大地提高了机器翻译的性能,但这些方法需要大规模的并行数据,而对于低资源的语言对,这些数据很难获得。为了解决这个问题,本研究采用了一个预训练的多语言模型,并通过使用一个小型双语数据集对其进行微调。此外,提出了两种数据增强策略来生成新的训练数据:(i)从源语言反翻译数据集;(ii)通过英语支点语言进行数据增强。将该方法应用于高棉-越南语的机器翻译。实验结果表明,在2000个高棉-越南语句子对的测试集上,我们提出的方法在BLEU得分方面比Google Translator模型高出5.3%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Informatica
Informatica 工程技术-计算机:信息系统
CiteScore
5.90
自引率
6.90%
发文量
19
审稿时长
12 months
期刊介绍: The quarterly journal Informatica provides an international forum for high-quality original research and publishes papers on mathematical simulation and optimization, recognition and control, programming theory and systems, automation systems and elements. Informatica provides a multidisciplinary forum for scientists and engineers involved in research and design including experts who implement and manage information systems applications.
期刊最新文献
Beyond Quasi-Adjoint Graphs: On Polynomial-Time Solvable Cases of the Hamiltonian Cycle and Path Problems Confidential Transaction Balance Verification by the Net Using Non-Interactive Zero-Knowledge Proofs An Improved Algorithm for Extracting Frequent Gradual Patterns Offloaded Data Processing Energy Efficiency Evaluation Demystifying the Stability and the Performance Aspects of CoCoSo Ranking Method under Uncertain Preferences
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1