A Study for Enhancing Low-resource Thai-Myanmar-English Neural Machine Translation

IF 1.8 · CAS Zone 4 (Computer Science) · JCR Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE) · ACM Transactions on Asian and Low-Resource Language Information Processing · Pub Date: 2024-02-13 · DOI: 10.1145/3645111
Mya Ei San, Sasiporn Usanavasin, Ye Kyaw Thu, Manabu Okumura
{"title":"加强低资源泰缅英神经机器翻译的研究","authors":"Mya Ei San, Sasiporn Usanavasin, Ye Kyaw Thu, Manabu Okumura","doi":"10.1145/3645111","DOIUrl":null,"url":null,"abstract":"<p>Several methodologies have recently been proposed to enhance the performance of low-resource Neural Machine Translation (NMT). However, these techniques have yet to be explored thoroughly in low-resource Thai and Myanmar languages. Therefore, we first applied augmentation techniques such as SwitchOut and Ciphertext Based Data Augmentation (CipherDAug) to improve NMT performance in these languages. We secondly enhanced the NMT performance by fine-tuning the pre-trained Multilingual Denoising BART model (mBART), where BART denotes Bidirectional and Auto-Regressive Transformer. We implemented three NMT systems: namely, Transformer+SwitchOut, Multi-source Transformer+CipherDAug, and fine-tuned mBART in the bidirectional translations of Thai-English-Myanmar language pairs from the ASEAN-MT corpus. Experimental results showed that Multi-source Transformer+CipherDAug significantly improved BLEU, ChrF, and TER scores over the first baseline Transformer and second baseline Edit-Based Transformer (EDITOR). The model achieved notable BLEU scores: 37.9 (English-to-Thai), 42.7 (Thai-to-English), 28.9 (English-to-Myanmar), 31.2 (Myanmar-to-English), 25.3 (Thai-to-Myanmar), and 25.5 (Myanmar-to-Thai). The fine-tuned mBART model also considerably outperformed the two baselines, except for the Myanmar-to-English pair. SwitchOut improved over the second baseline in all pairs and performed similarly to the first baseline in most cases. Lastly, we performed detailed analyses verifying that the CipherDAug and mBART models potentially facilitate improving low-resource NMT performance in Thai and Myanmar languages.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":1.8000,"publicationDate":"2024-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Study for Enhancing Low-resource Thai-Myanmar-English Neural Machine Translation\",\"authors\":\"Mya Ei San, Sasiporn Usanavasin, Ye Kyaw Thu, Manabu Okumura\",\"doi\":\"10.1145/3645111\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Several methodologies have recently been proposed to enhance the performance of low-resource Neural Machine Translation (NMT). However, these techniques have yet to be explored thoroughly in low-resource Thai and Myanmar languages. Therefore, we first applied augmentation techniques such as SwitchOut and Ciphertext Based Data Augmentation (CipherDAug) to improve NMT performance in these languages. We secondly enhanced the NMT performance by fine-tuning the pre-trained Multilingual Denoising BART model (mBART), where BART denotes Bidirectional and Auto-Regressive Transformer. We implemented three NMT systems: namely, Transformer+SwitchOut, Multi-source Transformer+CipherDAug, and fine-tuned mBART in the bidirectional translations of Thai-English-Myanmar language pairs from the ASEAN-MT corpus. Experimental results showed that Multi-source Transformer+CipherDAug significantly improved BLEU, ChrF, and TER scores over the first baseline Transformer and second baseline Edit-Based Transformer (EDITOR). 
The model achieved notable BLEU scores: 37.9 (English-to-Thai), 42.7 (Thai-to-English), 28.9 (English-to-Myanmar), 31.2 (Myanmar-to-English), 25.3 (Thai-to-Myanmar), and 25.5 (Myanmar-to-Thai). The fine-tuned mBART model also considerably outperformed the two baselines, except for the Myanmar-to-English pair. SwitchOut improved over the second baseline in all pairs and performed similarly to the first baseline in most cases. Lastly, we performed detailed analyses verifying that the CipherDAug and mBART models potentially facilitate improving low-resource NMT performance in Thai and Myanmar languages.</p>\",\"PeriodicalId\":54312,\"journal\":{\"name\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2024-02-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3645111\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3645111","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Several methodologies have recently been proposed to enhance the performance of low-resource Neural Machine Translation (NMT). However, these techniques have yet to be explored thoroughly in the low-resource Thai and Myanmar languages. Therefore, we first applied augmentation techniques such as SwitchOut and Ciphertext-Based Data Augmentation (CipherDAug) to improve NMT performance in these languages. Second, we enhanced NMT performance by fine-tuning the pre-trained Multilingual Denoising BART model (mBART), where BART denotes Bidirectional and Auto-Regressive Transformer. We implemented three NMT systems, namely Transformer+SwitchOut, Multi-source Transformer+CipherDAug, and fine-tuned mBART, for the bidirectional translation of Thai-English-Myanmar language pairs from the ASEAN-MT corpus. Experimental results showed that Multi-source Transformer+CipherDAug significantly improved BLEU, ChrF, and TER scores over the first baseline Transformer and the second baseline Edit-Based Transformer (EDITOR). The model achieved notable BLEU scores: 37.9 (English-to-Thai), 42.7 (Thai-to-English), 28.9 (English-to-Myanmar), 31.2 (Myanmar-to-English), 25.3 (Thai-to-Myanmar), and 25.5 (Myanmar-to-Thai). The fine-tuned mBART model also considerably outperformed the two baselines, except for the Myanmar-to-English pair. SwitchOut improved over the second baseline in all pairs and performed similarly to the first baseline in most cases. Lastly, we performed detailed analyses verifying that the CipherDAug and mBART models potentially facilitate improving low-resource NMT performance in the Thai and Myanmar languages.
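For readers unfamiliar with the techniques named in the abstract, the sketches below illustrate their core ideas. They are illustrative re-implementations under stated assumptions, not the authors' code; all function names, data, and hyperparameters in them are ours. SwitchOut (Wang et al., 2018) corrupts a sentence by sampling how many tokens to replace (the paper uses a Hamming-distance-based distribution; the sampler below is simplified) and substituting those positions with words drawn uniformly from the vocabulary:

```python
import math
import random

def switchout(tokens, vocab, tau=1.0):
    """SwitchOut-style augmentation (illustrative sketch, not the
    authors' implementation). Samples how many tokens to corrupt,
    then replaces those positions with uniform-random vocab words."""
    n = len(tokens)
    # Simplified sampler: p(k) proportional to exp(-k / tau) for k = 0..n;
    # a smaller tau concentrates mass on fewer replacements.
    weights = [math.exp(-k / tau) for k in range(n + 1)]
    k = random.choices(range(n + 1), weights=weights, k=1)[0]
    out = list(tokens)
    for i in random.sample(range(n), k):
        out[i] = random.choice(vocab)  # uniform replacement token
    return out

# Toy usage: augment one tokenized sentence with a toy vocabulary.
vocab = ["the", "cat", "dog", "rice", "market", "walks"]
print(switchout("we go to the market".split(), vocab, tau=1.0))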
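CipherDAug (Kambhatla et al., 2022) generates extra parallel data by deterministically enciphering the source side with ROT-k ciphers and pairing the ciphertext with the unchanged target; the original method rotates subword-vocabulary indices, so the character-level cipher below is only a toy sketch of the idea:

```python
import string

def rot_k(text, k=1):
    """Toy ROT-k substitution cipher over lowercase ASCII. CipherDAug
    itself rotates vocabulary indices; this character-level version only
    illustrates the enciphering idea."""
    alpha = string.ascii_lowercase
    table = str.maketrans(alpha, alpha[k:] + alpha[:k])
    return text.translate(table)

# Pair the enciphered source with the original target to create extra
# parallel data, then train a multi-source model on both versions.
src, tgt = "we go to the market", "<target-language reference sentence>"
aug_src = rot_k(src, k=1)  # -> "xf hp up uif nbslfu", a synthetic "source language"
```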
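Fine-tuning mBART on a bilingual pair follows the standard sequence-to-sequence recipe. Below is a minimal sketch using the Hugging Face transformers library, assuming the mbart-large-50 checkpoint and its en_XX/th_TH language codes; the paper may use a different checkpoint or toolkit:

```python
# Minimal English-to-Thai mBART fine-tuning step (illustrative assumptions:
# checkpoint, language codes, and the toy sentence pair are ours).
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="th_TH")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# Tokenize a (source, target) pair; text_target handles the decoder side.
batch = tokenizer(["we go to the market"],
                  text_target=["เราไปตลาด"],
                  return_tensors="pt", padding=True)

loss = model(**batch).loss  # standard cross-entropy fine-tuning loss
loss.backward()             # then step an optimizer such as AdamW
```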

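The BLEU, ChrF, and TER metrics reported above can be computed with the sacrebleu library; a small usage example on toy data:

```python
import sacrebleu

hyps = ["we go to the market"]
refs = [["we went to the market"]]  # one list per reference stream

print(sacrebleu.corpus_bleu(hyps, refs).score)  # BLEU (higher is better)
print(sacrebleu.corpus_chrf(hyps, refs).score)  # ChrF (higher is better)
print(sacrebleu.corpus_ter(hyps, refs).score)   # TER (lower is better)
```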
Source Journal
CiteScore: 3.60 · Self-citation rate: 15.00% · Articles published: 241
Journal Description
The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high-quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
- Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g., parsing), computational semantics, computational pragmatics, etc.
- Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
- Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
- Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
- Machine Translation involving Asian or low-resource languages.
- Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
- Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
- Speech processing: including text-to-speech synthesis and automatic speech recognition.
- Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
- Cross-lingual information processing involving Asian or low-resource languages.
Papers that deal in theory, systems design, evaluation, and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.