Transliteration and Byte Pair Encoding to Improve Tamil to Sinhala Neural Machine Translation

Pasindu Tennage, Achini Herath, Malith Thilakarathne, Prabath Sandaruwan, Surangika Ranathunga
Published in: 2018 Moratuwa Engineering Research Conference (MERCon), pp. 390-395, May 2018
DOI: 10.1109/MERCON.2018.8421939 (https://doi.org/10.1109/MERCON.2018.8421939)
Citations: 9

Abstract

Neural Machine Translation (NMT) is the current state-of-the-art machine translation technique. However, the applicability of NMT to language pairs with high morphological variation is still debatable. A lack of language resources, especially a sufficiently large parallel corpus, causes additional problems and leads to very poor translation performance when NMT is applied to such languages. In this paper, we present three techniques to improve domain-specific NMT performance for the under-resourced, morphologically rich language pair Sinhala and Tamil. Of these three techniques, transliteration is a novel approach to improving domain-specific NMT performance for language pairs such as Sinhala and Tamil, which share a common grammatical structure and have moderate lexical similarity. We built the first transliteration system for Sinhala to English and Tamil to English, which achieved an accuracy of 99.6% when tested on the parallel corpus we used for NMT training. Another technique we employed is Byte Pair Encoding (BPE), which has been used to achieve open-vocabulary translation with a fixed vocabulary of subword symbols. Our experiments show that, while translation based on independent BPE models and on pure transliteration performs moderately, integrating transliteration to build a joint BPE model for this language pair improves translation quality by 1.68 BLEU points.
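
As a rough illustration of the idea behind a joint BPE model, the sketch below learns BPE merges from the source and target word lists pooled together, which is only meaningful once both sides are in a shared script (the role transliteration plays here). This is a toy under stated assumptions, not the authors' implementation: the function names `merge_pair`, `learn_bpe`, and `segment` are hypothetical, and the sample strings are placeholder Latin-script tokens, not real Sinhala or Tamil.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge operations from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically so the
        # result is deterministic (an arbitrary choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword symbols by applying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Placeholder tokens standing in for transliterated source and target text.
source_words = ["abx", "aby"]
target_words = ["abz", "cab"]

# "Joint" here means one merge table learned over both sides together.
joint_merges = learn_bpe(source_words + target_words, num_merges=3)
print(segment("abq", joint_merges))
```

Learning one merge table over both sides means that a subword common to the two languages is segmented identically on the source and the target, whereas two independent models could split the same string differently.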