Statistical Machine Learning for Transliteration: Transliterating Names between Sinhala, Tamil and English

H. S. Priyadarshani, M. Rajapaksha, M. M. S. P. Ranasinghe, Kengatharaiyer Sarveswaran, G. Dias
2019 International Conference on Asian Language Processing (IALP), published 2019-11-01
DOI: 10.1109/IALP48816.2019.9037651 (https://doi.org/10.1109/IALP48816.2019.9037651)
Citations: 3

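Abstract

In this paper, we focus on building models for the transliteration of personal names between the primary languages of Sri Lanka, namely Sinhala, Tamil and English. Currently, a rule-based system is used to transliterate names between Sinhala and Tamil. However, we found that it fails in several cases. Further, no system was available to transliterate names into English. We present a hybrid approach that combines machine learning and statistical machine translation to perform the transliteration. We built a parallel trilingual corpus of personal names. We then trained a machine learner to classify names by ethnicity, as we found ethnicity to be an influencing factor in transliteration. Finally, we treated transliteration as a translation problem and applied statistical machine translation to generate the most probable transliteration for each personal name. The system shows very promising results compared with the existing rule-based system. It gives a BLEU score of 89 in all the test cases, and its top BLEU score, 93.7, is for Sinhala-to-English transliteration.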
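Treating transliteration as a translation problem usually means giving the translation system sub-word units in place of words. The abstract does not say which toolkit or segmentation the authors used, so the following is only a minimal sketch under stated assumptions: each name is split into space-separated character units, combining marks (such as Sinhala or Tamil vowel signs) stay attached to their base character, and a hypothetical `_` token marks spaces inside a name. The function names `to_char_tokens` and `from_char_tokens` are illustrative and do not come from the paper.

```python
import unicodedata

# Hypothetical marker for a space inside a multi-part name (an assumption,
# not something described in the paper).
BOUNDARY = "_"


def to_char_tokens(name: str) -> str:
    """Split a name into space-separated character units.

    Combining marks (Unicode categories Mn, Mc, Me) are attached to the
    preceding base character. This is a simplification of full grapheme
    cluster segmentation.
    """
    units = []
    for ch in name.strip():
        if ch.isspace():
            units.append(BOUNDARY)
        elif (
            units
            and units[-1] != BOUNDARY
            and unicodedata.category(ch).startswith("M")
        ):
            units[-1] += ch  # keep the mark with its base character
        else:
            units.append(ch)
    return " ".join(units)


def from_char_tokens(tokens: str) -> str:
    """Invert to_char_tokens: join the units and restore spaces."""
    return "".join(" " if t == BOUNDARY else t for t in tokens.split())


if __name__ == "__main__":
    print(to_char_tokens("Nimal Perera"))
    print(from_char_tokens(to_char_tokens("Nimal Perera")))
```

With names segmented this way on both sides of a parallel corpus, a statistical machine translation system would learn alignments between character sequences instead of word sequences, and its output could be mapped back to a plain name with `from_char_tokens`.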