Context Aware Back-Transliteration from English to Sinhala

Rushan Nanayakkara, Thilini Nadungodage, Randil Pushpananda
Published in: 2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer)
Publication date: 2022-11-30
DOI: 10.1109/ICTer58063.2022.10024072 (https://doi.org/10.1109/ICTer58063.2022.10024072)

Abstract

Sinhala is widely written on social media using the English alphabet to represent native Sinhala words. Because the standard script of the English language is the Roman script, we refer to Sinhala texts written in the English alphabet as Romanized-Sinhala texts. Representing the text of one language in the alphabet of another is called transliteration. Over time, Sinhala Natural Language Processing (NLP) researchers have developed many systems for processing native Sinhala texts. However, these existing tools cannot process Romanized-Sinhala texts, because they accept only the Sinhala script. Such texts must therefore be transliterated back into the original Sinhala script before existing Sinhala NLP tools can process them. Transliterating a text back into its native alphabet is referred to as back-transliteration. In this study, we present a Transliteration Unit (TU) based system for the back-transliteration of Romanized-Sinhala texts. We also introduce a novel method for converting Romanized-Sinhala text into TU sequences. The system was trained on a primary data set and evaluated on an unseen portion of the same data set as well as on a secondary data set that represents texts from a different context. The proposed model achieved a BLEU score of 0.81 and a METEOR score of 0.78 on the primary data set, and a BLEU score of 0.57 and a METEOR score of 0.47 on the secondary data set.
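
To make the idea of a TU sequence concrete, the following is a minimal sketch of one possible way to split a Romanized word into units: greedy longest-match against a small inventory. This is not the method proposed in the paper, which the abstract does not describe. The inventory `MULTI_CHAR_UNITS`, the function name `segment_into_units`, and the sample inputs are all hypothetical and chosen only for illustration.

```python
# Hypothetical multi-character units (assumption for illustration only;
# this is NOT the unit inventory used in the paper).
MULTI_CHAR_UNITS = {"th", "sh", "ch", "dh", "ng", "aa", "ae", "ee", "oo"}


def segment_into_units(word, units=MULTI_CHAR_UNITS):
    """Split a Romanized word into units by greedy longest match.

    At each position the longest entry of `units` that matches is taken;
    if none matches, the single character is emitted as its own unit.
    """
    lowered = word.lower()
    max_len = max((len(u) for u in units), default=1)
    result = []
    i = 0
    while i < len(lowered):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(lowered) - i), 0, -1):
            candidate = lowered[i:i + size]
            if size == 1 or candidate in units:
                result.append(candidate)
                i += size
                break
    return result


if __name__ == "__main__":
    for sample in ("amma", "thaththa"):
        print(sample, "->", segment_into_units(sample))
```

A greedy split like this ignores context: the same Roman unit can correspond to more than one Sinhala character, which is presumably why a trained, context-aware model is needed for the mapping step rather than a fixed lookup table.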