Context Aware Back-Transliteration from English to Sinhala

2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer). Pub Date: 2022-11-30. DOI: 10.1109/ICTer58063.2022.10024072

Rushan Nanayakkara, Thilini Nadungodage, Randil Pushpananda



Abstract

Sinhala is widely written on social media using the English alphabet to represent native Sinhala words. Because the standard script of English is the Roman script, we refer to Sinhala text written in the English alphabet as Romanized-Sinhala text. Representing text of one language using the alphabet of another is called transliteration. Over time, Sinhala Natural Language Processing (NLP) researchers have developed many systems for processing native Sinhala text. However, these tools cannot process Romanized-Sinhala text because they accept only Sinhala script. Such text must therefore be converted back into its original Sinhala script before existing Sinhala NLP tools can be applied. Converting transliterated text back into its native script is referred to as back-transliteration. In this study, we present a Transliteration Unit (TU) based system for the back-transliteration of Romanized-Sinhala text. We also introduce a novel method for converting Romanized-Sinhala text into TU sequences. The system was trained on a primary data set and evaluated on an unseen portion of the same data set as well as on a secondary data set containing text from a different context. The proposed model achieved a BLEU score of 0.81 and a METEOR score of 0.78 on the primary data set, and a BLEU score of 0.57 and a METEOR score of 0.47 on the secondary data set.
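The abstract does not describe how Romanized-Sinhala text is split into TU sequences, so the following is only a minimal sketch of one common way to do such a split: greedy longest-match segmentation against a unit inventory. It is not the authors' method. The inventory `MULTI_CHAR_UNITS` and the function name `to_tu_sequence` are hypothetical and chosen purely for illustration.

```python
# Hypothetical inventory of multi-character units (an assumption, NOT from the paper).
MULTI_CHAR_UNITS = {"th", "dh", "sh", "ch", "aa", "ae", "ee", "oo", "ng"}


def to_tu_sequence(word, units=MULTI_CHAR_UNITS):
    """Split a romanized word into units using greedy longest-match.

    Any character that does not start a known multi-character unit
    is emitted as a single-character unit.
    """
    max_len = max((len(u) for u in units), default=1)
    word = word.lower()
    sequence = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; fall back to one character.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if size == 1 or chunk in units:
                sequence.append(chunk)
                i += size
                break
    return sequence


if __name__ == "__main__":
    for w in ["amma", "aththa", "Sheetha"]:
        print(w, to_tu_sequence(w))
```

A segmentation of this kind yields only the input side of the problem. Choosing the Sinhala output for each unit, and using the surrounding context to resolve ambiguous units, is the part the paper's system addresses and is not shown here.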


