Context Aware Back-Transliteration from English to Sinhala
Authors: Rushan Nanayakkara, Thilini Nadungodage, Randil Pushpananda
Published in: 2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer), 2022-11-30
DOI: 10.1109/ICTer58063.2022.10024072 (https://doi.org/10.1109/ICTer58063.2022.10024072)
Citations: 0
Abstract
Sinhala is widely written on social media using the English alphabet to represent native Sinhala words. Because the standard script of the English language is the Roman script, we refer to Sinhala texts written with the English alphabet as Romanized-Sinhala texts. Representing the text of one language using the alphabet of another is called transliteration. Over time, Sinhala Natural Language Processing (NLP) researchers have developed many systems for processing native Sinhala texts. However, these existing tools cannot process Romanized-Sinhala texts, because they accept only Sinhala script. Such texts must therefore be transliterated back into their original Sinhala script before existing Sinhala NLP tools can process them. Transliterating a text back into its native script is referred to as back-transliteration. In this study, we present a Transliteration Unit (TU) based system for the back-transliteration of Romanized-Sinhala texts. We also introduce a novel method for converting Romanized-Sinhala text into TU sequences. The system was trained on a primary data set and evaluated on an unseen portion of that data set as well as on a secondary data set containing texts from a different context. The proposed model achieved a BLEU score of 0.81 and a METEOR score of 0.78 on the primary data set, and a BLEU score of 0.57 and a METEOR score of 0.47 on the secondary data set.
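
The abstract does not say how Romanized-Sinhala text is converted into TU sequences, so the following is only a rough illustration of the general idea of segmenting a romanized word into units, not the authors' method. It assumes a greedy longest-match strategy and a made-up unit inventory; the names `TOY_UNITS` and `to_units` are hypothetical.

```python
# Hypothetical multi-letter units for illustration only; NOT taken from the paper.
TOY_UNITS = {"th", "sh", "ch", "aa", "ee", "oo", "ae"}


def to_units(word, units=TOY_UNITS):
    """Split a romanized word into units by greedy longest match.

    At each position the longest chunk found in `units` is taken;
    if none matches, the single character is emitted as its own unit.
    """
    max_len = max((len(u) for u in units), default=1)
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if size == 1 or chunk in units:
                out.append(chunk)
                i += size
                break
    return out


print(to_units("thaaththa"))  # ['th', 'aa', 'th', 'th', 'a']
```

A real system would also need a mapping from each unit to Sinhala script and some way to use context to choose among competing candidates; this sketch attempts neither.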