Transliteration Characteristics in Romanized Assamese Language Social Media Text and Machine Transliteration

ACM Transactions on Asian and Low-Resource Language Information Processing (IF 1.8; CAS Region 4, Computer Science; JCR Q3, Computer Science, Artificial Intelligence). Pub Date: 2024-01-06. DOI: 10.1145/3639565
Hemanta Baruah, Sanasam Ranbir Singh, Priyankoo Sarmah

Abstract

This article examines how Assamese is romanized in social media text. Assamese, an Indo-Aryan language, is one of the 22 scheduled languages of India. With the growing popularity of social media in India and the widespread use of the English QWERTY keyboard, Indian users often write in their native languages using the Roman (Latin) script. Unlike some other Asian languages (for example, Chinese with Pinyin), Indian languages have no common standard romanization convention for writing on social media platforms. Assamese and English also have very different orthographies. Considering both the orthographic and phonemic characteristics of the language, this study explains how Assamese vowels, vowel diacritics, and consonants are represented in romanized form. We collected a dataset of romanized Assamese texts from three popular social media sites (Facebook, YouTube, and Twitter) and manually labeled them with their native Assamese script. We also compare the transliterated social media texts against six different Assamese romanization schemes; the comparison shows that Assamese users on social media do not adhere to any fixed romanization scheme. We built three separate character-level transliteration models from our dataset: (1) a traditional phrase-based statistical machine transliteration (PBSMT) model, (2) a BiLSTM neural seq2seq model with attention, and (3) a neural Transformer model. We performed a thorough error analysis of the transliteration results obtained from these three models, which may help in building a more robust machine transliteration system for the Assamese social media domain in the future. Finally, we carried out an attention analysis using attention weight scores taken from the character-level BiLSTM neural seq2seq transliteration model built from our dataset.
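
As a rough illustration of what "character-level" means for models like these, the sketch below splits a romanized word and its native-script counterpart into space-separated characters, which is a common input format for phrase-based and seq2seq toolkits. The abstract does not describe the authors' actual preprocessing, so the function names, the `_` space placeholder, and the example pair ("moi" for the Assamese word for "I") are assumptions made for illustration only.

```python
# Hypothetical helpers: not taken from the paper, only a sketch of
# character-level segmentation for transliteration training data.

def to_char_tokens(text: str, space_symbol: str = "_") -> str:
    """Return the text as space-separated Unicode code points.

    Real spaces are replaced by `space_symbol` so they survive the split.
    """
    return " ".join(space_symbol if ch == " " else ch for ch in text)


def from_char_tokens(tokens: str, space_symbol: str = "_") -> str:
    """Invert to_char_tokens (assumes the text never contains space_symbol)."""
    return "".join(" " if t == space_symbol else t for t in tokens.split(" "))


# Illustrative pair: romanized form and native script (two code points).
romanized, native = "moi", "\u09ae\u0987"

source_line = to_char_tokens(romanized)  # one token per Latin letter
target_line = to_char_tokens(native)     # one token per Assamese code point
print(source_line)
print(target_line)
```

Note that this splits on Unicode code points, so a consonant and its dependent vowel sign become separate tokens; whether to split on code points or on grapheme clusters is a design choice that the abstract leaves open.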

Source journal metrics: CiteScore 3.60; self-citation rate 15.00%; articles published 241.
About the journal: The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high-quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
- Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
- Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
- Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
- Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
- Machine Translation involving Asian or low-resource languages.
- Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
- Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
- Speech processing: including text-to-speech synthesis and automatic speech recognition.
- Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
- Cross-lingual information processing involving Asian or low-resource languages.
Papers that deal with theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.