Impacts of Homophone Normalization on Semantic Models for Amharic

Tadesse Destaw Belay, A. Ayele, G. Gelaye, Seid Muhie Yimam, Chris Biemann
Published in: 2021 International Conference on Information and Communication Technology for Development for Africa (ICT4DA)
Publication date: 2021-11-22
DOI: 10.1109/ict4da53266.2021.9672229 (https://doi.org/10.1109/ict4da53266.2021.9672229)
Citations: 5

Abstract

Amharic is the second most widely spoken Semitic language after Arabic and serves as the official working language of the government of Ethiopia. Amharic writing contains different characters with the same sound, which are called homophones. The current trend in Amharic NLP research is to normalize homophones into a single representation: each group of characters that share a sound is replaced by one character of that group [the Amharic characters, which the authors transliterate in IPA notation, are missing from this copy]. This is done on the assumption that such characters are redundant letters because they have the same sound. However, the impact of homophone normalization on Amharic NLP applications is not well studied. When one homophone character is substituted for another, the meaning of the word can change, and the substitution goes against Amharic writing regulations. For example, one word means "poverty" while its homophone means "salvage" [the two Amharic words are missing from this copy]; the two words sound the same but have different meanings.

To study the impacts of homophone normalization, we develop different general-purpose pre-trained embedding models for Amharic using regular and normalized homophone characters. We fine-tune the pre-trained models and build several Amharic NLP applications. For PoS tagging, a model built on the regular (unnormalized) FLAIR embeddings performs better, achieving an F1-score of 77%. For sentiment analysis, the model built on the regular RoBERTa embeddings outperforms the other models with an F1-score of 60%. For information retrieval (IR) systems, we achieve an F1-score of 90% using normalized documents. The results show that the effect of normalization depends heavily on the NLP application: for sentiment analysis and PoS tagging, normalization has a negative impact, while it is essential for IR. Our research indicates that normalization should be applied with caution and that more effort should be put into standardization.
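The homophone normalization described in the abstract can be sketched as a character-to-character mapping. The block below is a minimal illustration, not the authors' implementation: the homophone groups, the choice of canonical character in each group, and the names `HOMOPHONE_SERIES`, `build_table`, and `normalize` are all assumptions made for this example. It relies on the Unicode Ethiopic block, where each consonant series occupies consecutive code points for its seven basic vowel orders.

```python
# Hypothetical homophone groups (an assumption for illustration, not the
# mapping table used in the paper). Each key is the first code point of a
# variant series; each value is the first code point of the series that is
# treated as canonical here.
HOMOPHONE_SERIES = {
    0x1210: 0x1200,  # ሐ-series -> ሀ-series
    0x1280: 0x1200,  # ኀ-series -> ሀ-series
    0x1220: 0x1230,  # ሠ-series -> ሰ-series
    0x12D0: 0x12A0,  # ዐ-series -> አ-series
    0x1340: 0x1338,  # ፀ-series -> ጸ-series
}

# Number of basic vowel orders mapped per series.
ORDERS = 7


def build_table(series=HOMOPHONE_SERIES, orders=ORDERS):
    """Expand the series-level mapping into a per-character translation table."""
    return {
        src + i: dst + i
        for src, dst in series.items()
        for i in range(orders)
    }


TABLE = build_table()


def normalize(text):
    """Replace every mapped homophone character with its canonical form."""
    return text.translate(TABLE)


if __name__ == "__main__":
    # Two spellings that differ only in one homophone character
    # (assumed example pair, written with escapes for clarity).
    variant = "\u12f5\u1285\u1290\u1275"    # contains a ኀ-series character
    canonical = "\u12f5\u1205\u1290\u1275"  # contains the ሀ-series counterpart
    # After normalization the two spellings are indistinguishable.
    print(normalize(variant) == normalize(canonical))
```

Because the mapping is many-to-one, it cannot be inverted: once two spellings collapse to the same string, a downstream model can no longer tell them apart. This is consistent with the abstract's finding that normalization helps retrieval, where matching spelling variants is desirable, but hurts tasks that depend on word meaning.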