Tadesse Destaw Belay, A. Ayele, G. Gelaye, Seid Muhie Yimam, Chris Biemann
Title: Impacts of Homophone Normalization on Semantic Models for Amharic
Published in: 2021 International Conference on Information and Communication Technology for Development for Africa (ICT4DA)
Publication date: 2021-11-22
DOI: 10.1109/ict4da53266.2021.9672229 (https://doi.org/10.1109/ict4da53266.2021.9672229)
Citations: 5
Abstract
Amharic is the second most widely spoken Semitic language after Arabic and serves as the official working language of the government of Ethiopia. Amharic writing contains different characters that represent the same sound, called homophones. The current trend in Amharic NLP research is to normalize homophones into a single representation: each group of characters with the same sound is replaced by one character from that group (Amharic characters are transliterated using IPA notation). This is done on the assumption that the characters are redundant because they have the same sound. However, the impact of homophone normalization on Amharic NLP applications is not well studied. Substituting one homophone character for another can change the meaning of a word, and it goes against Amharic writing rules. For example, two words that are homophones and differ only in such a character mean “poverty” and “salvage”, respectively. To study the impacts of homophone normalization, we develop general-purpose pre-trained embedding models for Amharic using both regular and normalized homophone characters. We fine-tune the pre-trained models and build several Amharic NLP applications. For PoS tagging, a model that uses regular FLAIR embeddings performs better, achieving an F1-score of 77%. For sentiment analysis, the model built on regular RoBERTa embeddings outperforms the other models with an F1-score of 60%. For IR systems, we achieve an F1-score of 90% using normalized documents. The results show that the effect of normalization depends heavily on the NLP application: it has a negative impact on sentiment analysis and PoS tagging, while it is essential for IR. Our research indicates that normalization should be applied with caution and that more effort should be devoted to standardization.
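To make the idea of homophone normalization concrete, the following is a minimal Python sketch, not the authors' implementation. The homophone groups in `HOMOPHONE_SERIES`, the choice of canonical character for each group, and the restriction to the seven basic vowel orders are all assumptions made for illustration; the function names are hypothetical. The sketch relies on the fact that each Ethiopic consonant series occupies consecutive Unicode code points, one per vowel order.

```python
# Assumed homophone groups (illustrative only, not taken from the paper):
# (first code point of the variant series, first code point of the canonical series).
HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ሐ-series -> ሀ-series
    (0x1280, 0x1200),  # ኀ-series -> ሀ-series
    (0x12B8, 0x1200),  # ኸ-series -> ሀ-series
    (0x1220, 0x1230),  # ሠ-series -> ሰ-series
    (0x12D0, 0x12A0),  # ዐ-series -> አ-series
    (0x1340, 0x1338),  # ፀ-series -> ጸ-series
]

# Assumption: only the seven basic vowel orders are normalized.
BASIC_ORDERS = 7


def build_translation_table():
    """Map every variant character to the canonical character of the same vowel order."""
    table = {}
    for variant_start, canonical_start in HOMOPHONE_SERIES:
        for order in range(BASIC_ORDERS):
            table[variant_start + order] = canonical_start + order
    return table


TRANSLATION_TABLE = build_translation_table()


def normalize_homophones(text: str) -> str:
    """Replace each homophone variant with its canonical character."""
    return text.translate(TRANSLATION_TABLE)


# Two spellings chosen by us for illustration; they differ only in one
# homophone character (U+1205 vs. U+1285).
spelling_a = "\u12f5\u1205\u1290\u1275"  # ድህነት
spelling_b = "\u12f5\u1285\u1290\u1275"  # ድኅነት

# After normalization the two spellings become indistinguishable.
same_after_normalization = (
    normalize_homophones(spelling_a) == normalize_homophones(spelling_b)
)
```

Under these assumptions, two distinct spellings collapse to the same string. That reduces vocabulary sparsity, which plausibly helps retrieval, but it also discards any meaning distinction the spelling carried, which is the trade-off the abstract describes.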