Improving Sinhala Hate Speech Detection Using Deep Learning
Kavishka Gamage, V. Welgama, R. Weerasinghe
2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer), published 2022-11-30
DOI: 10.1109/ICTer58063.2022.10024103 (https://doi.org/10.1109/ICTer58063.2022.10024103)
Citations: 0
Abstract
Automatic hate speech detection is a fine-grained sentiment analysis task that has attracted many researchers worldwide. The task is difficult because of the use of native languages, distinct vocabularies, and distorted word forms. Previous studies on Sinhala hate speech identification show that it is even harder for low-resource languages such as Sinhala. The effectiveness of pretrained embeddings for Sinhala hate speech detection has not yet been investigated. To address this gap, we investigated several embeddings as well as frequency-based features, including bag of words, n-grams, and TF-IDF. We present results from several machine learning experiments, including deep learning experiments and transfer learning experiments on state-of-the-art cross-lingual transformers. In our study, the XLMR model outperformed the baseline algorithms and the other deep learning models, with an F1-score of 0.764 and a recall of 0.788.
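
The frequency-based features named in the abstract (bag of words, n-grams, and TF-IDF) can be illustrated with a minimal sketch. It is not the authors' pipeline: the whitespace tokenization, the smoothed IDF variant, the function names, and the English placeholder sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return the word n-grams (joined by spaces) of one tokenized document."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Count all word n-grams for n = 1..max_n.

    Assumption: plain whitespace tokenization; a real Sinhala pipeline
    would likely need a language-specific tokenizer and normalization.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(tokens, n))
    return counts


def tfidf(docs, max_n=2):
    """Compute one {term: weight} dictionary per document.

    Assumption: raw term frequency times a smoothed IDF,
    log((1 + N) / (1 + df)) + 1. Other TF-IDF variants exist.
    """
    bags = [bag_of_ngrams(doc, max_n) for doc in docs]
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # each term counted once per document
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]


# Placeholder sentences, not taken from the paper's dataset.
docs = ["you are kind", "you are awful", "awful awful people"]
weights = tfidf(docs)
```

In this sketch, a term that appears in fewer documents (such as `kind`) receives a higher weight than one shared across documents (such as `you`), which is the property that makes TF-IDF more discriminative than raw counts. The resulting sparse dictionaries could then be fed to any baseline classifier.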