Hate Speech and Offensive Language Detection in Bengali

Q3 Environmental Science | AACL Bioflux | Pub Date: 2022-10-07 | DOI: 10.48550/arXiv.2210.03479

Mithun Das, Somnath Banerjee, Punyajoy Saha, Animesh Mukherjee

Citations: 10

Hate Speech and Offensive Language Detection in Bengali

Social media often serves as a breeding ground for various hateful and offensive content. Identifying such content on social media is crucial due to its impact on the race, gender, or religion in an unprejudiced society. However, while there is extensive research in hate speech detection in English, there is a gap in hateful content detection in low-resource languages like Bengali. Besides, a current trend on social media is the use of Romanized Bengali for regular interactions. To overcome the existing research’s limitations, in this study, we develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets. We implement several baseline models for the classification of such hateful posts. We further explore the interlingual transfer mechanism to boost classification performance. Finally, we perform an in-depth error analysis by looking into the misclassified posts by the models. While training actual and Romanized datasets separately, we observe that XLM-Roberta performs the best. Further, we witness that on joint training and few-shot training, MuRIL outperforms other models by interpreting the semantic expressions better. We make our code and dataset public for others.

Source journal

AACL Bioflux Environmental Science-Management, Monitoring, Policy and Law

CiteScore

1.40

Self-citation rate

0.00%

Publication count