Hate Speech and Offensive Language Detection in Bengali

Q3 Environmental Science AACL Bioflux Pub Date : 2022-10-07 DOI:10.48550/arXiv.2210.03479
Mithun Das, Somnath Banerjee, Punyajoy Saha, Animesh Mukherjee
Pages: 286-296
Citations: 10

Abstract

Social media often serves as a breeding ground for hateful and offensive content. Identifying such content is crucial due to its impact on race, gender, or religion in an unprejudiced society. However, while hate speech detection in English has been researched extensively, there is a gap in hateful content detection for low-resource languages like Bengali. In addition, a current trend on social media is the use of Romanized Bengali for regular interactions. To address these limitations in existing research, we develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets. We implement several baseline models to classify such hateful posts, and we further explore interlingual transfer mechanisms to boost classification performance. Finally, we perform an in-depth error analysis of the posts the models misclassify. When training on the actual and Romanized datasets separately, we observe that XLM-RoBERTa performs best. With joint training and few-shot training, MuRIL outperforms the other models by interpreting semantic expressions better. We make our code and dataset public.
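The training regimes named in the abstract (separate, joint, and few-shot) can be pictured as different ways of assembling a training set from the two halves of the dataset. The sketch below is only an illustration under stated assumptions: the helper name `make_training_sets`, the `k_shot` parameter, the binary placeholder labels, and the choice to add a few Romanized examples to the actual-script data are all hypothetical, since the abstract does not specify the exact protocol.

```python
import random


def make_training_sets(actual, romanized, k_shot=32, seed=0):
    """Assemble training sets for separate, joint, and few-shot regimes.

    Hypothetical helper: the abstract does not describe how the authors
    built their splits, so this is one plausible reading only.
    """
    rng = random.Random(seed)
    # Few-shot: a small sample from the target script (assumed here to be
    # Romanized) is added to the full source-script data.
    few = rng.sample(romanized, min(k_shot, len(romanized)))
    return {
        "actual_only": list(actual),
        "romanized_only": list(romanized),
        "joint": list(actual) + list(romanized),
        "few_shot_to_romanized": list(actual) + few,
    }


# Toy (text, label) pairs standing in for annotated posts; not real data.
actual = [(f"actual post {i}", i % 2) for i in range(100)]
romanized = [(f"romanized post {i}", i % 2) for i in range(100)]

sets = make_training_sets(actual, romanized, k_shot=8)
print({name: len(rows) for name, rows in sets.items()})
```

Each resulting list could then be passed to whichever classifier is being compared, so that differences in performance reflect the training regime rather than the model code.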