Detection of Arabic offensive language in social media using machine learning models

Aya Mousa, Ismail Shahin, Ali Bou Nassif, Ashraf Elnagar
Intelligent Systems with Applications, volume 22, Article 200376. Published 2024-04-26. DOI: 10.1016/j.iswa.2024.200376
https://www.sciencedirect.com/science/article/pii/S2667305324000516

Abstract

This research aims to detect different types of Arabic offensive language on Twitter. It uses a multiclass classification system in which each tweet is categorized into one or more offensive language types based on the word(s) used. Five classes are considered: bullying, insult, racism, obscene, and non-offensive. To classify the abusive language, this work presents a cascaded model consisting of Bidirectional Encoder Representations from Transformers (BERT) models (AraBERT, ArabicBERT, XLMRoBERTa, GigaBERT, MBERT, and QARiB), deep learning models (1D-CNN, BiLSTM), and a Radial Basis Function (RBF) classifier. Various traditional machine learning classifiers are also evaluated. The dataset is collected from Twitter and is balanced: each class has the same number of tweets. Each tweet is assigned to one or more of the selected offensive language types to build multiclass and multilabel systems. In addition, a binary dataset is constructed by assigning the tweets to offensive or non-offensive classes. The highest results are obtained by the cascaded model that starts with ArabicBERT, followed by BiLSTM and RBF, with an accuracy, precision, recall, and F1-score of 98.4%, 98.2%, 92.8%, and 98.4%, respectively. Among the traditional classifiers, RBF records the highest results, with 60% for each of accuracy, precision, recall, and F1-score, while KNN records the lowest, with 45%, 46%, 45%, and 43% for accuracy, precision, recall, and F1-score, respectively.
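
The binary dataset construction and the four reported measures can be illustrated with a minimal sketch. The class names come from the abstract; the function names, the toy annotations, and the choice of macro averaging are assumptions made for illustration only, since the abstract does not say how the scores were averaged or how the authors implemented the mapping.

```python
from collections import Counter

# Class names taken from the abstract; everything else is illustrative.
OFFENSIVE_TYPES = {"bullying", "insult", "racism", "obscene"}
NON_OFFENSIVE = "non-offensive"


def to_binary(labels):
    """Collapse a tweet's set of type labels into a single binary label."""
    return "offensive" if OFFENSIVE_TYPES & set(labels) else NON_OFFENSIVE


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1).

    Macro averaging is an assumption; the abstract does not state
    which averaging scheme was used.
    """
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(classes)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical multilabel annotations for four tweets (not from the paper).
annotations = [{"bullying", "insult"}, {"non-offensive"}, {"racism"}, {"non-offensive"}]
y_true = [to_binary(a) for a in annotations]
# Hypothetical classifier output for the same four tweets.
y_pred = ["offensive", "offensive", "offensive", "non-offensive"]

accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

A tweet counts as offensive here as soon as it carries any one of the four offensive types, which is one plausible reading of "assigning the tweets to offensive or non-offensive classes".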
