Aya Mousa , Ismail Shahin , Ali Bou Nassif , Ashraf Elnagar
{"title":"利用机器学习模型检测社交媒体中的阿拉伯语攻击性语言","authors":"Aya Mousa , Ismail Shahin , Ali Bou Nassif , Ashraf Elnagar","doi":"10.1016/j.iswa.2024.200376","DOIUrl":null,"url":null,"abstract":"<div><p>This research aims to detect different types of Arabic offensive language in twitter. It uses a multiclass classification system in which each tweet is categorized into one or more of the offensive language types based on the used word(s). In this study, five types are classified, which are: bullying, insult, racism, obscene, and non-offensive. To classify the abusive language, a cascaded model consisting of Bidirectional Encoder Representation of Transformers (BERT) models (AraBERT, ArabicBERT, XLMRoBERTa, GigaBERT, MBERT, and QARiB), deep learning models (1D-CNN, BiLSTM), and Radial Basis Function (RBF) is presented in this work. In addition, various types of machine learning models are utilized. The dataset is collected from twitter in which each class has the same number of tweets (balanced dataset). Each tweet is assigned to one or more of the selected offensive language types to build multiclass and multilabel systems. In addition, a binary dataset is constructed by assigning the tweets to offensive or non-offensive classes. The highest results are obtained from implementing the cascaded model started by ArabicBERT followed by BiLSTM and RBF with an accuracy, precision, recall, and F1-score of 98.4%, 98.2%,92.8%, and 98.4%, respectively. RBF records the highest results among the utilized traditional classifiers with an accuracy, precision, recall, and F1-score of 60% for each measurement individually, while KNN records the lowest results obtaining 45%, 46%, 45%, and 43% in terms of accuracy, precision, recall, and F1-score, respectively.</p></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"22 ","pages":"Article 200376"},"PeriodicalIF":0.0000,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667305324000516/pdfft?md5=f5155135e406793f134b79e0164c3049&pid=1-s2.0-S2667305324000516-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Detection of Arabic offensive language in social media using machine learning models\",\"authors\":\"Aya Mousa , Ismail Shahin , Ali Bou Nassif , Ashraf Elnagar\",\"doi\":\"10.1016/j.iswa.2024.200376\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>This research aims to detect different types of Arabic offensive language in twitter. It uses a multiclass classification system in which each tweet is categorized into one or more of the offensive language types based on the used word(s). In this study, five types are classified, which are: bullying, insult, racism, obscene, and non-offensive. To classify the abusive language, a cascaded model consisting of Bidirectional Encoder Representation of Transformers (BERT) models (AraBERT, ArabicBERT, XLMRoBERTa, GigaBERT, MBERT, and QARiB), deep learning models (1D-CNN, BiLSTM), and Radial Basis Function (RBF) is presented in this work. In addition, various types of machine learning models are utilized. The dataset is collected from twitter in which each class has the same number of tweets (balanced dataset). Each tweet is assigned to one or more of the selected offensive language types to build multiclass and multilabel systems. In addition, a binary dataset is constructed by assigning the tweets to offensive or non-offensive classes. The highest results are obtained from implementing the cascaded model started by ArabicBERT followed by BiLSTM and RBF with an accuracy, precision, recall, and F1-score of 98.4%, 98.2%,92.8%, and 98.4%, respectively. RBF records the highest results among the utilized traditional classifiers with an accuracy, precision, recall, and F1-score of 60% for each measurement individually, while KNN records the lowest results obtaining 45%, 46%, 45%, and 43% in terms of accuracy, precision, recall, and F1-score, respectively.</p></div>\",\"PeriodicalId\":100684,\"journal\":{\"name\":\"Intelligent Systems with Applications\",\"volume\":\"22 \",\"pages\":\"Article 200376\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2667305324000516/pdfft?md5=f5155135e406793f134b79e0164c3049&pid=1-s2.0-S2667305324000516-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Intelligent Systems with Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2667305324000516\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Systems with Applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2667305324000516","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
本研究旨在检测 twitter 中不同类型的阿拉伯语攻击性语言。它采用多类分类系统,根据使用的单词将每条推文归入一种或多种攻击性语言类型。在本研究中,共分为五种类型:欺凌、侮辱、种族主义、淫秽和非攻击性。为了对辱骂性语言进行分类,本研究提出了一个级联模型,该模型由双向变压器编码器表征(BERT)模型(AraBERT、ArabicBERT、XLMRoBERTa、GigaBERT、MBERT 和 QARiB)、深度学习模型(1D-CNN、BiLSTM)和径向基函数(RBF)组成。此外,还使用了各种类型的机器学习模型。数据集收集自 twitter,其中每个类别都有相同数量的推文(平衡数据集)。每条推文都被分配到一个或多个选定的攻击性语言类型中,以建立多类别和多标签系统。此外,还通过将推文分配到攻击性或非攻击性类别来构建二元数据集。在实施级联模型时,ArabicBERT 的结果最高,其次是 BiLSTM 和 RBF,准确率、精确率、召回率和 F1 分数分别为 98.4%、98.2%、92.8% 和 98.4%。在所使用的传统分类器中,RBF 的结果最高,准确率、精确度、召回率和 F1 分数均达到 60%,而 KNN 的结果最低,准确率、精确度、召回率和 F1 分数分别为 45%、46%、45% 和 43%。
Detection of Arabic offensive language in social media using machine learning models
This research aims to detect different types of Arabic offensive language in twitter. It uses a multiclass classification system in which each tweet is categorized into one or more of the offensive language types based on the used word(s). In this study, five types are classified, which are: bullying, insult, racism, obscene, and non-offensive. To classify the abusive language, a cascaded model consisting of Bidirectional Encoder Representation of Transformers (BERT) models (AraBERT, ArabicBERT, XLMRoBERTa, GigaBERT, MBERT, and QARiB), deep learning models (1D-CNN, BiLSTM), and Radial Basis Function (RBF) is presented in this work. In addition, various types of machine learning models are utilized. The dataset is collected from twitter in which each class has the same number of tweets (balanced dataset). Each tweet is assigned to one or more of the selected offensive language types to build multiclass and multilabel systems. In addition, a binary dataset is constructed by assigning the tweets to offensive or non-offensive classes. The highest results are obtained from implementing the cascaded model started by ArabicBERT followed by BiLSTM and RBF with an accuracy, precision, recall, and F1-score of 98.4%, 98.2%,92.8%, and 98.4%, respectively. RBF records the highest results among the utilized traditional classifiers with an accuracy, precision, recall, and F1-score of 60% for each measurement individually, while KNN records the lowest results obtaining 45%, 46%, 45%, and 43% in terms of accuracy, precision, recall, and F1-score, respectively.