Arabic Toxic Tweet Classification: Leveraging the AraBERT Model

Big Data and Cognitive Computing | IF 3.7 | JCR Q2 (Computer Science, Artificial Intelligence) | Pub Date: 2023-10-26 | DOI: 10.3390/bdcc7040170
Amr Mohamed El Koshiry, Entesar Hamed I. Eliwa, Tarek Abd El-Hafeez, Ahmed Omar
Citation count: 0

Abstract

Social media platforms have become the primary means of communication and information sharing, facilitating interactive exchanges among users. Unfortunately, these platforms also witness the dissemination of inappropriate and toxic content, including hate speech and insults. While significant efforts have been made to classify toxic content in the English language, the same level of attention has not been given to Arabic texts. This study addresses this gap by constructing a standardized Arabic dataset specifically designed for toxic tweet classification. The dataset is annotated automatically using Google’s Perspective API and the expertise of three native Arabic speakers and linguists. To evaluate the performance of different models, we conduct a series of experiments using seven models: long short-term memory (LSTM), bidirectional LSTM, a convolutional neural network, a gated recurrent unit (GRU), bidirectional GRU, multilingual bidirectional encoder representations from transformers, and AraBERT. Additionally, we employ word embedding techniques. Our experimental findings demonstrate that the fine-tuned AraBERT model surpasses the performance of other models, achieving an impressive accuracy of 0.9960. Notably, this accuracy value outperforms similar approaches reported in recent literature. This study represents a significant advancement in Arabic toxic tweet classification, shedding light on the importance of addressing toxicity in social media platforms while considering diverse languages and cultures.
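The abstract reports its headline result as an accuracy of 0.9960. As a reminder of what that metric measures, here is a minimal sketch of accuracy for a binary toxic/non-toxic classifier: the share of predictions that match the reference labels. The function name `accuracy` and the toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty label set")
    # Count positions where the predicted label equals the reference label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy labels (1 = toxic, 0 = non-toxic); NOT data from the study.
reference = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 0, 0]
print(accuracy(reference, predicted))
```

Accuracy alone can look high when one class dominates the data, so it is usually read alongside per-class precision and recall; the abstract itself reports only accuracy.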
Source journal
Big Data and Cognitive Computing (Business, Management and Accounting: Management Information Systems)
CiteScore: 7.10
Self-citation rate: 8.10%
Articles published: 128
Review time: 11 weeks