基于深度学习和变换器语言模型的网络欺凌文本识别

Khalid Saifullah, Muhammad Ibrahim Khan, Suhaima Jamal, Iqbal H. Sarker
{"title":"基于深度学习和变换器语言模型的网络欺凌文本识别","authors":"Khalid Saifullah, Muhammad Ibrahim Khan, Suhaima Jamal, Iqbal H. Sarker","doi":"10.4108/eetinis.v11i1.4703","DOIUrl":null,"url":null,"abstract":"In the contemporary digital age, social media platforms like Facebook, Twitter, and YouTube serve as vital channels for individuals to express ideas and connect with others. Despite fostering increased connectivity, these platforms have inadvertently given rise to negative behaviors, particularly cyberbullying. While extensive research has been conducted on high-resource languages such as English, there is a notable scarcity of resources for low-resource languages like Bengali, Arabic, Tamil, etc., particularly in terms of language modeling. This study addresses this gap by developing a cyberbullying text identification system called BullyFilterNeT tailored for social media texts, considering Bengali as a test case. The intelligent BullyFilterNeT system devised overcomes Out-of-Vocabulary (OOV) challenges associated with non-contextual embeddings and addresses the limitations of context-aware feature representations. To facilitate a comprehensive understanding, three non-contextual embedding models GloVe, FastText, and Word2Vec are developed for feature extraction in Bengali. These embedding models are utilized in the classification models, employing three statistical models (SVM, SGD, Libsvm), and four deep learning models (CNN, VDCNN, LSTM, GRU). Additionally, the study employs six transformer-based language models: mBERT, bELECTRA, IndicBERT, XML-RoBERTa, DistilBERT, and BanglaBERT, respectively to overcome the limitations of earlier models. Remarkably, BanglaBERT-based BullyFilterNeT achieves the highest accuracy of 88.04% in our test set, underscoring its effectiveness in cyberbullying text identification in the Bengali language.","PeriodicalId":33474,"journal":{"name":"EAI Endorsed Transactions on Industrial Networks and Intelligent Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cyberbullying Text Identification based on Deep Learning and Transformer-based Language Models\",\"authors\":\"Khalid Saifullah, Muhammad Ibrahim Khan, Suhaima Jamal, Iqbal H. Sarker\",\"doi\":\"10.4108/eetinis.v11i1.4703\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the contemporary digital age, social media platforms like Facebook, Twitter, and YouTube serve as vital channels for individuals to express ideas and connect with others. Despite fostering increased connectivity, these platforms have inadvertently given rise to negative behaviors, particularly cyberbullying. While extensive research has been conducted on high-resource languages such as English, there is a notable scarcity of resources for low-resource languages like Bengali, Arabic, Tamil, etc., particularly in terms of language modeling. This study addresses this gap by developing a cyberbullying text identification system called BullyFilterNeT tailored for social media texts, considering Bengali as a test case. The intelligent BullyFilterNeT system devised overcomes Out-of-Vocabulary (OOV) challenges associated with non-contextual embeddings and addresses the limitations of context-aware feature representations. To facilitate a comprehensive understanding, three non-contextual embedding models GloVe, FastText, and Word2Vec are developed for feature extraction in Bengali. These embedding models are utilized in the classification models, employing three statistical models (SVM, SGD, Libsvm), and four deep learning models (CNN, VDCNN, LSTM, GRU). Additionally, the study employs six transformer-based language models: mBERT, bELECTRA, IndicBERT, XML-RoBERTa, DistilBERT, and BanglaBERT, respectively to overcome the limitations of earlier models. Remarkably, BanglaBERT-based BullyFilterNeT achieves the highest accuracy of 88.04% in our test set, underscoring its effectiveness in cyberbullying text identification in the Bengali language.\",\"PeriodicalId\":33474,\"journal\":{\"name\":\"EAI Endorsed Transactions on Industrial Networks and Intelligent Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"EAI Endorsed Transactions on Industrial Networks and Intelligent Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4108/eetinis.v11i1.4703\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"Engineering\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"EAI Endorsed Transactions on Industrial Networks and Intelligent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4108/eetinis.v11i1.4703","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Engineering","Score":null,"Total":0}
引用次数: 0

摘要

在当代数字时代,Facebook、Twitter 和 YouTube 等社交媒体平台成为个人表达想法和与他人联系的重要渠道。尽管这些平台促进了更多的联系,但也在无意中引发了负面行为,尤其是网络欺凌。虽然针对英语等高资源语言开展了大量研究,但针对孟加拉语、阿拉伯语、泰米尔语等低资源语言的资源却明显匮乏,尤其是在语言建模方面。本研究以孟加拉语为测试案例,开发了一个专为社交媒体文本定制的名为 BullyFilterNeT 的网络欺凌文本识别系统,从而填补了这一空白。所设计的智能 BullyFilterNeT 系统克服了与非上下文嵌入相关的词汇缺失(OOV)难题,并解决了上下文感知特征表征的局限性。为了便于全面理解,开发了三种非上下文嵌入模型 GloVe、FastText 和 Word2Vec,用于孟加拉语的特征提取。这些嵌入模型被用于分类模型中,采用了三种统计模型(SVM、SGD、Libsvm)和四种深度学习模型(CNN、VDCNN、LSTM、GRU)。此外,研究还采用了六种基于转换器的语言模型:mBERT、bELECTRA、IndicBERT、XML-RoBERTa、DistilBERT 和 BanglaBERT,以克服早期模型的局限性。值得注意的是,基于 BanglaBERT 的 BullyFilterNeT 在我们的测试集中达到了 88.04% 的最高准确率,凸显了其在孟加拉语网络欺凌文本识别中的有效性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Cyberbullying Text Identification based on Deep Learning and Transformer-based Language Models
In the contemporary digital age, social media platforms like Facebook, Twitter, and YouTube serve as vital channels for individuals to express ideas and connect with others. Despite fostering increased connectivity, these platforms have inadvertently given rise to negative behaviors, particularly cyberbullying. While extensive research has been conducted on high-resource languages such as English, there is a notable scarcity of resources for low-resource languages like Bengali, Arabic, Tamil, etc., particularly in terms of language modeling. This study addresses this gap by developing a cyberbullying text identification system called BullyFilterNeT tailored for social media texts, considering Bengali as a test case. The intelligent BullyFilterNeT system devised overcomes Out-of-Vocabulary (OOV) challenges associated with non-contextual embeddings and addresses the limitations of context-aware feature representations. To facilitate a comprehensive understanding, three non-contextual embedding models GloVe, FastText, and Word2Vec are developed for feature extraction in Bengali. These embedding models are utilized in the classification models, employing three statistical models (SVM, SGD, Libsvm), and four deep learning models (CNN, VDCNN, LSTM, GRU). Additionally, the study employs six transformer-based language models: mBERT, bELECTRA, IndicBERT, XML-RoBERTa, DistilBERT, and BanglaBERT, respectively to overcome the limitations of earlier models. Remarkably, BanglaBERT-based BullyFilterNeT achieves the highest accuracy of 88.04% in our test set, underscoring its effectiveness in cyberbullying text identification in the Bengali language.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
4.00
自引率
0.00%
发文量
15
审稿时长
10 weeks
期刊最新文献
ViMedNER: A Medical Named Entity Recognition Dataset for Vietnamese Distributed Spatially Non-Stationary Channel Estimation for Extremely-Large Antenna Systems On the Performance of the Relay Selection in Multi-hop Cluster-based Wireless Networks with Multiple Eavesdroppers Under Equally Correlated Rayleigh Fading Improving Performance of the Typical User in the Indoor Cooperative NOMA Millimeter Wave Networks with Presence of Walls Real-time Single-Channel EOG removal based on Empirical Mode Decomposition
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1