Deep Pre-trained Contrastive Self-Supervised Learning: A Cyberbullying Detection Approach with Augmented Datasets

2022 14th International Conference on Computational Intelligence and Communication Networks (CICN). Pub Date: 2022-12-04. DOI: 10.1109/CICN56167.2022.10008274

Lulwah M. Al-Harigy, H. Al-Nuaim, N. Moradpoor, Zhiyuan Tan

Abstract

Cyberbullying is a widespread problem that has grown in recent years with the heavy dependence on social media. Although there are many approaches for detecting cyberbullying, they still need to be improved for more accurate detection. We need new approaches that understand the context of the words used in cyberbullying by generating different representations of each word. In addition, there is a large amount of unlabelled data on the Internet that needs to be labelled for a more accurate detection process. Even though multiple methods for annotating datasets exist, the most widely used are still manual approaches, using either experts or crowdsourcing. However, the time needed and the high labor cost of manual annotation result in a lack of annotated social network datasets for training a robust cyberbullying detector. Automated approaches, such as Self-Supervised Learning (SSL) models, can be relied upon for labelling data. In this paper, we propose an approach with two main parts. The first part is a parallel BERT + Bi-LSTM model used for detecting cyberbullying terms. The second part uses Contrastive Self-Supervised Learning (a form of SSL) to augment the training set from unlabelled data using a small portion of another manually annotated dataset. Our proposed model, which uses deep pre-trained contrastive self-supervised learning to detect cyberbullying with augmented datasets, achieved a macro-average F1 score of 0.9311. This result shows that our model outperformed the baseline models, the top three teams in the SemEval-2020 Task 12 competition.
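The reported metric, macro-average F1, is the unweighted mean of the per-class F1 scores, so a minority class counts as much as the majority class. The following is a minimal sketch of that computation in plain Python. It is not the authors' evaluation code; the function names, the `OFF`/`NOT` labels, and the toy predictions are assumptions chosen only for illustration.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so the class F1 is taken as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, lb) for lb in labels) / len(labels)


# Hypothetical toy data: two offensive posts and four non-offensive posts.
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]
print(macro_f1(y_true, y_pred))  # per-class F1 is 0.5 (OFF) and 0.75 (NOT)
```

In this toy case the macro average is 0.625, while plain accuracy would be 4/6, which illustrates why macro-average F1 is a stricter measure on imbalanced data.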
