Transformer Models for Recognizing Abusive Language An investigation and review on Tweeteval and SOLID dataset

2023 Second International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT) Pub Date : 2023-04-05 DOI:10.1109/ICEEICT56924.2023.10157848

Fabeela Ali Rawther, Geevarghese Titus



Abstract

Social networking communities have become very popular among children and the elderly alike. In this era of social media, comments, opinions, reviews and other communications flow through the most common messaging platforms, such as Twitter, the Meta-owned WhatsApp, Facebook and Instagram, Snapchat, Telegram and YouTube comments. In this paper we review the different methods and models used to identify offensive language across different datasets. Offensive language detection is a difficult task because it is country and language specific, and the corpora used to identify offensive and abusive content do not cover all word usages. We present a comparative study of different text-based methods for detecting whether a post is offensive. The detection of abusive language remains an unsolved and challenging problem for researchers in Natural Language Processing (NLP). Abusive language has become one of the reasons for increased levels of mental instability among people from teenagers to the elderly, and crime via social media has increased considerably compared with earlier days. Studies and surveys show that recognizing the structure and context of the language is the best way to solve this problem to an extent. The paper aims to compare four recent transformer models, pretrained and fine-tuned for offensive language detection on the TweetEval dataset: DistilBERT, RoBERTa, DistilRoBERTa and DeBERTa. All the models were limited in performance by the size of the training data used, but were optimized by tuning hyperparameters during training. The models are limited to English-language offensive words, and recent work is ongoing in the area of multilingual tweets for both text and speech processing.
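A comparison of several fine-tuned classifiers on the same test set can be sketched as below. This is only an illustration under stated assumptions: the abstract does not say which metric was used, so macro-averaged F1 is assumed here; the label convention (0 = not offensive, 1 = offensive), the function names `macro_f1` and `rank_models`, and the placeholder names `model_a` and `model_b` are all hypothetical, and the toy labels are made-up inputs, not results from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def macro_f1(gold: Sequence[int], pred: Sequence[int],
             labels: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores.

    Assumption: 0 = not offensive, 1 = offensive.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores: List[float] = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def rank_models(gold: Sequence[int],
                predictions: Dict[str, Sequence[int]]) -> List[Tuple[str, float]]:
    """Score every model's predictions and sort from best to worst."""
    scored = [(name, macro_f1(gold, pred)) for name, pred in predictions.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up toy labels for illustration only (not data from the paper).
    gold = [1, 0, 0, 1, 0, 1]
    toy_predictions = {
        "model_a": [1, 0, 0, 1, 0, 0],  # hypothetical placeholder name
        "model_b": [1, 1, 1, 1, 0, 1],  # hypothetical placeholder name
    }
    for name, score in rank_models(gold, toy_predictions):
        print(f"{name}: macro-F1 = {score:.3f}")
```

Macro averaging weights the offensive and non-offensive classes equally, which matters when offensive posts are the minority class; a different metric could be substituted in `rank_models` without changing the rest of the sketch.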
