Usage of user hate speech index for improving hate speech detection in Twitter posts

2022 XXVIII International Conference on Information, Communication and Automation Technologies (ICAT). Pub Date: 2022-06-16. DOI: 10.1109/ICAT54566.2022.9811159

Ehlimana Krupalija, D. Donko, H. Supic


Citations: 1

Abstract

Social media is an important source of real-world data for sentiment analysis. Hate speech detection models can be trained on Twitter data and then used for content filtering and for removing posts that contain hate speech. This work proposes a new algorithm for calculating a user hate speech index based on user post history. Three available datasets were merged to obtain Twitter posts containing hate speech. Text preprocessing and tokenization were performed, as well as outlier removal and class balancing. The proposed algorithm was used to determine the hate speech index of the users who posted the tweets in the dataset. The preprocessed dataset was used for training and testing multiple machine learning models: k-means clustering with and without principal component analysis, naïve Bayes, decision tree and random forest. Four different feature subsets of the dataset were used for model training and testing. Anomaly detection, data transformation and parameter tuning were used in an attempt to improve classification accuracy. The highest F1 measure was achieved by training the model on a combination of the user hate speech index and other user features. The results show that using the user hate speech index, with or without other user features, improves the accuracy of hate speech detection.
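
The abstract does not state how the user hate speech index is computed, so the following is only a minimal sketch of the general idea of deriving a per-user score from post history. It assumes, hypothetically, that the index is the fraction of a user's labelled posts that are hate speech. The function name `user_hate_speech_index`, the `(user_id, is_hate)` input format, and the sample data are all illustrative assumptions, not the authors' algorithm or dataset.

```python
from collections import defaultdict


def user_hate_speech_index(posts):
    """Return {user_id: fraction of that user's posts labelled as hate speech}.

    ASSUMPTION: the abstract does not give the formula; this simple ratio is a
    hypothetical stand-in, not the algorithm proposed in the paper.
    `posts` is an iterable of (user_id, is_hate) pairs.
    """
    totals = defaultdict(int)   # number of posts seen per user
    hateful = defaultdict(int)  # number of those posts labelled as hate speech
    for user_id, is_hate in posts:
        totals[user_id] += 1
        if is_hate:
            hateful[user_id] += 1
    return {user: hateful[user] / totals[user] for user in totals}


# Hypothetical labelled post history: (user_id, is_hate)
history = [
    ("u1", True), ("u1", False), ("u1", True), ("u1", True),
    ("u2", False), ("u2", False),
    ("u3", True),
]

index = user_hate_speech_index(history)
print(index)
```

A score of this kind could then be appended to each tweet's feature vector before training a classifier. In such a setup it would be sensible to compute the index only from posts other than the one being classified, so that the label of the target post does not leak into its own features.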


