Usage of user hate speech index for improving hate speech detection in Twitter posts

Ehlimana Krupalija, D. Donko, H. Supic
Published in: 2022 XXVIII International Conference on Information, Communication and Automation Technologies (ICAT)
Publication date: 2022-06-16
DOI: 10.1109/ICAT54566.2022.9811159 (https://doi.org/10.1109/ICAT54566.2022.9811159)
Citation count: 1

Abstract

Social media is an important source of real-world data for sentiment analysis. Hate speech detection models can be trained on Twitter data and then used to filter content and remove posts that contain hate speech. This work proposes a new algorithm for calculating a user hate speech index based on the user's post history. Three available datasets were merged to obtain Twitter posts containing hate speech. Text preprocessing and tokenization were performed, as well as outlier removal and class balancing. The proposed algorithm was used to determine the hate speech index of the users who posted the tweets in the dataset. The preprocessed dataset was used for training and testing multiple machine learning models: k-means clustering with and without principal component analysis, naïve Bayes, decision tree and random forest. Four different feature subsets of the dataset were used for model training and testing. Anomaly detection, data transformation and parameter tuning were used in an attempt to improve classification accuracy. The highest F1 measure was achieved by training the model on a combination of the user hate speech index and other user features. The results show that using the user hate speech index, with or without other user features, improves the accuracy of hate speech detection.
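The abstract does not spell out how the user hate speech index is computed. As a rough illustration only, the sketch below assumes the simplest plausible definition: the fraction of a user's labelled posts that are marked as hate speech. The function name `user_hate_speech_index`, the `(user_id, is_hate)` input format and the sample data are all hypothetical and should not be read as the algorithm proposed in the paper.

```python
from collections import defaultdict


def user_hate_speech_index(posts):
    """Map each user to the share of their posts labelled as hate speech.

    `posts` is an iterable of (user_id, is_hate) pairs. This ratio is an
    illustrative assumption, not the algorithm proposed in the paper.
    """
    totals = defaultdict(int)
    hateful = defaultdict(int)
    for user_id, is_hate in posts:
        totals[user_id] += 1
        if is_hate:
            hateful[user_id] += 1
    # Every user in `totals` has at least one post, so no division by zero.
    return {user: hateful[user] / totals[user] for user in totals}


# Hypothetical labelled post history: (user_id, is_hate)
history = [
    ("alice", False),
    ("alice", False),
    ("bob", True),
    ("bob", False),
    ("bob", True),
    ("carol", True),
]

index = user_hate_speech_index(history)
```

A value computed this way could be attached to each tweet as an extra feature next to the text features. If it were, the index would need to be derived only from posts outside the test split, since otherwise the label of a test tweet would leak into its own feature.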