Systematic Analysis of Hateful Text Detection Using Machine Learning Classifiers

Tanzina Akter Tani, Tabassum Islam, Sayed Atique Newaz, N. Sultana
2021 13th International Conference on Information & Communication Technology and System (ICTS), pp. 330-335. Published: 2021-10-20. DOI: 10.1109/ICTS52701.2021.9608010 (https://doi.org/10.1109/ICTS52701.2021.9608010)
Citations: 1

Abstract

In today's internet-based world, social media is one of the most popular platforms on which users can vent their feelings, whether frustration, anger, or happiness, without regard for moral and social values. The resulting abusive or offensive texts cause social disturbances, crimes, and many unethical deeds, so there is a strong need to identify such texts and posts and remove them from social media. Researchers have proposed a variety of text detection approaches in related work. In our proposed work, three classifiers are used to detect hateful text: Naïve Bayes (NB), Random Forest (RF), and Support Vector Machine (SVM). The three classifiers are compared using Bag of Words (BoW) and TF-IDF feature extraction, for both unigrams and bigrams. To balance hateful and clean content, the Twitter dataset has been under-sampled. Text preprocessing, which is essential for NLP to produce better and more accurate results, has also been carried out in this work. In our results, Naïve Bayes achieved the highest accuracy (89%) with TF-IDF features, whereas Random Forest achieved the highest accuracy (88%) with BoW features in the unigram case. Overall, we obtained much better performance with unigrams than with bigrams. Finally, we made a number of principal contributions.
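
The abstract names the pipeline stages (preprocessing, under-sampling, BoW and TF-IDF features over unigrams or bigrams) but gives no implementation details. The following is a minimal standard-library sketch of those stages, not the authors' code. All function names are hypothetical, the cleaning rules are assumed, and the TF-IDF weighting shown (term frequency times log(N / df)) is one common variant chosen for illustration. The classifiers themselves (NB, RF, SVM) are not implemented here; the resulting feature dictionaries would be passed to a library implementation.

```python
import math
import random
import re
from collections import Counter


def preprocess(text):
    """Lowercase, strip URLs, @mentions and non-letters, then tokenize."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def ngrams(tokens, n):
    """Return n-grams as strings (n=1 for unigrams, n=2 for bigrams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(docs, n=1):
    """Map each document to its raw n-gram counts (Bag of Words)."""
    return [Counter(ngrams(preprocess(d), n)) for d in docs]


def tf_idf(docs, n=1):
    """Weight each n-gram by tf * log(N / df), an assumed TF-IDF variant."""
    counts = bag_of_words(docs, n)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    total = len(docs)
    weighted = []
    for c in counts:
        length = sum(c.values()) or 1
        weighted.append(
            {t: (k / length) * math.log(total / df[t]) for t, k in c.items()}
        )
    return weighted


def undersample(texts, labels, seed=0):
    """Randomly drop examples so every class matches the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    smallest = min(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend((t, label) for t in rng.sample(items, smallest))
    rng.shuffle(balanced)
    return [t for t, _ in balanced], [l for _, l in balanced]
```

Calling `tf_idf(docs, n=1)` or `bag_of_words(docs, n=2)` on the balanced output of `undersample` gives the four feature settings the abstract compares (BoW or TF-IDF, unigram or bigram). Note that a term occurring in every document receives weight zero under this TF-IDF variant.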