Systematic Analysis of Hateful Text Detection Using Machine Learning Classifiers

2021 13th International Conference on Information & Communication Technology and System (ICTS), pp. 330-335. Pub Date: 2021-10-20. DOI: 10.1109/ICTS52701.2021.9608010

Tanzina Akter Tani, Tabassum Islam, Sayed Atique Newaz, N. Sultana


Citations: 1

Abstract


Social media is among the most popular platforms on today's internet, and users vent feelings of every kind there (frustration, anger, happiness, and so on), often without regard for moral and social values. The resulting abusive or offensive texts contribute to social disturbances, crimes, and other unethical acts, so there is a pressing need to identify such texts and posts and remove them from social media. Different researchers have proposed different detection approaches in related work. In our work, three classifiers are used to detect hateful text: Naïve Bayes (NB), Random Forest (RF), and Support Vector Machine (SVM). We compare them under two feature extraction methods, Bag of Words (BoW) and TF-IDF, each with unigram and bigram features. The Twitter dataset is under-sampled to balance hateful and clean content, and the text is preprocessed, a step that is essential in NLP for producing better and more accurate results. In our results, Naïve Bayes achieves the highest accuracy (89%) with TF-IDF features, whereas Random Forest achieves the highest accuracy (88%) with BoW features in the unigram case. Overall, unigrams perform much better than bigrams. Finally, we make a number of principal contributions.
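The data-preparation steps named in the abstract (under-sampling, unigram and bigram features, BoW counts, TF-IDF weights) can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation: the function names, the whitespace tokenizer, and the unsmoothed `log(N / df)` IDF variant are all assumptions chosen for brevity, and the abstract does not say which library or weighting scheme the paper used.

```python
import math
import random
from collections import Counter


def undersample(texts, labels, seed=0):
    """Randomly drop examples so every class keeps as many items as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    n = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for text in rng.sample(items, n):
            balanced.append((text, label))
    rng.shuffle(balanced)
    return [t for t, _ in balanced], [l for _, l in balanced]


def ngrams(text, n):
    """Lowercase, split on whitespace (assumed tokenizer), return word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(docs, n=1):
    """Bag of Words: raw n-gram counts for each document."""
    return [Counter(ngrams(doc, n)) for doc in docs]


def tf_idf(docs, n=1):
    """TF-IDF per document, with tf = count / doc length and idf = log(N / df)."""
    counts = bag_of_words(docs, n)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    total_docs = len(docs)
    weighted = []
    for c in counts:
        length = sum(c.values()) or 1
        weighted.append(
            {t: (k / length) * math.log(total_docs / df[t]) for t, k in c.items()}
        )
    return weighted
```

Calling `bag_of_words(docs, n=2)` or `tf_idf(docs, n=2)` switches from unigram to bigram features, which is the comparison the abstract describes. The resulting per-document dictionaries would still need to be converted to fixed-length vectors before being passed to a classifier such as NB, RF, or SVM.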
