Study of Undersampling Method: Instance Hardness Threshold with Various Estimators for Hate Speech Classification

IJITEE (International Journal of Information Technology and Electrical Engineering), Pub Date: 2018-12-26, DOI: 10.22146/IJITEE.42152

Naufal Azmi Verdikha, T. B. Adji, A. E. Permanasari


Citations: 7

Abstract

A text classification system is needed to address the problem of hate speech on social media. However, hate speech texts are hard to find on social media, which makes the class distribution of the training data imbalanced. Classification on imbalanced data tends to perform poorly. Several methods exist to address this problem; one of them is undersampling with the Instance Hardness Threshold (IHT) method. IHT balances the dataset by eliminating instances that are frequently misclassified. To find those instances, IHT requires an estimator, which is a classifier. This research compares estimators for the IHT method on the imbalanced data problem in hate speech classification, using the TF-IDF weighting method. The best IHT method is determined from the class ratio of the dataset after undersampling, the time taken by the undersampling process, and the Index of Balanced Accuracy (IBA). The results show that IHT with Logistic Regression, IHT(LR), has the fastest undersampling process (1.91 s), produces a perfectly balanced dataset with a class ratio of 1:1, and achieves the best IBA in all estimation processes. These results make IHT(LR) the best of the compared methods for addressing the imbalanced data problem in hate speech classification.
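The two ideas the abstract relies on, removing hard-to-classify instances and scoring with IBA, can be illustrated with a minimal sketch. This is not the authors' code. The function names `instance_hardness`, `iht_undersample`, and `iba` are hypothetical, and the sketch assumes a binary problem in which the probability of each instance's true class has already been obtained, for example from cross-validated predictions of the chosen estimator. The removal rule used here (drop the hardest majority-class instances until the classes are balanced) is a simplification; library implementations may apply a hardness threshold instead, which need not yield an exact 1:1 ratio. The IBA function follows the commonly used form (1 + alpha * (TPR - TNR)) * TPR * TNR, with alpha = 0.1 taken as an assumed default.

```python
from collections import Counter


def instance_hardness(true_class_probs):
    """Hardness of each instance, taken here as 1 - P(true class | x)."""
    return [1.0 - p for p in true_class_probs]


def iht_undersample(labels, true_class_probs):
    """Return the sorted indices kept after undersampling the majority class.

    Assumption: every non-majority instance is kept, and majority
    instances are kept easiest-first until the majority class is as
    large as the smallest class.
    """
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    target = min(counts.values())
    hardness = instance_hardness(true_class_probs)

    # Rank majority-class instances from easiest to hardest.
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    majority_idx.sort(key=lambda i: hardness[i])

    kept = [i for i, y in enumerate(labels) if y != majority]
    kept.extend(majority_idx[:target])
    return sorted(kept)


def iba(tp, fn, tn, fp, alpha=0.1):
    """Index of Balanced Accuracy applied to the squared geometric mean."""
    tpr = tp / (tp + fn)  # sensitivity (true positive rate)
    tnr = tn / (tn + fp)  # specificity (true negative rate)
    dominance = tpr - tnr
    return (1 + alpha * dominance) * tpr * tnr


if __name__ == "__main__":
    # Toy data: six majority-class (0) and two minority-class (1) instances.
    labels = [0, 0, 0, 0, 0, 0, 1, 1]
    probs = [0.9, 0.4, 0.8, 0.3, 0.95, 0.6, 0.7, 0.2]
    kept = iht_undersample(labels, probs)
    print(kept, Counter(labels[i] for i in kept))
    print(iba(tp=8, fn=2, tn=9, fp=1))
```

In this sketch the choice of estimator only enters through `true_class_probs`, which is why different estimators can produce different retained subsets and different undersampling times.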



