Study of Undersampling Method: Instance Hardness Threshold with Various Estimators for Hate Speech Classification

Naufal Azmi Verdikha, T. B. Adji, A. E. Permanasari
Journal: IJITEE (International Journal of Information Technology and Electrical Engineering)
Published: 2018-12-26
DOI: 10.22146/IJITEE.42152 (https://doi.org/10.22146/IJITEE.42152)
Citations: 7

Abstract

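A text classification system is needed to address the problem of hate speech on social media. However, hate speech texts are very hard to find on social media, which leaves the training data with an imbalanced class distribution. Classification on imbalanced data performs poorly. Several methods exist to address this problem; one of them is undersampling with the Instance Hardness Threshold (IHT) method. IHT balances the dataset by eliminating samples that are frequently misclassified. To find those samples, IHT requires an estimator, which is a classifier. This research compares estimators for the IHT method as a way to solve the imbalanced data problem in hate speech classification with TF-IDF weighting. Three criteria determine the best IHT variant: the class ratio of the dataset after undersampling, the time taken by the undersampling process, and the Index of Balanced Accuracy (IBA). The results show that IHT with Logistic Regression, IHT(LR), has the fastest undersampling process (1.91 s), yields a perfectly balanced dataset with a 1:1 class ratio, and achieves the best IBA evaluation across all estimation processes. This makes IHT(LR) the best method for solving the imbalanced data problem in hate speech classification.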
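The two ideas the abstract relies on, dropping hard-to-classify samples and scoring with IBA, can be illustrated with a minimal sketch in plain Python. It is not the authors' code. The function names `iht_keep_indices` and `iba_gmean`, the toy probabilities, and the default `alpha=0.1` are assumptions made for illustration. The selection step is a simplification: it keeps the top-k most confidently classified samples per class instead of applying a probability threshold, and it takes the estimator's probabilities as given rather than fitting a classifier. The IBA function uses the common form (1 + alpha * (TPR - TNR)) * TPR * TNR, which weights the squared geometric mean by the dominance between the two class accuracies.

```python
from collections import Counter


def iht_keep_indices(labels, true_class_proba):
    """Simplified instance-hardness selection (illustrative, top-k variant).

    labels            -- class label of each sample
    true_class_proba  -- estimator's probability for each sample's TRUE class
                         (assumed to come from cross-validated predictions)
    Returns the sorted indices of the samples to keep.
    """
    counts = Counter(labels)
    minority_size = min(counts.values())
    keep = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        # High probability of the true class means low instance hardness,
        # so the hardest samples end up last and are dropped first.
        idx.sort(key=lambda i: true_class_proba[i], reverse=True)
        keep.extend(idx[:minority_size])
    return sorted(keep)


def iba_gmean(y_true, y_pred, positive=1, alpha=0.1):
    """Index of Balanced Accuracy applied to the squared geometric mean.

    Assumes a binary problem in which both classes occur in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn)  # accuracy on the positive class
    tnr = tn / (tn + fp)  # accuracy on the negative class
    dominance = tpr - tnr
    return (1 + alpha * dominance) * tpr * tnr


# Toy data: four majority samples (0) and two minority samples (1).
labels = [0, 0, 0, 0, 1, 1]
proba = [0.9, 0.2, 0.8, 0.4, 0.7, 0.6]
kept = iht_keep_indices(labels, proba)
print(kept)                                  # indices of retained samples
print(Counter(labels[i] for i in kept))      # class counts after undersampling
```

Because this sketch selects exactly as many samples per class as the minority class contains, it always returns a 1:1 ratio. A variant that applies a threshold to the probabilities can leave the classes slightly uneven, for example when several samples share the same probability. For practical use, the imbalanced-learn library provides an `InstanceHardnessThreshold` sampler that accepts a scikit-learn classifier as its estimator.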