Dual Approach to Handling Imbalanced Class in Datasets Using Oversampling and Ensemble Learning Techniques

Yoga Pristyanto, A. F. Nugraha, Irfan Pratama, Akhmad Dahlan, Lucky Adhikrisna Wirasakti
Venue: 2021 15th International Conference on Ubiquitous Information Management and Communication (IMCOM)
Publication date: 2021-01-04
DOI: 10.1109/IMCOM51814.2021.9377420
Citations: 2

Abstract

In machine learning, class imbalance in a dataset causes the resulting model to perform below its potential. In theory, single classifiers are weak under class imbalance because most of them tend to learn the patterns of the majority class when the dataset is not balanced, so their performance cannot be maximized. This study introduces two approaches to deal with class imbalance in a dataset. The first approach uses ADASYN for resampling, while the second uses the Stacking algorithm as meta-learning. In tests on 5 datasets with different imbalance ratios, the proposed method produced the highest g-mean and AUC scores compared with the other classification algorithms. The proposed method stacks the SVM and Random Forest algorithms and adds ADASYN in the resampling process. Hence, the proposed method can be a solution for handling class imbalance in datasets. A limitation of this study is that the datasets used have a binary class category. For future work, testing on imbalanced multiclass datasets is suggested.