An Earth mover's distance-based undersampling approach for handling class-imbalanced data

G. Rekha, V. Reddy, A. Tyagi
Published in International Journal of Intelligent Information and Database Systems, vol. 15 1, pp. 376-392, 2020-08-26 (journal article; JCR Q3, Computer Science). DOI: https://doi.org/10.1504/ijiids.2020.10031612
Citation count: 4

Abstract

Imbalanced datasets typically make accurate prediction difficult, and most real-world data are imbalanced. Traditional classifiers assume a well-balanced class distribution in the training data, but practical datasets are often imbalanced, which misleads a classifier and degrades its ability to learn. Data pre-processing approaches address this concern with either random undersampling or random oversampling. In this paper, we introduce the Earth mover's distance (EMD) as a similarity measure for finding samples that are similar in nature and eliminating them from the dataset as redundant. EMD has received considerable attention in areas such as computer vision, image retrieval and machine learning. The EMD-based undersampling approach provides a data-level solution that eliminates redundant instances among the majority samples without any loss of valuable information. The method is implemented with five conventional classifiers, namely the C4.5 decision tree (DT), k-nearest neighbour (k-NN), multilayer perceptron (MLP), support vector machine (SVM) and naive Bayes (NB), and with one ensemble technique, AdaBoost. The proposed method yields superior performance on 21 datasets from the KEEL repository.
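
The abstract does not say how EMD is computed between samples or how redundancy is decided, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes each majority sample is a non-negative feature vector treated as a histogram over unit-spaced bins, and it uses a greedy rule with a hypothetical `threshold` parameter: a sample is dropped when its EMD to an already kept sample falls below the threshold. The names `emd_1d` and `emd_undersample` are illustrative only.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two non-negative vectors viewed as histograms over the
    same unit-spaced bins (an assumption of this sketch, not of the paper)."""
    sp, sq = sum(p), sum(q)
    if sp <= 0 or sq <= 0:
        raise ValueError("each histogram needs positive total mass")
    # Normalise both vectors to unit mass; in one dimension the EMD is the
    # sum of absolute differences between the two cumulative distributions.
    diffs = (a / sp - b / sq for a, b in zip(p, q))
    return sum(abs(c) for c in accumulate(diffs))


def emd_undersample(majority, threshold):
    """Greedy filter (hypothetical rule): keep a sample only if its EMD to
    every sample already kept is at least `threshold`."""
    kept = []
    for sample in majority:
        if all(emd_1d(sample, k) >= threshold for k in kept):
            kept.append(sample)
    return kept


# Toy majority class; the second row is a near-duplicate of the first.
majority = [
    [1.0, 2.0, 1.0],
    [1.0, 2.0, 1.1],
    [4.0, 0.5, 0.5],
    [0.5, 0.5, 4.0],
]
reduced = emd_undersample(majority, threshold=0.05)
print(len(majority), "->", len(reduced), "majority samples")
```

On this toy data the near-duplicate second row is the only one discarded. In practice the reduced majority set would be combined with the untouched minority samples before training a classifier, and the threshold would need tuning per dataset.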
Source journal metrics: CiteScore 2.90; self-citation rate 0.00%; articles published: 21
About the journal: Intelligent information systems and intelligent database systems form a rapidly developing field of computer science. IJIIDS provides a medium for exchanging scientific research and technological achievements from the international community. It focuses on applications of advanced intelligent technologies for storing and processing data in a wide range of contexts. The issues it addresses involve solutions to real-life problems that require intelligent technologies to achieve effective results. The emphasis is on new and original research and technological developments rather than on applications of existing technology to different datasets.