An Earth mover's distance-based undersampling approach for handling class-imbalanced data

International Journal of Intelligent Information and Database Systems (JCR Q3, Computer Science) | Pub Date: 2020-08-26 | DOI: 10.1504/ijiids.2020.10031612

G. Rekha, V. Reddy, A. Tyagi

Volume/issue: 15 1; pages: 376-392

Citations: 4

Abstract

Imbalanced datasets make accurate prediction difficult, and most real-world data are imbalanced in nature. Traditional classifiers assume a well-balanced class distribution in the training data, but practical datasets are often imbalanced, which misleads a classifier and degrades its ability to learn from them. Data pre-processing approaches address this concern by using either random undersampling or oversampling techniques. In this paper, we introduce Earth mover's distance (EMD) as a similarity measure to find samples that are similar in nature and eliminate them from the dataset as redundant. Earth mover's distance has received considerable attention in areas such as computer vision, image retrieval, and machine learning. The EMD-based undersampling approach provides a data-level solution that eliminates redundant instances among the majority-class samples without any loss of valuable information. The method is implemented with five conventional classifiers, namely C4.5 decision tree (DT), k-nearest neighbour (k-NN), multilayer perceptron (MLP), support vector machine (SVM), and naive Bayes (NB), and one ensemble technique, AdaBoost. The proposed method yields superior performance on 21 datasets from the KEEL repository.
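The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each sample's feature values are treated as an equal-weight one-dimensional empirical distribution (for which EMD reduces to the mean absolute difference of the sorted values), and that a majority-class sample is redundant when its EMD to an already kept sample is at or below a cutoff. The names `emd_1d` and `emd_undersample`, the greedy keep-or-drop rule, and the `threshold` value are all hypothetical.

```python
from typing import List, Sequence


def emd_1d(u: Sequence[float], v: Sequence[float]) -> float:
    """EMD between two equal-sized, equal-weight 1-D empirical distributions.

    For this special case the optimal transport plan matches sorted values,
    so the distance is the mean absolute difference of the sorted samples.
    """
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def emd_undersample(majority: List[Sequence[float]],
                    threshold: float) -> List[Sequence[float]]:
    """Greedy filter (an assumption, not the paper's procedure).

    A majority sample is kept only if its EMD to every sample kept so far
    is strictly greater than `threshold`; otherwise it is treated as
    redundant and dropped.
    """
    kept: List[Sequence[float]] = []
    for sample in majority:
        if all(emd_1d(sample, k) > threshold for k in kept):
            kept.append(sample)
    return kept


# Toy majority class: samples 1, 2 and 4 are near-duplicates of each other.
majority = [
    [0.10, 0.20, 0.30],
    [0.10, 0.21, 0.30],
    [0.90, 0.80, 0.70],
    [0.11, 0.20, 0.29],
]
reduced = emd_undersample(majority, threshold=0.05)
print(len(reduced))  # 2
```

One limitation of this simplification is that sorting discards which feature each value belongs to, so two samples whose values are permutations of each other have distance zero. A practical version would need a per-feature or histogram-based formulation and a principled choice of cutoff.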

Source journal

International Journal of Intelligent Information and Database Systems Computer Science-Information Systems

CiteScore

2.90

Self-citation rate

0.00%

Journal description: Intelligent information systems and intelligent database systems are a rapidly developing field in computer science. IJIIDS provides a medium for exchanging scientific research and technological achievements accomplished by the international community. It focuses on research in applications of advanced intelligent technologies for data storing and processing in a wide-ranging context. The issues addressed by IJIIDS involve solutions to real-life problems in which intelligent technologies must be applied to achieve effective results. The emphasis of the reported work is on new and original research and technological developments rather than reports on the application of existing technology to different sets of data.