Undersampling based on generalized learning vector quantization and natural nearest neighbors for imbalanced data

IF 2.7 | Region 3 (Computer Science) | JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | International Journal of Machine Learning and Cybernetics | Pub Date: 2024-07-03 | DOI: 10.1007/s13042-024-02261-w

Long-Hui Wang, Qi Dai, Jia-You Wang, Tony Du, Lifang Chen


Citations: 0

Abstract

Imbalanced datasets can degrade classifier performance. Conventional undersampling may discard essential information, while oversampling may introduce noise. To address this, we propose an undersampling algorithm called GLNDU (Generalized Learning Vector Quantization and Natural Nearest Neighbors-based Undersampling). GLNDU uses Generalized Learning Vector Quantization (GLVQ) to compute the centroids of the positive and negative instances, and uses the concept of Natural Nearest Neighbors to identify majority-class instances in the overlapping region of the centroids of the minority-class instances. These majority-class instances are then removed, yielding a new, balanced training dataset that is used to train a base classifier. We conduct extensive experiments on 29 publicly available datasets and evaluate performance using AUC and G_mean. GLNDU shows significant advantages over established methods across different types of classifiers, such as SVM, CART, and KNN. The Friedman ranking and the Nemenyi post-hoc test further support the experimental findings.
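The pipeline in the abstract (prototypes from GLVQ, a natural-nearest-neighbor search, then removal of majority instances in the overlap) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not state the number of prototypes, the GLVQ settings, or the exact removal rule. The sketch assumes one prototype per class, an identity activation in the GLVQ update, and a removal rule of our own choosing (a majority instance is dropped if it has a minority natural neighbor and lies closer to the minority prototype than to the majority one). All function names (`train_glvq`, `natural_neighbors`, `glndu_like_undersample`) and parameters are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_glvq(X, y, epochs=30, lr=0.05):
    """One prototype per class (assumption): start at the class mean,
    then refine with GLVQ gradient steps (identity activation)."""
    protos = {}
    for c in set(y):
        pts = [x for x, t in zip(X, y) if t == c]
        protos[c] = [sum(col) / len(pts) for col in zip(*pts)]
    for _ in range(epochs):
        for x, t in zip(X, y):
            other = min((c for c in protos if c != t),
                        key=lambda c: sq_dist(x, protos[c]))
            dp, dm = sq_dist(x, protos[t]), sq_dist(x, protos[other])
            denom = (dp + dm) ** 2
            if denom == 0:
                continue
            # Descend mu = (dp - dm) / (dp + dm): pull the correct
            # prototype toward x, push the nearest wrong one away.
            protos[t] = [w + lr * dm / denom * (xi - w)
                         for w, xi in zip(protos[t], x)]
            protos[other] = [w - lr * dp / denom * (xi - w)
                             for w, xi in zip(protos[other], x)]
    return protos


def natural_neighbors(points):
    """Grow k until every point has a reverse neighbor, or the number of
    points without one stops changing; return mutual neighbors per point."""
    n = len(points)
    order = [sorted((j for j in range(n) if j != i),
                    key=lambda j: sq_dist(points[i], points[j]))
             for i in range(n)]
    knn = [set() for _ in range(n)]
    rnn = [set() for _ in range(n)]
    prev_orphans = None
    for k in range(n - 1):
        for i in range(n):
            j = order[i][k]
            knn[i].add(j)
            rnn[j].add(i)
        orphans = sum(1 for r in rnn if not r)
        if orphans == 0 or orphans == prev_orphans:
            break
        prev_orphans = orphans
    return [knn[i] & rnn[i] for i in range(n)]


def glndu_like_undersample(X, y, minority, majority):
    """Return indices to keep. The removal rule is an assumption, not
    the rule from the paper."""
    protos = train_glvq(X, y)
    nn = natural_neighbors(X)
    keep = []
    for i, (x, t) in enumerate(zip(X, y)):
        has_minority_nn = any(y[j] == minority for j in nn[i])
        nearer_minority = (sq_dist(x, protos[minority])
                           < sq_dist(x, protos[majority]))
        if t == majority and has_minority_nn and nearer_minority:
            continue  # treated as lying in the overlap region
        keep.append(i)
    return keep


# Toy data: index 7 is a majority point sitting inside the minority cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 1), (1, 2),
     (5.2, 5.4),
     (5, 5), (5, 6), (6, 5)]
y = ["maj"] * 8 + ["min"] * 3
keep = glndu_like_undersample(X, y, minority="min", majority="maj")
print(keep)
```

The returned indices would then select the reduced training set for whichever base classifier is used. In this toy data the point at index 7 is the one the rule is meant to drop; minority instances are never removed.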




Source journal

International Journal of Machine Learning and Cybernetics (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 7.90

Self-citation rate: 10.70%

Articles published: 225

About the journal: Cybernetics is concerned with describing complex interactions and interrelationships between systems which are omnipresent in our daily life. Machine Learning discovers fundamental functional relationships between variables and ensembles of variables in systems. The merging of the disciplines of Machine Learning and Cybernetics is aimed at the discovery of various forms of interaction between systems through diverse mechanisms of learning from data. The International Journal of Machine Learning and Cybernetics (IJMLC) focuses on the key research problems emerging at the junction of machine learning and cybernetics and serves as a broad forum for rapid dissemination of the latest advancements in the area. The emphasis of IJMLC is on the hybrid development of machine learning and cybernetics schemes inspired by different contributing disciplines such as engineering, mathematics, cognitive sciences, and applications. New ideas, design alternatives, implementations and case studies pertaining to all the aspects of machine learning and cybernetics fall within the scope of the IJMLC. Key research areas to be covered by the journal include: machine learning for modeling interactions between systems; pattern recognition technology to support discovery of system-environment interaction; control of system-environment interactions; biochemical interaction in biological and biologically-inspired systems; and learning for improvement of communication schemes between systems.