Long-Hui Wang, Qi Dai, Jia-You Wang, Tony Du, Lifang Chen
{"title":"基于广义学习向量量化和自然近邻的不平衡数据去采样","authors":"Long-Hui Wang, Qi Dai, Jia-You Wang, Tony Du, Lifang Chen","doi":"10.1007/s13042-024-02261-w","DOIUrl":null,"url":null,"abstract":"<p>Imbalanced datasets can adversely affect classifier performance. Conventional undersampling approaches may lead to the loss of essential information, while oversampling techniques could introduce noise. To address this challenge, we propose an undersampling algorithm called GLNDU (Generalized Learning Vector Quantization and Natural Nearest Neighbors-based Undersampling). GLNDU utilizes Generalized Learning Vector Quantization (GLVQ) for computing the centroids of positive and negative instances. It also utilizes the concept of Natural Nearest Neighbors to identify majority-class instances in the overlapping region of the centroids of minority-class instances. Afterwards, these majority-class instances are removed, resulting in a new balanced training dataset that is used to train a foundational classifier. We conduct extensive experiments on 29 publicly available datasets, evaluating the performance using AUC and G_mean values. GLNDU demonstrates significant advantages over established methods such as SVM, CART, and KNN across different types of classifiers. Additionally, the results of the Friedman ranking and Nemenyi post-hoc test provide additional support for the findings obtained from the experiments.</p>","PeriodicalId":51327,"journal":{"name":"International Journal of Machine Learning and Cybernetics","volume":"157 1","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Undersampling based on generalized learning vector quantization and natural nearest neighbors for imbalanced data\",\"authors\":\"Long-Hui Wang, Qi Dai, Jia-You Wang, Tony Du, Lifang Chen\",\"doi\":\"10.1007/s13042-024-02261-w\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Imbalanced datasets can adversely affect classifier performance. Conventional undersampling approaches may lead to the loss of essential information, while oversampling techniques could introduce noise. To address this challenge, we propose an undersampling algorithm called GLNDU (Generalized Learning Vector Quantization and Natural Nearest Neighbors-based Undersampling). GLNDU utilizes Generalized Learning Vector Quantization (GLVQ) for computing the centroids of positive and negative instances. It also utilizes the concept of Natural Nearest Neighbors to identify majority-class instances in the overlapping region of the centroids of minority-class instances. Afterwards, these majority-class instances are removed, resulting in a new balanced training dataset that is used to train a foundational classifier. We conduct extensive experiments on 29 publicly available datasets, evaluating the performance using AUC and G_mean values. GLNDU demonstrates significant advantages over established methods such as SVM, CART, and KNN across different types of classifiers. 
Additionally, the results of the Friedman ranking and Nemenyi post-hoc test provide additional support for the findings obtained from the experiments.</p>\",\"PeriodicalId\":51327,\"journal\":{\"name\":\"International Journal of Machine Learning and Cybernetics\",\"volume\":\"157 1\",\"pages\":\"\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2024-07-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Machine Learning and Cybernetics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s13042-024-02261-w\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Machine Learning and Cybernetics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s13042-024-02261-w","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Undersampling based on generalized learning vector quantization and natural nearest neighbors for imbalanced data
Imbalanced datasets can adversely affect classifier performance. Conventional undersampling approaches may lead to the loss of essential information, while oversampling techniques could introduce noise. To address this challenge, we propose an undersampling algorithm called GLNDU (Generalized Learning Vector Quantization and Natural Nearest Neighbors-based Undersampling). GLNDU uses Generalized Learning Vector Quantization (GLVQ) to compute the centroids of the positive and negative classes, and applies the concept of Natural Nearest Neighbors to identify majority-class instances that fall in the overlapping region around the minority-class centroid. These majority-class instances are then removed, yielding a new, balanced training set on which a base classifier is trained. We conduct extensive experiments on 29 publicly available datasets, evaluating performance with AUC and G-mean. GLNDU demonstrates significant advantages over established methods when evaluated with different types of base classifiers, including SVM, CART, and KNN. Additionally, the results of the Friedman ranking and the Nemenyi post-hoc test further support the experimental findings.
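The abstract outlines a three-step pipeline: learn class centroids with GLVQ, use natural nearest neighbors to flag majority-class instances in the class-overlap region, and drop them before training a base classifier. The sketch below is illustrative only and is not the authors' implementation: it substitutes a simplified single-prototype GLVQ update for the paper's GLVQ step, uses a basic natural-neighbor search that grows k until every point has a reverse neighbor, and approximates the overlap test with centroid distances and neighborhood labels. Function names such as glvq_prototypes, natural_neighbor_k, and glndu_undersample are hypothetical; scikit-learn is assumed for neighbor queries and the base classifier.

```python
# Minimal, illustrative GLNDU-style undersampling sketch (not the paper's code).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier  # CART-style base classifier


def glvq_prototypes(X, y, lr=0.05, epochs=30, seed=0):
    """One prototype (centroid) per class, refined with a simplified GLVQ-like update."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    protos = np.array([X[y == c].mean(axis=0) for c in classes])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            d = np.linalg.norm(protos - X[i], axis=1)
            same = classes == y[i]
            j_pos = np.argmin(np.where(same, d, np.inf))   # nearest correct prototype
            j_neg = np.argmin(np.where(~same, d, np.inf))  # nearest wrong prototype
            # attract the correct prototype, repel the wrong one (simplified GLVQ step)
            protos[j_pos] += lr * (X[i] - protos[j_pos])
            protos[j_neg] -= lr * (X[i] - protos[j_neg])
    return classes, protos


def natural_neighbor_k(X, max_k=30):
    """Smallest k at which every point is a k-nearest neighbor of some other point."""
    nn = NearestNeighbors(n_neighbors=max_k + 1).fit(X)
    _, idx = nn.kneighbors(X)  # idx[:, 0] is the query point itself
    for k in range(1, max_k + 1):
        has_reverse = np.zeros(len(X), dtype=bool)
        for i in range(len(X)):
            has_reverse[idx[i, 1:k + 1]] = True  # i nominates its k neighbors
        if has_reverse.all():
            return k
    return max_k


def glndu_undersample(X, y, minority_label):
    """Remove majority instances whose neighborhoods overlap the minority class
    and that lie closer to the minority centroid than to the majority centroid."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes, protos = glvq_prototypes(X, y)
    k = natural_neighbor_k(X)
    min_proto = protos[list(classes).index(minority_label)]
    maj_proto = protos[[c != minority_label for c in classes]].mean(axis=0)
    maj_mask = y != minority_label
    # Overlap check: any minority point among the k natural neighbors?
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[maj_mask])
    overlaps = (y[idx[:, 1:]] == minority_label).any(axis=1)
    closer = (np.linalg.norm(X[maj_mask] - min_proto, axis=1)
              < np.linalg.norm(X[maj_mask] - maj_proto, axis=1))
    keep = np.ones(len(X), dtype=bool)
    keep[np.where(maj_mask)[0][overlaps & closer]] = False
    return X[keep], y[keep]


# Usage: undersample first, then fit any base classifier on the reduced data.
# X_res, y_res = glndu_undersample(X_train, y_train, minority_label=1)
# clf = DecisionTreeClassifier().fit(X_res, y_res)
```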
About the journal:
Cybernetics is concerned with describing complex interactions and interrelationships between systems which are omnipresent in our daily life. Machine Learning discovers fundamental functional relationships between variables and ensembles of variables in systems. The merging of the disciplines of Machine Learning and Cybernetics is aimed at the discovery of various forms of interaction between systems through diverse mechanisms of learning from data.
The International Journal of Machine Learning and Cybernetics (IJMLC) focuses on the key research problems emerging at the junction of machine learning and cybernetics and serves as a broad forum for rapid dissemination of the latest advancements in the area. The emphasis of IJMLC is on the hybrid development of machine learning and cybernetics schemes inspired by different contributing disciplines such as engineering, mathematics, cognitive sciences, and applications. New ideas, design alternatives, implementations and case studies pertaining to all the aspects of machine learning and cybernetics fall within the scope of the IJMLC.
Key research areas to be covered by the journal include:
Machine Learning for modeling interactions between systems
Pattern Recognition technology to support discovery of system-environment interaction
Control of system-environment interactions
Biochemical interaction in biological and biologically-inspired systems
Learning for improvement of communication schemes between systems