{"title":"Clustering-Based Oversampling Algorithm for Multi-class Imbalance Learning","authors":"Haixia Zhao, Jian Wu","doi":"10.1007/s00357-024-09491-1","DOIUrl":null,"url":null,"abstract":"<p>Multi-class imbalanced data learning faces many challenges. Its complex structural characteristics cause severe intra-class imbalance or overgeneralization in most solution strategies. This negatively affects data learning. This paper proposes a clustering-based oversampling algorithm (COM) to handle multi-class imbalance learning. In order to avoid the loss of important information, COM clusters the minority class based on the structural characteristics of the instances, among which rare instances and outliers are carefully portrayed through assigning a sampling weight to each of the clusters. Clusters with high densities are given low weights, and then, oversampling is performed within clusters to avoid overgeneralization. COM avoids intra-class imbalance effectively because low-density clusters are more likely than high-density ones to be selected to synthesize instances. Our study used the UCI and KEEL imbalanced datasets to demonstrate the effectiveness and stability of the proposed method.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"17 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Classification","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00357-024-09491-1","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0
Abstract
Multi-class imbalanced data learning faces many challenges. Its complex structural characteristics cause severe intra-class imbalance or overgeneralization in most solution strategies. This negatively affects data learning. This paper proposes a clustering-based oversampling algorithm (COM) to handle multi-class imbalance learning. In order to avoid the loss of important information, COM clusters the minority class based on the structural characteristics of the instances, among which rare instances and outliers are carefully portrayed through assigning a sampling weight to each of the clusters. Clusters with high densities are given low weights, and then, oversampling is performed within clusters to avoid overgeneralization. COM avoids intra-class imbalance effectively because low-density clusters are more likely than high-density ones to be selected to synthesize instances. Our study used the UCI and KEEL imbalanced datasets to demonstrate the effectiveness and stability of the proposed method.
期刊介绍:
To publish original and valuable papers in the field of classification, numerical taxonomy, multidimensional scaling and other ordination techniques, clustering, tree structures and other network models (with somewhat less emphasis on principal components analysis, factor analysis, and discriminant analysis), as well as associated models and algorithms for fitting them. Articles will support advances in methodology while demonstrating compelling substantive applications. Comprehensive review articles are also acceptable. Contributions will represent disciplines such as statistics, psychology, biology, information retrieval, anthropology, archeology, astronomy, business, chemistry, computer science, economics, engineering, geography, geology, linguistics, marketing, mathematics, medicine, political science, psychiatry, sociology, and soil science.