PRO-SMOTEBoost: An adaptive SMOTEBoost probabilistic algorithm for rebalancing and improving imbalanced data classification
Author: Laouni Djafri
Journal: Information Sciences, volume 690, Article 121548 (impact factor 8.1)
Publication date: 2024-10-18
DOI: 10.1016/j.ins.2024.121548
URL: https://www.sciencedirect.com/science/article/pii/S0020025524014622
In data mining and machine learning, handling imbalanced datasets is one of the most difficult problems. Class imbalance significantly degrades the classification of minority classes when common classification algorithms are used. These algorithms often prioritize the performance of the majority class at the expense of the minority class, leading to misclassifying negative instances as positive ones. To address this problem, the Synthetic Minority Over-sampling Technique (SMOTE) has become a popular way to rebalance imbalanced data before classification. In this paper, we propose two algorithms that further improve imbalanced classification. The first, PRO-SMOTE, is an improvement over SMOTE. It relies on conditional probabilities to rebalance imbalanced classes and to improve predictive performance metrics reliably. By considering conditional probabilities, PRO-SMOTE can reduce the majority classes and optimally increase the minority class. The second, PRO-SMOTEBoost, builds on PRO-SMOTE to overcome the classification anomalies and problems that machine learning algorithms, especially the weaker ones, encounter during classification. PRO-SMOTEBoost aims to maximize predictive precision by combining the strengths of PRO-SMOTE with boosting techniques. Evaluating these algorithms with traditional machine learning algorithms such as Random Forests, C4.5, Naive Bayes, and Support Vector Machines demonstrated excellent classification results. The performance metrics achieved by the proposed algorithm (F1-score, G-means, Precision, Accuracy, Recall, AUC-ROC, and Precision-Recall curves) range from over 90% to a score of 100%. Compared with using these traditional algorithms individually, PRO-SMOTEBoost improved performance metrics by 10% to 40%. Overall, PRO-SMOTE and PRO-SMOTEBoost offer effective solutions to the challenges posed by imbalanced datasets. They provide improved predictive metrics and demonstrate superiority over traditional and even modern machine learning algorithms.
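For readers unfamiliar with the baseline, the following is a minimal sketch of classic SMOTE interpolation together with the G-mean metric named in the abstract. It is not the PRO-SMOTE or PRO-SMOTEBoost procedure, whose probability-based details are not given in this abstract. The function names `smote` and `g_mean`, the parameters `k` and `seed`, and the toy data are assumptions made purely for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Classic SMOTE sketch: interpolate between a minority sample
    and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity (binary case)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: four minority points in two dimensions.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=6, k=3)

# Hypothetical confusion-matrix counts for an imbalanced test set.
score = g_mean(tp=8, fn=2, tn=90, fp=10)
```

Each synthetic point lies on a line segment between two existing minority samples, so it stays within the region the minority class already occupies. G-mean is a natural companion metric because, unlike plain accuracy, it drops sharply when either class is poorly recognized.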
About the journal:
Informatics and Computer Science, Intelligent Systems, Applications is an esteemed international journal that publishes original and creative research findings in the field of information sciences. We also feature a limited number of timely tutorial and survey contributions.
Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up-to-date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.