A hybrid‐ensemble model for software defect prediction for balanced and imbalanced datasets using AI‐based techniques with feature preservation: SMERKP‐XGB
Authors: Mohd Mustaqeem, Tamanna Siddiqui, Suhel Mustajab
Journal: Journal of Software-Evolution and Process, published 2024-09-18
DOI: 10.1002/smr.2731 (https://doi.org/10.1002/smr.2731)
Abstract
Maintaining software quality is a significant challenge because software complexity keeps increasing as the software industry grows. Software defects are a primary concern in complex modules, and predicting them in the early stages of the software development life cycle (SDLC) is difficult. Previous techniques to address this issue have not been very promising. To overcome this problem, we propose a hybrid ensemble model for software defect prediction that uses AI‐based techniques with feature preservation. We used the National Aeronautics and Space Administration (NASA) datasets from the PROMISE repository for testing and validation. After applying exploratory data analysis (EDA), feature engineering, scaling, and standardization, we found that the datasets are imbalanced, which can negatively affect the model's performance. To address this, we combined the Synthetic Minority Oversampling Technique (SMOTE) with edited nearest neighbor (ENN) cleaning (SMOTE‐ENN). We also used recursive feature elimination with cross‐validation (RFE‐CV) inside a pipeline to prevent data leakage during CV, and kernel‐based principal component analysis (K‐PCA) to reduce dimensionality and select relevant features. The reduced‐dimensional data is then passed to eXtreme Gradient Boosting (XGBoost) for classification, resulting in the hybrid‐ensemble (SMERKP‐XGB) model. The proposed SMERKP‐XGB model outperforms previously developed models in terms of accuracy (CM1: 97.53%, PC1: 92.05%, PC2: 97.45%, and KC1: 95.65%), area under the receiver operating characteristic curve (CM1: 96.30%, PC1: 98.30%, PC2: 99.30%, and KC1: 93.54%), and other evaluation criteria mentioned in the literature.
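The resampling step named in the abstract (SMOTE followed by ENN) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the toy data, the function names `smote` and `edited_nearest_neighbours`, the neighbour counts `k`, the `seed` parameter, and the majority-vote rule for ENN are all assumptions made for illustration. It also assumes a binary problem with at least two minority samples.

```python
import math
import random
from collections import Counter


def k_nearest(point, candidates, k):
    """Return indices of the k candidates closest to `point` (Euclidean)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def smote(X, y, minority, k=3, seed=0):
    """Oversample `minority` with interpolated points until classes balance."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(X) - len(minority_pts)) - len(minority_pts)
    synthetic = []
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_pts)
        # Drop the closest hit: it is the base point itself (distance 0).
        neighbours = k_nearest(base, minority_pts, k + 1)[1:]
        other = minority_pts[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return X + synthetic, y + [minority] * len(synthetic)


def edited_nearest_neighbours(X, y, k=3):
    """Drop samples whose label disagrees with the vote of their k neighbours."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(x, X[j]))[:k]
        vote = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if vote == label:
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y


# Hypothetical toy data: label 0 = non-defective, label 1 = defective.
# The last majority point sits inside the minority region (label noise).
X = [[0.0, 0.0], [0.5, 0.2], [0.1, 0.7], [0.9, 0.4],
     [0.3, 0.9], [0.8, 0.8], [0.2, 0.4], [5.1, 5.0],
     [5.0, 5.0], [5.5, 5.2], [5.2, 5.6]]
y = [0] * 8 + [1] * 3

X_over, y_over = smote(X, y, minority=1, k=2, seed=0)
X_clean, y_clean = edited_nearest_neighbours(X_over, y_over, k=3)

print("original:   ", Counter(y))
print("after SMOTE:", Counter(y_over))
print("after ENN:  ", Counter(y_clean))
```

To stay consistent with the leakage concern raised in the abstract, resampling of this kind would be applied to the training folds only, never to the held-out fold. In practice a library implementation such as `SMOTEENN` from imbalanced-learn would normally replace this hand-written version.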