A hybrid-ensemble model for software defect prediction for balanced and imbalanced datasets using AI-based techniques with feature preservation: SMERKP-XGB

Impact factor: 1.8 | Zone 4 (Computer Science) | JCR Q3: COMPUTER SCIENCE, SOFTWARE ENGINEERING | Journal of Software-Evolution and Process | Pub Date: 2024-09-17 | DOI: 10.1002/smr.2731

Mohd Mustaqeem, Tamanna Siddiqui, Suhel Mustajab


Citations: 0

Abstract

Maintaining software quality is a significant challenge as software complexity increases with the growth of the software industry. Software defects are a primary concern in complex modules, and predicting them in the early stages of the software development life cycle (SDLC) is difficult. Previous techniques for addressing this issue have not been very promising. To overcome this problem, we propose “A hybrid ensemble model for software defect prediction using AI-based techniques with feature preservation.” We use the National Aeronautics and Space Administration (NASA) dataset from the PROMISE repository for testing and validation. After applying exploratory data analysis (EDA), feature engineering, scaling, and standardization, we found that the dataset is imbalanced, which can negatively affect the model's performance. To address this, we use the Synthetic Minority Oversampling Technique (SMOTE) combined with edited nearest neighbor (ENN) cleaning (SMOTE-ENN). We also use recursive feature elimination with cross-validation (RFE-CV) inside a pipeline to prevent data leakage during CV, and kernel-based principal component analysis (K-PCA) to reduce dimensionality and select relevant features. The reduced-dimensional data is then passed to eXtreme Gradient Boosting (XGBoost) for classification, resulting in the hybrid-ensemble (SMERKP-XGB) model. The proposed SMERKP-XGB model outperforms previously developed models in terms of accuracy (CM1: 97.53%, PC1: 92.05%, PC2: 97.45%, KC1: 95.65%), area under the receiver operating characteristic curve (CM1: 96.30%, PC1: 98.30%, PC2: 99.30%, KC1: 93.54%), and other evaluation criteria mentioned in the literature.
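The SMOTE-ENN resampling step named in the abstract can be illustrated with a small, didactic sketch. This is not the authors' implementation: it is a pure-Python approximation under stated assumptions (binary labels, Euclidean distance, at least two minority samples, a majority-vote ENN rule), and the function names `nearest`, `smote`, `edited_nearest_neighbours`, and `demo`, as well as the toy data, are hypothetical. In practice a library such as imbalanced-learn (`imblearn.combine.SMOTEENN`) would normally be used instead.

```python
import math
import random
from collections import Counter


def nearest(points, query, k, skip=None):
    """Indices of the k points closest to `query` (Euclidean), excluding `skip`."""
    order = sorted(
        (i for i in range(len(points)) if i != skip),
        key=lambda i: math.dist(points[i], query),
    )
    return order[:k]


def smote(X, y, minority, k=5, seed=0):
    """Oversample `minority` by interpolation until both classes are equal in size.

    Assumes binary labels and at least two minority samples.
    """
    rng = random.Random(seed)
    pool = [x for x, label in zip(X, y) if label == minority]
    deficit = (len(y) - len(pool)) - len(pool)  # majority count minus minority count
    X_new, y_new = list(X), list(y)
    for _ in range(max(deficit, 0)):
        i = rng.randrange(len(pool))
        j = rng.choice(nearest(pool, pool[i], k, skip=i))
        t = rng.random()
        # The synthetic point lies on the segment between a minority sample
        # and one of its k nearest minority neighbours.
        X_new.append(tuple(a + t * (b - a) for a, b in zip(pool[i], pool[j])))
        y_new.append(minority)
    return X_new, y_new


def edited_nearest_neighbours(X, y, k=3):
    """Drop every sample whose label disagrees with the majority of its k neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        votes = Counter(y[j] for j in nearest(X, x, k, skip=i))
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def demo():
    """Toy imbalanced data (40 vs. 8 points); returns class counts at each stage."""
    rng = random.Random(42)
    majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
    minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(8)]
    X = majority + minority
    y = [0] * 40 + [1] * 8
    X_bal, y_bal = smote(X, y, minority=1)
    X_clean, y_clean = edited_nearest_neighbours(X_bal, y_bal)
    return Counter(y), Counter(y_bal), Counter(y_clean)


if __name__ == "__main__":
    before, balanced, cleaned = demo()
    print("original:", dict(before))
    print("after SMOTE:", dict(balanced))
    print("after ENN:", dict(cleaned))
```

SMOTE first balances the classes, and ENN then removes samples that sit in the wrong neighbourhood, so the final counts may be slightly below the balanced counts. To stay consistent with the abstract's point about avoiding data leakage, resampling of this kind would be fitted on the training folds only, never on the validation or test data.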




Source journal

Journal of Software-Evolution and Process (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

Self-citation rate: 10.00%

Articles published: 109