Software Defect Prediction Based on Optimized Machine Learning Models: A Comparative Study
Authors: M. Z. Siswantoro, Umi Laili Yuhana
Journal: Teknika
Published: 2023-06-30
DOI: 10.34148/teknika.v12i2.634 (https://doi.org/10.34148/teknika.v12i2.634)
Citation count: 0
Abstract
Software defect prediction is crucial for detecting possible defects in software before they manifest. Although machine learning models have become more prevalent in software defect prediction, their effectiveness varies with the dataset and the model's hyperparameters. It is difficult both to determine the most suitable hyperparameters for a model and to identify the prominent features that serve as input to the classifier. This research evaluates several traditional machine learning models, optimized for software defect prediction, on NASA MDP (Metrics Data Program) datasets. The datasets were classified using k-nearest neighbors (k-NN), decision trees, logistic regression, linear discriminant analysis (LDA), a single hidden layer multilayer perceptron (SHL-MLP), and a support vector machine (SVM). The hyperparameters of the models were tuned using random search, and the feature dimensionality was reduced using principal component analysis (PCA). The synthetic minority oversampling technique (SMOTE) was applied to oversample the minority class and correct the class imbalance. k-NN was found to be the most suitable model for software defect prediction on several datasets, while SHL-MLP and SVM were also effective on certain datasets. Logistic regression and LDA did not perform as well as the other models. Moreover, the optimized models outperformed the baseline models in terms of classification accuracy. The choice of model for software defect prediction should therefore be based on the specific characteristics of the dataset, and hyperparameter tuning can improve the accuracy of machine learning models in predicting software defects.
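The workflow described in the abstract (oversample the minority class, then tune a classifier by random search) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, a simplified SMOTE-style interpolation, a plain k-NN classifier, and randomly generated toy data rather than the NASA MDP datasets. The PCA step is omitted for brevity, and every function and variable name here is an assumption made for illustration. In practice, libraries such as scikit-learn and imbalanced-learn provide these components.

```python
import math
import random


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def knn_predict(train_X, train_y, x, k):
    """Majority vote (labels 0/1) among the k training points closest to x."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def accuracy(train_X, train_y, val_X, val_y, k):
    """Fraction of validation points that k-NN labels correctly."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(val_X, val_y))
    return hits / len(val_X)


def random_search(train_X, train_y, val_X, val_y, candidates, n_iter, rng):
    """Sample n_iter values of k at random and keep the best-scoring one."""
    best_k, best_acc = None, -1.0
    for _ in range(n_iter):
        k = rng.choice(candidates)
        acc = accuracy(train_X, train_y, val_X, val_y, k)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc


rng = random.Random(0)

# Toy imbalanced data: 40 clean modules (label 0), 8 defective ones (label 1).
clean = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
defective = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(8)]

# Hold out a validation set; only the training split is oversampled.
train_clean, val_clean = clean[:30], clean[30:]
train_defective, val_defective = defective[:5], defective[5:]

# Add synthetic defective samples until both training classes have 30 points.
synthetic = smote(train_defective, n_new=25, k=3, rng=rng)
train_X = train_clean + train_defective + synthetic
train_y = [0] * len(train_clean) + [1] * (len(train_defective) + len(synthetic))
val_X = val_clean + val_defective
val_y = [0] * len(val_clean) + [1] * len(val_defective)

candidates = [1, 3, 5, 7, 9]
best_k, best_acc = random_search(train_X, train_y, val_X, val_y,
                                 candidates, n_iter=4, rng=rng)
print(f"best k = {best_k}, validation accuracy = {best_acc:.2f}")
```

Oversampling only the training split keeps synthetic points out of the validation data, so the reported accuracy reflects real samples. Random search differs from grid search in that it draws a fixed number of configurations at random instead of trying every candidate, which is why `n_iter` can be smaller than the number of candidates.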