Heterogeneous Fault Prediction Using Feature Selection and Supervised Learning Algorithms

Vietnam. J. Comput. Sci. Pub Date : 2022-01-24 DOI:10.1142/s2196888822500142

R. Arora, Arvinder Kaur



Abstract

Software Fault Prediction (SFP) is a prominent research area in software engineering. Fault prediction carried out within a single software project is known as within-project fault prediction. However, a project's local data repository often does not hold enough data to build a within-project prediction model. Cross-project fault prediction (CPFP), proposed in recent years, addresses this by constructing a prediction model on one project and using it to predict faults in another. However, CPFP requires that the training and testing datasets use the same set of metrics, so traditional CPFP approaches are difficult to apply across projects with different metric sets. Heterogeneous Fault Prediction (HFP) is a special case of CPFP that allows faults to be predicted across projects with different metrics. The proposed framework builds an HFP model by applying feature selection to both the source and target datasets and then training a prediction model with supervised machine learning techniques. The approach is applied to two open-source projects, Linux and MySQL, and prediction is evaluated using the Area Under the Curve (AUC) performance measure. The key results are as follows. First, the approach gives significantly better prediction performance for heterogeneous projects than for cross projects. Second, feature selection combined with feature mapping has a significant effect on HFP models. Non-parametric statistical analyses, such as the Friedman test and the Nemenyi post-hoc test, show that Logistic Regression performed significantly better than the other supervised learning algorithms in HFP models.
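
The pipeline outlined in the abstract (feature selection, feature mapping between differing metric sets, a supervised learner, AUC evaluation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not name the selection or mapping methods, so ANOVA F-score ranking (`SelectKBest`) and Kolmogorov-Smirnov distance matching are stand-ins. Selection is applied to the source side only, with the target reduced through the mapping. The data is synthetic, and the helper names (`make_project`, `select_source_features`, `map_features`, `hfp_auc`) are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import ks_2samp
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler


def make_project(rng, n_modules, n_metrics):
    """Synthetic stand-in for a project: metric matrix X and fault labels y."""
    y = rng.integers(0, 2, size=n_modules)
    X = rng.normal(size=(n_modules, n_metrics))
    X[:, 0] += 1.5 * y  # make the first metric informative about faults
    return X, y


def select_source_features(X_src, y_src, k):
    """Assumed selector: indices of the k source metrics with the best F-score."""
    selector = SelectKBest(score_func=f_classif, k=k).fit(X_src, y_src)
    return np.flatnonzero(selector.get_support())


def map_features(X_src, X_tgt):
    """Assumed mapping: pair each source metric with a distinct target metric
    so that the total Kolmogorov-Smirnov distance between pairs is minimal."""
    cost = np.array(
        [[ks_2samp(s, t).statistic for t in X_tgt.T] for s in X_src.T]
    )
    return linear_sum_assignment(cost)  # (source column ids, target column ids)


def hfp_auc(X_src, y_src, X_tgt, y_tgt, k=3):
    """Train on the source project, score the target project, return the AUC."""
    # Standardise each project separately so metric distributions are comparable.
    Z_src = StandardScaler().fit_transform(X_src)
    Z_tgt = StandardScaler().fit_transform(X_tgt)

    keep = select_source_features(Z_src, y_src, k)
    src_idx, tgt_idx = map_features(Z_src[:, keep], Z_tgt)

    model = LogisticRegression(max_iter=1000)
    model.fit(Z_src[:, keep][:, src_idx], y_src)
    scores = model.predict_proba(Z_tgt[:, tgt_idx])[:, 1]

    # Target labels are used only for evaluation, never for training.
    return roc_auc_score(y_tgt, scores), keep[src_idx], tgt_idx


rng = np.random.default_rng(0)
X_src, y_src = make_project(rng, n_modules=300, n_metrics=6)  # source metric set
X_tgt, y_tgt = make_project(rng, n_modules=200, n_metrics=4)  # different metric set

auc, src_cols, tgt_cols = hfp_auc(X_src, y_src, X_tgt, y_tgt, k=3)
print("matched source -> target metrics:", list(zip(src_cols, tgt_cols)))
print("AUC on target project:", round(auc, 3))
```

To compare several learners across datasets in the way the abstract describes, the per-dataset AUC values could be passed to `scipy.stats.friedmanchisquare`. The Nemenyi post-hoc test is not part of SciPy and would need a separate package.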


