An Approach for Cross Project Defect Prediction Using Identical Metrics Matching and Deep Neural Network
Authors: Pravas Ranjan Bal; Sandeep Kumar
Journal: IEEE Transactions on Reliability, vol. 74, no. 2, pp. 2678-2692
Publication date: 2024-08-14
Publication type: Journal Article
DOI: 10.1109/TR.2024.3435709
URL: https://ieeexplore.ieee.org/document/10636799/
Impact factor: 5.7 (JCR Q1, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)
Citation count: 0
Abstract
Advances in software defect prediction (SDP) for projects with little or no historical data have introduced the concept of cross-project defect prediction (CPDP). CPDP using machine learning (ML) algorithms has become a staple research area in the SDP domain. An important assumption of ML algorithms is that the training and test data follow similar distributions, which is needed for good accuracy. This assumption may hold in the within-project defect prediction (WPDP) scenario, where the training and test data belong to the same project. However, it generally does not hold in the CPDP scenario, where the training and test data belong to different projects. Researchers have therefore tried a matched-metrics approach to handle this issue. A problem remains, however: if only a small source (training) dataset matches the data distribution of the target (test) dataset, the training data are insufficient. We therefore propose a cross-project data preprocessing method, knowledge transfer from target data to source data using correlation (KTTSC), to handle this issue and improve the CPDP accuracy of ML models. The experimental results show that, with the proposed KTTSC method, the dropout-regularized deep neural network, k-nearest neighbor, decision tree, logistic regression, and Naive Bayes classifiers improve average AUC scores by 22%, 17%, 23.2%, 13.5%, and 9.5%, respectively, compared with the traditional CPDP method, and by 6.6% to 11.1% compared with existing work on CPDP.
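To make the evaluation setting concrete, the following is a minimal sketch of a traditional CPDP run scored by AUC: a model is fitted on one project (source) and evaluated on another (target). It is not the KTTSC method, whose details the abstract does not give. The nearest-centroid scorer, the function names, and the toy metric values are all assumptions chosen for illustration.

```python
from math import dist


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one defective and one clean module")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid(rows):
    """Column-wise mean of a list of metric vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]


def fit_centroids(X, y):
    """Hypothetical stand-in classifier: one centroid per class."""
    clean = centroid([x for x, label in zip(X, y) if label == 0])
    defective = centroid([x for x, label in zip(X, y) if label == 1])
    return clean, defective


def defect_score(model, x):
    """Higher score means the module lies closer to the defective centroid."""
    clean, defective = model
    return dist(x, clean) - dist(x, defective)


# Toy data (invented): two metrics per module, label 1 = defective.
source_X = [[10, 1], [12, 2], [50, 9], [60, 8]]  # source (training) project
source_y = [0, 0, 1, 1]
target_X = [[11, 1], [55, 7], [20, 3], [48, 9]]  # target (test) project
target_y = [0, 1, 0, 1]

# Traditional CPDP: train on the source project, evaluate on the target project.
model = fit_centroids(source_X, source_y)
target_scores = [defect_score(model, x) for x in target_X]
cpdp_auc = auc(target_y, target_scores)
print(f"Cross-project AUC on the toy target project: {cpdp_auc:.2f}")
```

In this toy case the two projects share the same metrics and similar value ranges, so the source-trained model ranks the target modules well. When the distributions differ, the same procedure can score much lower, which is the situation that preprocessing methods such as KTTSC aim to address.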
About the journal:
IEEE Transactions on Reliability is a refereed journal for the reliability and allied disciplines including, but not limited to, maintainability, physics of failure, life testing, prognostics, design and manufacture for reliability, reliability for systems of systems, network availability, mission success, warranty, safety, and various measures of effectiveness. Topics eligible for publication range from hardware to software, from materials to systems, from consumer and industrial devices to manufacturing plants, from individual items to networks, from techniques for making things better to ways of predicting and measuring behavior in the field. As an engineering subject that supports new and existing technologies, we constantly expand into new areas of the assurance sciences.