Negative Transfer in Cross Project Defect Prediction: Effect of Domain Divergence
Osayande P. Omondiagbe, Sherlock A. Licorish, Stephen G. MacDonell
2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), August 2022
DOI: 10.1109/SEAA56994.2022.00010
Citations: 0
Abstract
Cross-project defect prediction (CPDP) models are used in new software projects to improve defect prediction rates. Developing these models can be challenging when there is little or no historical data, so researchers may need to rely on multiple sources and use transfer learning-based CPDP to build defect prediction models. These data are typically taken from similar and related projects, but their distributions can differ from that of the new software project (the target data). Although transfer learning-based CPDP models are designed to handle such distribution differences, differences that the model does not handle correctly may lead to negative transfer. Recent works have focused on building transfer CPDP models, but little is known about how similar or dissimilar sources should be to avoid negative transfer. This paper provides the first empirical investigation of the effect of combining sources with different levels of similarity in transfer CPDP. We introduce the Population Stability Index (PSI) to assess whether the distribution of the combined or single-source data is similar to that of the target data, and we validate this assessment using an adversarial approach. Experimental results on three public datasets reveal that when the source and target distributions are very similar, the probability of false alarm improves by 3% to 7% and the recall indicator is reduced by 1% to 8%. Interestingly, we also found that combining dissimilar source data with other source datasets lowers the overall domain divergence and improves performance. These results highlight the importance of selecting the right sources to aid the learning process.
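The abstract names two similarity checks, PSI and adversarial validation, without showing how they are computed. Below is a minimal illustrative sketch, not the authors' implementation: it computes PSI between a source and a target feature distribution and runs a basic adversarial-validation check. The bin count, smoothing constant, choice of logistic regression, and all names and data in the usage example are assumptions made for illustration.

```python
# Minimal sketch (not the paper's code): PSI between two samples of a
# software metric, plus an adversarial-validation check of similarity.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def psi(source, target, n_bins=10, eps=1e-6):
    """PSI = sum((p_i - q_i) * ln(p_i / q_i)) over shared bins.
    Values near 0 suggest similar distributions; larger values
    indicate growing divergence. n_bins and eps are illustrative."""
    # Take bin edges from the pooled data so both samples share bins.
    edges = np.histogram_bin_edges(np.concatenate([source, target]), bins=n_bins)
    p, _ = np.histogram(source, bins=edges)
    q, _ = np.histogram(target, bins=edges)
    p = p / p.sum() + eps  # smooth empty bins to avoid log(0)
    q = q / q.sum() + eps
    return np.sum((p - q) * np.log(p / q))


def adversarial_auc(source_X, target_X):
    """Train a classifier to tell source rows from target rows; a
    cross-validated AUC near 0.5 means the two samples are hard to
    separate, i.e. their distributions look similar."""
    X = np.vstack([source_X, target_X])
    y = np.concatenate([np.zeros(len(source_X)), np.ones(len(target_X))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()


# Toy usage with synthetic values standing in for one code-metric column.
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=500)   # hypothetical "source project" metric
tgt = rng.normal(0.3, 1.2, size=500)   # shifted "target project" metric
print(f"PSI: {psi(src, tgt):.3f}")
print(f"Adversarial AUC: {adversarial_auc(src.reshape(-1, 1), tgt.reshape(-1, 1)):.3f}")
```

As a common rule of thumb (not stated in the paper), a PSI below roughly 0.1 suggests little distribution shift, while an adversarial AUC close to 0.5 indicates the classifier cannot distinguish source from target, both consistent with the "very similar" regime the abstract describes.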