Negative Transfer in Cross Project Defect Prediction: Effect of Domain Divergence

2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). Pub Date: 2022-08-01. DOI: 10.1109/SEAA56994.2022.00010

Osayande P. Omondiagbe, Sherlock A. Licorish, Stephen G. MacDonell



Abstract

Cross-project defect prediction (CPDP) models are used in new software project prediction tasks to improve defect prediction rates. Developing these CPDP models can be challenging when there is little or no historical data. For this reason, researchers may need to rely on multiple sources and use transfer learning-based CPDP to build defect prediction models. These data are typically taken from similar and related projects, but their distributions can differ from that of the new software project (the target data). Although transfer learning-based CPDP models are designed to handle these distribution differences, differences that the model does not handle correctly may lead to negative transfer. Recent works have focused on building transfer CPDP models, but little is known about how similar or dissimilar sources should be to avoid negative transfer. This paper provides the first empirical investigation of the effect of combining different sources with different levels of similarity in transfer CPDP. We introduce the use of the Population Stability Index (PSI) to interpret whether the distribution of the combined or single-source data is similar to the target data. This was validated using an adversarial approach. Experimental results on three public datasets reveal that when the source and target distributions are very similar, the probability of false alarm is improved by 3% to 7% and the recall indicator is reduced by 1% to 8%. Interestingly, we also found that when dissimilar source data are combined with different source datasets, the overall domain divergence is lowered and the performance is improved. The results highlight the importance of using the right source to aid the learning process.
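The abstract does not spell out how the PSI is computed, so the following is only a minimal sketch of the standard PSI definition for a single numeric feature, not the authors' implementation. The function name `psi`, the choice of ten equal-width bins over the target's range, the `eps` smoothing, and the synthetic Gaussian data are all assumptions made for illustration.

```python
import math
import random


def psi(reference, other, n_bins=10, eps=1e-6):
    """Population Stability Index of `other` relative to `reference` (one feature).

    Hypothetical helper: bins are equal-width over the reference range,
    which is one common convention and an assumption here.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        total = len(values)
        # eps keeps empty bins from causing log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    ref_p = proportions(reference)
    oth_p = proportions(other)
    # PSI = sum over bins of (other_share - ref_share) * ln(other_share / ref_share)
    return sum((o - r) * math.log(o / r) for r, o in zip(ref_p, oth_p))


# Synthetic stand-ins for one software metric (assumed data, not from the paper).
rng = random.Random(0)
target = [rng.gauss(0.0, 1.0) for _ in range(2000)]
similar_source = [rng.gauss(0.05, 1.0) for _ in range(2000)]
dissimilar_source = [rng.gauss(1.5, 1.0) for _ in range(2000)]

psi_similar = psi(target, similar_source)
psi_dissimilar = psi(target, dissimilar_source)
print(f"PSI(target, similar source)    = {psi_similar:.4f}")
print(f"PSI(target, dissimilar source) = {psi_dissimilar:.4f}")
```

A larger PSI indicates a larger divergence between the source and target distributions. A commonly cited rule of thumb, which is not taken from this paper, treats values below 0.1 as little shift and values above 0.25 as substantial shift.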


