Negative Transfer in Cross Project Defect Prediction: Effect of Domain Divergence

Osayande P. Omondiagbe, Sherlock A. Licorish, Stephen G. MacDonell
{"title":"Negative Transfer in Cross Project Defect Prediction: Effect of Domain Divergence","authors":"Osayande P. Omondiagbe, Sherlock A. Licorish, Stephen G. MacDonell","doi":"10.1109/SEAA56994.2022.00010","DOIUrl":null,"url":null,"abstract":"Cross-project defect prediction (CPDP) models are used in new software project prediction tasks to improve defect prediction rates. The development of these CPDP models could be challenging in cases where there is little or no historical data. For this reason, researchers may need to rely on multiple sources and use transfer learning-based CPDP for building defect prediction models. These data are typically taken from similar and related projects, but their distributions can be different from the new software project (target data). Although, transfer learning-based CPDP models are designed to handle these distribution differences, but if not correctly handled by the model, may lead to negative transfer. To this end, recent works have focused on building transfer CPDP models, but little is known about how similar or dissimilar sources should be to avoid negative transfer. This paper provides the first empirical investigation to understand the effect of combining different sources with different levels of similarities in transfer CPDP. We introduce the use of the Population Stability Index (PSI) to interpret whether the distribution of the combined or single-source data is similar to the target data. This was validated using an adversarial approach. Experimental results on three public datasets reveal that when the source and target distribution are very similar, the probability of false alarm is improved by 3% to 7% and the recall indicator is reduced from 1% to 8%. Interestingly, we also found that when dissimilar source data are combined with different source datasets, the overall domain divergence is lowered, and the performance is improved. The results highlight the importance of using the right source to aid the learning process.","PeriodicalId":269970,"journal":{"name":"2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SEAA56994.2022.00010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Cross-project defect prediction (CPDP) models are used in new software project prediction tasks to improve defect prediction rates. The development of these CPDP models could be challenging in cases where there is little or no historical data. For this reason, researchers may need to rely on multiple sources and use transfer learning-based CPDP for building defect prediction models. These data are typically taken from similar and related projects, but their distributions can be different from the new software project (target data). Although, transfer learning-based CPDP models are designed to handle these distribution differences, but if not correctly handled by the model, may lead to negative transfer. To this end, recent works have focused on building transfer CPDP models, but little is known about how similar or dissimilar sources should be to avoid negative transfer. This paper provides the first empirical investigation to understand the effect of combining different sources with different levels of similarities in transfer CPDP. We introduce the use of the Population Stability Index (PSI) to interpret whether the distribution of the combined or single-source data is similar to the target data. This was validated using an adversarial approach. Experimental results on three public datasets reveal that when the source and target distribution are very similar, the probability of false alarm is improved by 3% to 7% and the recall indicator is reduced from 1% to 8%. Interestingly, we also found that when dissimilar source data are combined with different source datasets, the overall domain divergence is lowered, and the performance is improved. The results highlight the importance of using the right source to aid the learning process.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
跨项目缺陷预测中的负迁移:领域发散的影响
跨项目缺陷预测(CPDP)模型用于新的软件项目预测任务,以提高缺陷预测率。在历史数据很少或没有历史数据的情况下,这些CPDP模型的开发可能具有挑战性。由于这个原因,研究人员可能需要依赖多种来源,并使用基于迁移学习的CPDP来构建缺陷预测模型。这些数据通常来自相似的和相关的项目,但是它们的分布可能不同于新的软件项目(目标数据)。虽然,基于迁移学习的CPDP模型是为了处理这些分布差异而设计的,但如果模型处理不当,可能会导致负迁移。为此,最近的工作集中在建立迁移CPDP模型上,但很少有人知道相似或不相似的来源应该如何避免负迁移。本文首次通过实证研究了解不同来源、不同相似度对迁移CPDP的影响。我们引入了使用人口稳定指数(PSI)来解释组合或单一来源数据的分布是否与目标数据相似。使用对抗性方法验证了这一点。在三个公开数据集上的实验结果表明,当源和目标分布非常相似时,误报概率提高3%至7%,召回率指标从1%降至8%。有趣的是,我们还发现,当不同的源数据与不同的源数据集结合时,整体的域散度降低,性能得到提高。结果强调了使用正确的资源来帮助学习过程的重要性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Service Classification through Machine Learning: Aiding in the Efficient Identification of Reusable Assets in Cloud Application Development Handling Environmental Uncertainty in Design Time Access Control Analysis How are software datasets constructed in Empirical Software Engineering studies? A systematic mapping study Microservices smell detection through dynamic analysis Towards Secure Agile Software Development Process: A Practice-Based Model
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1