An Approach for Cross Project Defect Prediction Using Identical Metrics Matching and Deep Neural Network

IF 5.7 | Zone 2 (Computer Science) | Q1 (Computer Science, Hardware & Architecture) | IEEE Transactions on Reliability | Pub Date: 2024-08-14 | DOI: 10.1109/TR.2024.3435709
Pravas Ranjan Bal; Sandeep Kumar
Published in: IEEE Transactions on Reliability, vol. 74, no. 2, pp. 2678-2692. Full text: https://ieeexplore.ieee.org/document/10636799/
Citation count: 0

Abstract

Advances in software defect prediction (SDP) for scenarios with little or no historical data have introduced the concept of cross-project defect prediction (CPDP). CPDP using machine learning (ML) algorithms has been a staple research area in the SDP domain. An important assumption in ML algorithms is that the training and test data follow similar data distributions for better accuracy. This assumption may hold in the within-project defect prediction (WPDP) scenario, where both training and test data belong to the same project. However, this is impossible in the CPDP scenario, where the training and test data belong to different projects. So, in the CPDP scenario, researchers have tried to use a matched-metrics approach to handle this issue. However, a problem arises if only a small source (training) dataset matches the data distribution of the target (test) dataset, leading to an insufficient training dataset. Hence, we propose a cross-project data preprocessing method, namely knowledge transfer from target data to source data using correlation (KTTSC), to handle this issue and thereby improve the CPDP accuracy of ML models. The experimental results demonstrate that the dropout regularization-based deep neural network, k-nearest neighbor, decision tree, logistic regression, and Naive Bayes classifiers, when used with the proposed KTTSC method, improve average AUC scores by 22%, 17%, 23.2%, 13.5%, and 9.5%, respectively, compared to the traditional CPDP method, and by 6.6% to 11.1% compared to existing works on CPDP.
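The abstract refers to two ideas without giving details: matching metrics between a source and a target project, and scoring models by AUC. The Python sketch below illustrates both in their simplest form. It is not the authors' KTTSC method, which the abstract does not specify. It assumes that "identical metrics matching" means keeping the metrics that appear under the same name in both projects, and every project name, metric name, and value is hypothetical toy data.

```python
from typing import Dict, List, Sequence


def match_identical_metrics(source: Dict[str, List[float]],
                            target: Dict[str, List[float]]) -> List[str]:
    """Return the metric names present in both projects, in source order."""
    return [name for name in source if name in target]


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a defective module outranks a clean one.

    A tie between a defective and a clean module counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one defective and one clean module")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: metric name -> one value per module.
source_project = {"loc": [10, 200, 35], "cyclomatic": [1, 14, 3], "fan_in": [2, 5, 1]}
target_project = {"cyclomatic": [2, 9, 4, 1], "loc": [20, 150, 60, 12],
                  "halstead": [3.0, 8.1, 4.2, 1.5]}

shared = match_identical_metrics(source_project, target_project)
print(shared)  # ['loc', 'cyclomatic']

# Hypothetical defect labels (1 = defective) and model scores for the target modules.
target_labels = [0, 1, 1, 0]
target_scores = [0.2, 0.9, 0.4, 0.4]
print(auc_score(target_labels, target_scores))  # 0.875
```

In a traditional CPDP setup, a classifier would be trained on the shared metrics of the source project and its scores on the target project would be fed to a function like `auc_score`. The pairwise definition used here is quadratic in the number of modules, which is fine for illustration but not for large datasets.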
Source journal
IEEE Transactions on Reliability (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 12.20
Self-citation rate: 8.50%
Articles published: 153
Review time: 7.5 months
About the journal: IEEE Transactions on Reliability is a refereed journal for the reliability and allied disciplines including, but not limited to, maintainability, physics of failure, life testing, prognostics, design and manufacture for reliability, reliability for systems of systems, network availability, mission success, warranty, safety, and various measures of effectiveness. Topics eligible for publication range from hardware to software, from materials to systems, from consumer and industrial devices to manufacturing plants, from individual items to networks, from techniques for making things better to ways of predicting and measuring behavior in the field. As an engineering subject that supports new and existing technologies, we constantly expand into new areas of the assurance sciences.