Cross-Project Defect Prediction Based on Domain Adaptation and LSTM Optimization

Algorithms. Pub Date: 2024-04-24. DOI: 10.3390/a17050175
Khadija Javed, Shengbing Ren, M. Asim, M. A. Wani

Abstract

Cross-project defect prediction (CPDP) aims to predict software defects in a target project domain by leveraging information from different source project domains, allowing testers to identify defective modules quickly. However, CPDP models often underperform because of differing data distributions between source and target domains, class imbalance, and noisy or irrelevant instances in both source and target projects. In addition, standard features often fail to capture sufficient semantic and contextual information from the source project, which leads to poor prediction performance in the target project. To address these challenges, this research proposes SMOTE Correlation and Attention Gated recurrent unit based Long Short-Term Memory optimization (SCAG-LSTM). The method first employs a novel hybrid technique that extends the synthetic minority over-sampling technique (SMOTE) with edited nearest neighbors (ENN) to rebalance class distributions and mitigate the effect of noisy and irrelevant instances in both source and target domains. Next, correlation-based feature selection (CFS) with best-first search (BFS) identifies and selects the most important features, with the aim of reducing the differences in data distribution among projects. SCAG-LSTM then integrates bidirectional gated recurrent unit (Bi-GRU) and bidirectional long short-term memory (Bi-LSTM) networks to enhance the effectiveness of the long short-term memory (LSTM) model. These components capture semantic and contextual information as well as dependencies within the data, leading to more accurate predictions. Finally, an attention mechanism is incorporated into the model to focus on key features, further improving prediction performance.
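The rebalancing step (SMOTE followed by ENN) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes binary labels (0 = clean, 1 = defective), Euclidean distance, at least two minority samples, and a majority-vote ENN rule. The helper names `nearest`, `smote`, and `enn` and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def nearest(point, pool, k, skip=None):
    """Indices of the k points in `pool` closest to `point`, ignoring index `skip`."""
    candidates = [j for j in range(len(pool)) if j != skip]
    candidates.sort(key=lambda j: math.dist(point, pool[j]))
    return candidates[:k]


def smote(X, y, minority=1, k=3, seed=0):
    """Add synthetic minority samples until both classes have the same size."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    needed = len(X) - 2 * len(minority_pts)  # majority count minus minority count
    X_new, y_new = list(X), list(y)
    for _ in range(max(needed, 0)):
        i = rng.randrange(len(minority_pts))
        j = rng.choice(nearest(minority_pts[i], minority_pts, k, skip=i))
        gap = rng.random()  # position on the segment between the two samples
        a, b = minority_pts[i], minority_pts[j]
        X_new.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
        y_new.append(minority)
    return X_new, y_new


def enn(X, y, k=3):
    """Keep only samples whose label matches the majority of their k neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        votes = [y[j] for j in nearest(x, X, k, skip=i)]
        if 2 * votes.count(label) > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (illustrative only): six clean modules, two defective ones,
# and one clean-labelled sample sitting inside the defective region.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2),
     (2.05, 2.05),
     (2.0, 2.0), (2.1, 2.2)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1]

X_over, y_over = smote(X, y)        # oversample the defective class
X_bal, y_bal = enn(X_over, y_over)  # then remove samples that disagree with their neighbours
print(Counter(y_over), Counter(y_bal))
```

In this toy run the oversampling step equalizes the two classes, and the editing step then drops the clean-labelled sample that is surrounded by defective neighbours. Applying oversampling before editing mirrors the order given in the abstract; the neighbour count `k` and the voting rule are free choices here.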
Experiments are conducted on the AEEEM datasets (apache_lucene, equinox, eclipse_jdt_core, eclipse_pde_ui, and mylyn) and the PROMISE (predictor models in software engineering) datasets. On AEEEM, SCAG-LSTM is compared with the active learning-based method (ALTRA), the multi-source-based cross-project defect prediction method (MSCPDP), and the two-phase feature importance amplification method (TFIA). On PROMISE, it is compared with the two-phase transfer learning method (TPTL), the domain adaptive kernel twin support vector machines method (DA-KTSVMO), and the generative adversarial long short-term memory neural networks method (GB-CPDP). The results show that, on the AEEEM dataset, the proposed SCAG-LSTM model improves on the baseline models by 33.03%, 29.15%, and 1.48% in terms of F1-measure and by 16.32%, 34.41%, and 3.59% in terms of area under the curve (AUC). On the PROMISE dataset, it improves on the baseline models' F1-measure by 42.60%, 32.00%, and 25.10% and AUC by 34.90%, 27.80%, and 12.96%. These findings suggest that the proposed model exhibits strong predictive performance.
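For readers unfamiliar with the two reported metrics, the sketch below computes F1-measure and AUC on a small, made-up set of predictions. The abstract does not say whether the reported percentages are relative gains or percentage-point differences, so the `relative_gain` helper shows only one possible reading. All function names, labels, and scores here are illustrative assumptions, not data from the paper.

```python
def f1_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the defective class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Chance that a random defective module outscores a random clean one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent (one possible reading)."""
    return 100.0 * (new - base) / base


# Made-up labels and predicted defect probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

f1 = f1_measure(y_true, y_pred)
area = auc(y_true, scores)
gain = relative_gain(0.60, 0.50)  # hypothetical metric values
print(round(f1, 4), round(area, 4), round(gain, 2))
```

F1-measure depends on the chosen decision threshold, whereas AUC is threshold-free, which is one reason defect prediction studies commonly report both.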