Cross-Project Defect Prediction Based on Domain Adaptation and LSTM Optimization

Algorithms. Pub Date: 2024-04-24. DOI: 10.3390/a17050175

Khadija Javed, Shengbing Ren, M. Asim, M. A. Wani

Abstract

Cross-project defect prediction (CPDP) aims to predict software defects in a target project domain by leveraging information from different source project domains, allowing testers to identify defective modules quickly. However, CPDP models often underperform due to different data distributions between source and target domains, class imbalances, and the presence of noisy and irrelevant instances in both source and target projects. Additionally, standard features often fail to capture sufficient semantic and contextual information from the source project, leading to poor prediction performance in the target project. To address these challenges, this research proposes Smote Correlation and Attention Gated recurrent unit based Long Short-Term Memory optimization (SCAG-LSTM), which first employs a novel hybrid technique that extends the synthetic minority over-sampling technique (SMOTE) with edited nearest neighbors (ENN) to rebalance class distributions and mitigate the issues caused by noisy and irrelevant instances in both source and target domains. Furthermore, correlation-based feature selection (CFS) with best-first search (BFS) is utilized to identify and select the most important features, aiming to reduce the differences in data distribution among projects. Additionally, SCAG-LSTM integrates bidirectional gated recurrent unit (Bi-GRU) and bidirectional long short-term memory (Bi-LSTM) networks to enhance the effectiveness of the long short-term memory (LSTM) model. These components efficiently capture semantic and contextual information as well as dependencies within the data, leading to more accurate predictions. Moreover, an attention mechanism is incorporated into the model to focus on key features, further improving prediction performance. 
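The rebalancing step described above (SMOTE followed by ENN) can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not specify the neighbor count, the distance metric, or exactly how the two techniques are combined. The Euclidean distance, `k=3`, the choice to let ENN remove rows from both classes, the function names (`knn_indices`, `smote`, `enn`), and the toy data are all assumptions made for illustration.

```python
import math
import random


def knn_indices(points, query_idx, k, candidates):
    """Return up to k candidate indices nearest to points[query_idx] (itself excluded)."""
    q = points[query_idx]
    others = [i for i in candidates if i != query_idx]
    others.sort(key=lambda i: math.dist(q, points[i]))
    return others[:k]


def smote(X, y, minority=1, k=3, seed=0):
    """Append synthetic minority rows until both classes have the same count.

    Assumes a binary problem with at least two minority rows.
    """
    rng = random.Random(seed)
    X, y = [list(row) for row in X], list(y)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(min_idx)) - len(min_idx)
    for _ in range(max(0, n_needed)):
        i = rng.choice(min_idx)                        # a real minority row
        j = rng.choice(knn_indices(X, i, k, min_idx))  # one of its minority neighbors
        gap = rng.random()                             # position along the segment i -> j
        X.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y.append(minority)
    return X, y


def enn(X, y, k=3):
    """Keep only rows whose label matches the majority label of their k nearest neighbors.

    Assumption: rows of either class may be removed.
    """
    all_idx = range(len(X))
    keep = []
    for i in all_idx:
        neighbors = knn_indices(X, i, k, all_idx)
        agree = sum(1 for j in neighbors if y[j] == y[i])
        if 2 * agree > len(neighbors):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): six clean modules near the origin, three defective
# modules near (5, 5), and one row labeled clean that sits inside the
# defective cluster and therefore acts as label noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.2, 0.8],
     [5, 5], [5, 6], [6, 5], [5.4, 5.4]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

X_bal, y_bal = smote(X, y)            # oversample the defective class
X_clean, y_clean = enn(X_bal, y_bal)  # then drop rows that contradict their neighbors
print(y_bal.count(0), y_bal.count(1))
print(y_clean.count(0), y_clean.count(1))
```

In this toy run the oversampling step equalizes the two classes, and the editing step then discards the mislabeled row because all of its nearest neighbors are defective. For real experiments, a maintained implementation such as the `SMOTEENN` class in the imbalanced-learn package is likely a better starting point than this sketch.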
Experiments are conducted on the AEEEM datasets (apache_lucene, equinox, eclipse_jdt_core, eclipse_pde_ui, and mylyn) and the predictor models in software engineering (PROMISE) datasets. On AEEEM, SCAG-LSTM is compared with the active learning-based method (ALTRA), the multi-source-based cross-project defect prediction method (MSCPDP), and the two-phase feature importance amplification method (TFIA). On PROMISE, it is compared with the two-phase transfer learning method (TPTL), the domain adaptive kernel twin support vector machines method (DA-KTSVMO), and the generative adversarial long-short term memory neural networks method (GB-CPDP). The results show that SCAG-LSTM improves on the three AEEEM baselines by 33.03%, 29.15%, and 1.48% in F1-measure and by 16.32%, 34.41%, and 3.59% in area under the curve (AUC). On PROMISE, it improves on the baselines' F1-measure by 42.60%, 32.00%, and 25.10% and their AUC by 34.90%, 27.80%, and 12.96%. These findings suggest that the proposed model exhibits strong predictive performance.
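The abstract reports gains in F1-measure and AUC but does not say how the metrics are computed or whether the percentages are relative or absolute differences. As a reminder of the standard definitions only, the following sketch computes both metrics from scratch. The function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions, not values from the paper.

```python
def f1_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (defective) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy target-project predictions (hypothetical): 1 = defective, 0 = clean.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

f1 = f1_measure(y_true, y_pred)
roc_auc = auc(y_true, scores)
print(round(f1, 4), round(roc_auc, 4))
```

F1-measure depends on the chosen threshold, whereas AUC depends only on how the scores rank defective modules against clean ones. That is one reason defect prediction studies commonly report both.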
