IC-GraF: An Improved Clustering with Graph-Embedding-Based Features for Software Defect Prediction

IF 1.5 4区计算机科学 Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING IET Software Pub Date : 2024-09-16 DOI:10.1049/2024/8027037

Xuanye Wang, Lu Lu, Qingyan Tian, Haishan Lin

{"title":"IC-GraF: An Improved Clustering with Graph-Embedding-Based Features for Software Defect Prediction","authors":"Xuanye Wang, Lu Lu, Qingyan Tian, Haishan Lin","doi":"10.1049/2024/8027037","DOIUrl":null,"url":null,"abstract":"<div>\n <p>Software defect prediction (SDP) has been a prominent area of research in software engineering. Previous SDP methods often struggled in industrial applications, primarily due to the need for sufficient historical data. Thus, clustering-based unsupervised defect prediction (CUDP) and cross-project defect prediction (CPDP) emerged to address this challenge. However, the former exhibited limitations in capturing semantic and structural features, while the latter encountered constraints due to differences in data distribution across projects. Therefore, we introduce a novel framework called improved clustering with graph-embedding-based features (IC-GraF) for SDP without the reliance on historical data. First, a preprocessing operation is performed to extract program dependence graphs (PDGs) and mark distinct dependency relationships within them. Second, the improved deep graph infomax (IDGI) model, an extension of the DGI model specifically for SDP, is designed to generate graph-level representations of PDGs. Finally, a heuristic-based k-means clustering algorithm is employed to classify the features generated by IDGI. To validate the efficacy of IC-GraF, we conduct experiments based on 24 releases of the PROMISE dataset, using F-measure and G-measure as evaluation criteria. The findings indicate that IC-GraF achieves 5.0%−42.7% higher F-measure, 5%−39.4% higher G-measure, and 2.5%−11.4% higher AUC over existing CUDP methods. Even when compared with eight supervised learning-based SDP methods, IC-GraF maintains a superior competitive edge.</p>\n </div>","PeriodicalId":50378,"journal":{"name":"IET Software","volume":"2024 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/2024/8027037","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Software","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/2024/8027037","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

Abstract

Software defect prediction (SDP) has been a prominent area of research in software engineering. Previous SDP methods often struggled in industrial applications, primarily due to the need for sufficient historical data. Thus, clustering-based unsupervised defect prediction (CUDP) and cross-project defect prediction (CPDP) emerged to address this challenge. However, the former exhibited limitations in capturing semantic and structural features, while the latter encountered constraints due to differences in data distribution across projects. Therefore, we introduce a novel framework called improved clustering with graph-embedding-based features (IC-GraF) for SDP without the reliance on historical data. First, a preprocessing operation is performed to extract program dependence graphs (PDGs) and mark distinct dependency relationships within them. Second, the improved deep graph infomax (IDGI) model, an extension of the DGI model specifically for SDP, is designed to generate graph-level representations of PDGs. Finally, a heuristic-based k-means clustering algorithm is employed to classify the features generated by IDGI. To validate the efficacy of IC-GraF, we conduct experiments based on 24 releases of the PROMISE dataset, using F-measure and G-measure as evaluation criteria. The findings indicate that IC-GraF achieves 5.0%−42.7% higher F-measure, 5%−39.4% higher G-measure, and 2.5%−11.4% higher AUC over existing CUDP methods. Even when compared with eight supervised learning-based SDP methods, IC-GraF maintains a superior competitive edge.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

IC-GraF：基于图形嵌入特征的改进聚类，用于软件缺陷预测

软件缺陷预测（SDP）一直是软件工程的一个重要研究领域。以前的 SDP 方法在工业应用中往往举步维艰，主要原因是需要足够的历史数据。因此，基于聚类的无监督缺陷预测（CUDP）和跨项目缺陷预测（CPDP）应运而生，以应对这一挑战。然而，前者在捕捉语义和结构特征方面表现出局限性，而后者则因跨项目数据分布的差异而遇到限制。因此，我们为 SDP 引入了一种新的框架，即基于图嵌入特征的改进聚类（IC-GraF），而无需依赖历史数据。首先，进行预处理操作以提取程序依赖图（PDGs），并标记其中不同的依赖关系。其次，设计了改进的深度图 infomax（IDGI）模型，该模型是专门针对 SDP 的 DGI 模型的扩展，用于生成 PDGs 的图级表示。最后，采用基于启发式的 k-means 聚类算法对 IDGI 生成的特征进行分类。为了验证 IC-GraF 的功效，我们使用 F-measure 和 G-measure 作为评估标准，基于 24 个发布的 PROMISE 数据集进行了实验。结果表明，与现有的 CUDP 方法相比，IC-GraF 的 F-measure 高出 5.0%-42.7%，G-measure 高出 5%-39.4%，AUC 高出 2.5%-11.4%。即使与八种基于监督学习的 SDP 方法相比，IC-GraF 也保持了卓越的竞争优势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IET Software 工程技术-计算机：软件工程

CiteScore

4.20

自引率

0.00%

发文量

审稿时长

9 months

期刊介绍： IET Software publishes papers on all aspects of the software lifecycle, including design, development, implementation and maintenance. The focus of the journal is on the methods used to develop and maintain software, and their practical application. Authors are especially encouraged to submit papers on the following topics, although papers on all aspects of software engineering are welcome: Software and systems requirements engineering Formal methods, design methods, practice and experience Software architecture, aspect and object orientation, reuse and re-engineering Testing, verification and validation techniques Software dependability and measurement Human systems engineering and human-computer interaction Knowledge engineering; expert and knowledge-based systems, intelligent agents Information systems engineering Application of software engineering in industry and commerce Software engineering technology transfer Management of software development Theoretical aspects of software development Machine learning Big data and big code Cloud computing Current Special Issue. Call for papers: Knowledge Discovery for Software Development - https://digital-library.theiet.org/files/IET_SEN_CFP_KDSD.pdf Big Data Analytics for Sustainable Software Development - https://digital-library.theiet.org/files/IET_SEN_CFP_BDASSD.pdf