The impact of feature selection techniques on effort-aware defect prediction: An empirical study

IF 1.5 | Zone 4, Computer Science | Q3 Computer Science, Software Engineering | IET Software | Pub Date: 2023-02-05 | DOI: 10.1049/sfw2.12099
Fuyang Li, Wanpeng Lu, Jacky Wai Keung, Xiao Yu, Lina Gong, Juan Li
IET Software, vol. 17, no. 2, pp. 168-193. URL: https://onlinelibrary.wiley.com/doi/10.1049/sfw2.12099
Citations: 9

Abstract

Effort-Aware Defect Prediction (EADP) methods rank software modules by defect density and guide the testing team to inspect the modules with the highest defect density first. Previous studies indicated that some feature selection methods can improve the performance of Classification-Based Defect Prediction (CBDP) models, and that the Correlation-based feature subset selection method with the Best First strategy (CorBF) performed best. However, the practical benefit of feature selection for EADP is still unknown, and blindly applying CorBF, the best performer in CBDP, to pre-process defect datasets may fail to improve EADP models and may even degrade their performance. To assess the impact of feature selection techniques on EADP, we examined 24 feature selection methods with 10 classifiers embedded in a state-of-the-art EADP model (CBS+) on 41 PROMISE defect datasets, using six evaluation metrics to assess EADP performance comprehensively. The results show that (1) the impact of feature selection methods varies across classifiers and datasets; (2) the four wrapper-based feature subset selection methods with forwards search, that is, AdaBoost with Forwards Search, Deep Forest with Forwards Search, Random Forest with Forwards Search, and XGBoost with Forwards Search (XGBF), outperform the other methods across the studied classifiers and datasets, and XGBF with XGBoost as the embedded classifier in CBS+ performs best; (3) CorBF, the best-performing method in CBDP, does not perform well on the EADP task; (4) the selected features vary with the feature selection method and the dataset, and the features noc (number of children), ic (inheritance coupling), cbo (coupling between object classes), and cbm (coupling between methods) are frequently selected by the four wrapper-based methods with forwards search; and (5) using AdaBoost, deep forest, random forest, and XGBoost as the base classifiers embedded in CBS+ can achieve the best performance. In summary, we recommend that software testing teams employ XGBF with XGBoost as the embedded classifier in CBS+ to improve EADP performance.
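
The two ideas the abstract combines, ranking modules by defect density and selecting features with a wrapper-based forwards search, can be illustrated with a small sketch. This is not CBS+ and not the authors' pipeline. The scoring rule (a plain sum of the selected feature values) is a stand-in for a trained classifier, the evaluation metric (recall of defective modules within a 30% lines-of-code budget) is an assumption because the abstract does not name its six metrics, and all data values and function names are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy data: five modules, three features (cbo, noc, noise).
X = [[8, 0, 1], [3, 1, 9], [5, 0, 0], [4, 1, 30], [1, 6, 2]]
LOC = [100, 300, 50, 430, 120]   # module size, used as inspection effort
DEFECTIVE = [1, 0, 1, 0, 1]      # ground-truth labels


def rank_by_density(scores: Sequence[float], loc: Sequence[int]) -> List[int]:
    """Return module indices sorted by predicted defect density, highest first."""
    density = [s / max(l, 1) for s, l in zip(scores, loc)]
    return sorted(range(len(scores)), key=lambda i: density[i], reverse=True)


def recall_at_effort(order: Sequence[int], loc: Sequence[int],
                     defective: Sequence[int], budget: float = 0.3) -> float:
    """Share of defective modules found before the LOC budget runs out."""
    limit = budget * sum(loc)
    spent, found = 0, 0
    for i in order:
        if spent + loc[i] > limit:
            break                # next module in the ranking does not fit
        spent += loc[i]
        found += defective[i]
    total = sum(defective)
    return found / total if total else 0.0


def evaluate(subset: Sequence[int]) -> float:
    """Stand-in for 'train a classifier on `subset`, then score the ranking'."""
    scores = [sum(row[f] for f in subset) for row in X]
    return recall_at_effort(rank_by_density(scores, LOC), LOC, DEFECTIVE)


def forward_search(n_features: int,
                   score_fn: Callable[[Sequence[int]], float]
                   ) -> Tuple[List[int], float]:
    """Greedy wrapper: keep adding the single feature that helps most."""
    selected: List[int] = []
    best = float("-inf")
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        score, feat = max((score_fn(selected + [f]), f) for f in candidates)
        if score <= best:
            break                # no remaining feature improves the score
        selected.append(feat)
        best = score
    return selected, best


if __name__ == "__main__":
    print(forward_search(3, evaluate))
```

On this toy data the search keeps the two coupling-style features and drops the noise column, because adding the noise feature pushes a large clean module up the ranking and exhausts the budget early. In a real study `evaluate` would train and validate the embedded classifier on each candidate subset, which is what makes wrapper methods expensive.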
