CLAMI: Defect Prediction on Unlabeled Datasets (T)

2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 452-463. Pub Date: 2015-11-09. DOI: 10.1109/ASE.2015.56

Jaechang Nam, Sunghun Kim


Citations: 124

Abstract

Defect prediction on new projects or projects with limited historical data is an interesting problem in software engineering, largely because it is difficult to collect the defect information needed to label a dataset for training a prediction model. Cross-project defect prediction (CPDP) has tried to address this problem by reusing prediction models built from other projects that have enough historical data. However, CPDP does not always build a strong prediction model because datasets differ in their distributions. Approaches for defect prediction on unlabeled datasets have also tried to address the problem by adopting unsupervised learning, but they share one major limitation: the need for manual effort. In this study, we propose novel approaches, CLA and CLAMI, that show the potential for defect prediction on unlabeled datasets in an automated manner, without manual effort. The key idea of the CLA and CLAMI approaches is to label an unlabeled dataset by using the magnitude of metric values. In our empirical study on seven open-source projects, the CLAMI approach achieved promising prediction performance, 0.636 in average f-measure and 0.723 in average AUC, which is comparable to that of defect prediction based on supervised learning.
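
The labeling idea in the abstract (using the magnitude of metric values) can be illustrated with a minimal sketch. The abstract does not state the cutoff, so the sketch assumes a per-metric median as the threshold and assumes that instances with many above-threshold metrics are labeled defect-prone. The function name `cla_label` and the sample data are hypothetical, and this covers only a labeling step, not the full CLAMI pipeline.

```python
from statistics import median


def cla_label(instances):
    """Label each instance True (defect-prone) or False (clean).

    `instances` is a list of rows, each row a list of numeric metric values.
    Assumption: the per-metric cutoff is the median over all instances.
    """
    n_metrics = len(instances[0])
    # Per-metric cutoff (assumed to be the median of that metric).
    cutoffs = [median(row[j] for row in instances) for j in range(n_metrics)]
    # K = how many of an instance's metric values exceed their cutoff.
    k_scores = [sum(v > c for v, c in zip(row, cutoffs)) for row in instances]
    # Instances sharing the same K form a group; groups whose K lies above
    # the median of the distinct K values are labeled defect-prone (assumed).
    k_cut = median(sorted(set(k_scores)))
    return [k > k_cut for k in k_scores]


# Hypothetical data: 4 instances x 3 metrics (e.g. size, complexity, churn).
data = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [10, 10, 10]]
print(cla_label(data))  # [False, False, True, True]
```

No defect labels are consumed anywhere in the sketch, which is the point of the approach: the labels are derived from the metric values alone and could then be used to train an ordinary classifier.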


