基于连通性的无监督分类器的跨项目缺陷预测

2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE) Pub Date : 2016-05-14 DOI:10.1145/2884781.2884839

Feng Zhang, Q. Zheng, Ying Zou, A. Hassan

{"title":"基于连通性的无监督分类器的跨项目缺陷预测","authors":"Feng Zhang, Q. Zheng, Ying Zou, A. Hassan","doi":"10.1145/2884781.2884839","DOIUrl":null,"url":null,"abstract":"Defect prediction on projects with limited historical data has attracted great interest from both researchers and practitioners. Cross-project defect prediction has been the main area of progress by reusing classifiers from other projects. However, existing approaches require some degree of homogeneity (e.g., a similar distribution of metric values) between the training projects and the target project. Satisfying the homogeneity requirement often requires significant effort (currently a very active area of research). An unsupervised classifier does not require any training data, therefore the heterogeneity challenge is no longer an issue. In this paper, we examine two types of unsupervised classifiers: a) distance-based classifiers (e.g., k-means); and b) connectivity-based classifiers. While distance-based unsupervised classifiers have been previously used in the defect prediction literature with disappointing performance, connectivity-based classifiers have never been explored before in our community. We compare the performance of unsupervised classifiers versus supervised classifiers using data from 26 projects from three publicly available datasets (i.e., AEEEM, NASA, and PROMISE). In the cross-project setting, our proposed connectivity-based classifier (via spectral clustering) ranks as one of the top classifiers among five widely-used supervised classifiers (i.e., random forest, naive Bayes, logistic regression, decision tree, and logistic model tree) and five unsupervised classifiers (i.e., k-means, partition around medoids, fuzzy C-means, neural-gas, and spectral clustering). In the within-project setting (i.e., models are built and applied on the same project), our spectral classifier ranks in the second tier, while only random forest ranks in the first tier. Hence, connectivity-based unsupervised classifiers offer a viable solution for cross and within project defect predictions.","PeriodicalId":6485,"journal":{"name":"2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)","volume":"157 1","pages":"309-320"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"210","resultStr":"{\"title\":\"Cross-Project Defect Prediction Using a Connectivity-Based Unsupervised Classifier\",\"authors\":\"Feng Zhang, Q. Zheng, Ying Zou, A. Hassan\",\"doi\":\"10.1145/2884781.2884839\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Defect prediction on projects with limited historical data has attracted great interest from both researchers and practitioners. Cross-project defect prediction has been the main area of progress by reusing classifiers from other projects. However, existing approaches require some degree of homogeneity (e.g., a similar distribution of metric values) between the training projects and the target project. Satisfying the homogeneity requirement often requires significant effort (currently a very active area of research). An unsupervised classifier does not require any training data, therefore the heterogeneity challenge is no longer an issue. In this paper, we examine two types of unsupervised classifiers: a) distance-based classifiers (e.g., k-means); and b) connectivity-based classifiers. While distance-based unsupervised classifiers have been previously used in the defect prediction literature with disappointing performance, connectivity-based classifiers have never been explored before in our community. We compare the performance of unsupervised classifiers versus supervised classifiers using data from 26 projects from three publicly available datasets (i.e., AEEEM, NASA, and PROMISE). In the cross-project setting, our proposed connectivity-based classifier (via spectral clustering) ranks as one of the top classifiers among five widely-used supervised classifiers (i.e., random forest, naive Bayes, logistic regression, decision tree, and logistic model tree) and five unsupervised classifiers (i.e., k-means, partition around medoids, fuzzy C-means, neural-gas, and spectral clustering). In the within-project setting (i.e., models are built and applied on the same project), our spectral classifier ranks in the second tier, while only random forest ranks in the first tier. Hence, connectivity-based unsupervised classifiers offer a viable solution for cross and within project defect predictions.\",\"PeriodicalId\":6485,\"journal\":{\"name\":\"2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)\",\"volume\":\"157 1\",\"pages\":\"309-320\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-05-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"210\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2884781.2884839\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2884781.2884839","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 210

摘要

基于有限历史数据的项目缺陷预测已经引起了研究者和实践者的极大兴趣。通过重用来自其他项目的分类器，跨项目缺陷预测一直是进展的主要领域。然而，现有的方法在训练项目和目标项目之间需要某种程度的同质性(例如，度量值的相似分布)。满足同质性要求通常需要大量的努力(目前是一个非常活跃的研究领域)。无监督分类器不需要任何训练数据，因此异构性挑战不再是一个问题。在本文中，我们研究了两种类型的无监督分类器:a)基于距离的分类器(例如k-means);b)基于连接的分类器。虽然基于距离的无监督分类器以前在缺陷预测文献中使用过，但性能令人失望，但基于连接的分类器以前从未在我们的社区中进行过探索。我们使用来自三个公开数据集(即AEEEM, NASA和PROMISE)的26个项目的数据来比较无监督分类器和监督分类器的性能。在跨项目设置中，我们提出的基于连通性的分类器(通过谱聚类)在五个广泛使用的监督分类器(即随机森林、朴素贝叶斯、逻辑回归、决策树和逻辑模型树)和五个无监督分类器(即k-means、围绕介质的划分、模糊C-means、神经气体和谱聚类)中排名第一。在项目内设置(即在同一项目上构建和应用模型)中，我们的光谱分类器排在第二层，而只有随机森林排在第一层。因此，基于连通性的无监督分类器为跨项目和项目内部缺陷预测提供了可行的解决方案。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Cross-Project Defect Prediction Using a Connectivity-Based Unsupervised Classifier

Defect prediction on projects with limited historical data has attracted great interest from both researchers and practitioners. Cross-project defect prediction has been the main area of progress by reusing classifiers from other projects. However, existing approaches require some degree of homogeneity (e.g., a similar distribution of metric values) between the training projects and the target project. Satisfying the homogeneity requirement often requires significant effort (currently a very active area of research). An unsupervised classifier does not require any training data, therefore the heterogeneity challenge is no longer an issue. In this paper, we examine two types of unsupervised classifiers: a) distance-based classifiers (e.g., k-means); and b) connectivity-based classifiers. While distance-based unsupervised classifiers have been previously used in the defect prediction literature with disappointing performance, connectivity-based classifiers have never been explored before in our community. We compare the performance of unsupervised classifiers versus supervised classifiers using data from 26 projects from three publicly available datasets (i.e., AEEEM, NASA, and PROMISE). In the cross-project setting, our proposed connectivity-based classifier (via spectral clustering) ranks as one of the top classifiers among five widely-used supervised classifiers (i.e., random forest, naive Bayes, logistic regression, decision tree, and logistic model tree) and five unsupervised classifiers (i.e., k-means, partition around medoids, fuzzy C-means, neural-gas, and spectral clustering). In the within-project setting (i.e., models are built and applied on the same project), our spectral classifier ranks in the second tier, while only random forest ranks in the first tier. Hence, connectivity-based unsupervised classifiers offer a viable solution for cross and within project defect predictions.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)

自引率

0.00%

发文量

期刊最新文献

Scalable Thread Sharing Analysis Overcoming Open Source Project Entry Barriers with a Portal for Newcomers Nomen est Omen: Exploring and Exploiting Similarities between Argument and Parameter Names Reliability of Run-Time Quality-of-Service Evaluation Using Parametric Model Checking On the Techniques We Create, the Tools We Build, and Their Misalignments: A Study of KLEE