Study of Data Mining Algorithms Using a Dataset from the Size-Effect on Open Source Software Defects

Kirkuk University Journal-Scientific Studies Pub Date: 2020-06-01 DOI: 10.32894/kujss.2020.15.2.3

Muthana Yaseen Nawaf, M. M. Rashid


Citations: 0

Abstract

This article evaluates data mining algorithms in terms of accuracy and time consumption. To identify the best-performing algorithm among classification and clustering methods, the algorithms are tested in WEKA on a real dataset from a study of the size effect on defect proneness in open source software, with the Mozilla product adopted as the example. The dataset is the output of that study, which shows a significant relationship between software size and defect proneness, and it serves as the input to WEKA for comparing the data mining algorithms. We use Naive Bayes, Decision Tree J48, and Expectation-Maximization for classification, and K-Star and SimpleKMeans for clustering. The findings show how the algorithms differ in accuracy and in the time each takes to reach a result. Furthermore, software size has a significant effect on defect proneness. Finally, the paper identifies the best algorithm in terms of accuracy and time consumption, so that it can be relied on in classification and clustering tests.
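The comparison criteria named in the abstract, accuracy and time to reach a result, can be illustrated with a small sketch. This is not the authors' WEKA setup: the classes `GaussianNaiveBayes` and `MajorityClass`, the helper `evaluate`, and the tiny size-versus-defect table are all hypothetical and invented for illustration, using only the Python standard library.

```python
import math
import time
from collections import defaultdict


class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes for numeric features (illustrative only)."""

    def fit(self, X, y):
        grouped = defaultdict(list)
        for row, label in zip(X, y):
            grouped[label].append(row)
        self.stats = {}
        for label, rows in grouped.items():
            prior = len(rows) / len(X)
            params = []
            for col in zip(*rows):
                mean = sum(col) / len(col)
                # Small constant keeps the variance strictly positive.
                var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
                params.append((mean, var))
            self.stats[label] = (prior, params)
        return self

    def predict(self, X):
        return [self._predict_one(row) for row in X]

    def _predict_one(self, row):
        best_label, best_score = None, -math.inf
        for label, (prior, params) in self.stats.items():
            # Log prior plus the sum of per-feature Gaussian log-likelihoods.
            score = math.log(prior)
            for v, (mean, var) in zip(row, params):
                score += -0.5 * math.log(2 * math.pi * var)
                score -= (v - mean) ** 2 / (2 * var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class MajorityClass:
    """Baseline that always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label] * len(X)


def evaluate(model, X_train, y_train, X_test, y_test):
    """Return (accuracy, seconds spent on fitting plus prediction)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    elapsed = time.perf_counter() - start
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return correct / len(y_test), elapsed


# Hypothetical toy data: one feature (module size in lines of code).
X_train = [[120], [150], [200], [90], [110], [160],
           [1800], [2200], [2500], [3000]]
y_train = ["clean"] * 6 + ["defective"] * 4
X_test = [[100], [2400], [180], [2000]]
y_test = ["clean", "defective", "clean", "defective"]

results = {}
for model in (GaussianNaiveBayes(), MajorityClass()):
    results[type(model).__name__] = evaluate(
        model, X_train, y_train, X_test, y_test
    )

for name, (accuracy, seconds) in results.items():
    print(f"{name}: accuracy={accuracy:.2f}, time={seconds:.6f}s")
```

Timing a single run on so few rows is dominated by noise, so a real comparison would presumably repeat each measurement and use cross-validation, as WEKA's evaluation tools allow.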
