Performance evaluation of supervised learning algorithms with various training data sizes and missing attributes

Chaluemwut Noyunsan, Tatpong Katanyukul, K. Saikaew
{"title":"具有不同训练数据大小和缺失属性的监督学习算法的性能评估","authors":"Chaluemwut Noyunsan, Tatpong Katanyukul, K. Saikaew","doi":"10.14456/EASR.2018.28","DOIUrl":null,"url":null,"abstract":"Supervised learning is a machine learning technique used for creating a data prediction model. This article focuses on finding high performance supervised learning algorithms with varied training data sizes, varied number of attributes, and time spent on prediction. This studied evaluated seven algorithms, Boosting, Random Forest, Bagging, Naive Bayes, K-Nearest Neighbours (K-NN), Decision Tree, and Support Vector Machine (SVM), on seven data sets that are the standard benchmark from University of California, Irvine (UCI) with two evaluation metrics and experimental settings of various training data sizes and missing key attributes. Our findings reveal that Bagging, Random Forest, and SVM are overall the three most accurate algorithms. However, when presence of key attribute values is of concern, K-NN is recommended as its performance is affected the least. Alternatively, when training data sizes may be not large enough, Naive Bayes is preferable since it is the most insensitive algorithm to training data sizes. The algorithms are characterized on a two-dimension chart based on prediction performance and computation time. This chart is expected to guide a novice user to choose an appropriate method for his/her demand. Based on this chart, in general, Bagging and Random Forest are the two most recommended algorithms because of their high performance and speed.","PeriodicalId":37310,"journal":{"name":"Engineering and Applied Science Research","volume":"45 1","pages":"221-229"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Performance evaluation of supervised learning algorithms with various training data sizes and missing attributes\",\"authors\":\"Chaluemwut Noyunsan, Tatpong Katanyukul, K. Saikaew\",\"doi\":\"10.14456/EASR.2018.28\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Supervised learning is a machine learning technique used for creating a data prediction model. This article focuses on finding high performance supervised learning algorithms with varied training data sizes, varied number of attributes, and time spent on prediction. This studied evaluated seven algorithms, Boosting, Random Forest, Bagging, Naive Bayes, K-Nearest Neighbours (K-NN), Decision Tree, and Support Vector Machine (SVM), on seven data sets that are the standard benchmark from University of California, Irvine (UCI) with two evaluation metrics and experimental settings of various training data sizes and missing key attributes. Our findings reveal that Bagging, Random Forest, and SVM are overall the three most accurate algorithms. However, when presence of key attribute values is of concern, K-NN is recommended as its performance is affected the least. Alternatively, when training data sizes may be not large enough, Naive Bayes is preferable since it is the most insensitive algorithm to training data sizes. The algorithms are characterized on a two-dimension chart based on prediction performance and computation time. This chart is expected to guide a novice user to choose an appropriate method for his/her demand. 
Based on this chart, in general, Bagging and Random Forest are the two most recommended algorithms because of their high performance and speed.\",\"PeriodicalId\":37310,\"journal\":{\"name\":\"Engineering and Applied Science Research\",\"volume\":\"45 1\",\"pages\":\"221-229\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-09-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Engineering and Applied Science Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.14456/EASR.2018.28\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"Engineering\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering and Applied Science Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14456/EASR.2018.28","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Engineering","Score":null,"Total":0}
Citations: 5

Abstract

Supervised learning is a machine learning technique used for creating data prediction models. This article focuses on finding high-performance supervised learning algorithms under varied training data sizes, varied numbers of attributes, and with attention to the time spent on prediction. This study evaluated seven algorithms, Boosting, Random Forest, Bagging, Naive Bayes, K-Nearest Neighbours (K-NN), Decision Tree, and Support Vector Machine (SVM), on seven standard benchmark data sets from the University of California, Irvine (UCI) repository, using two evaluation metrics and experimental settings with various training data sizes and missing key attributes. Our findings reveal that Bagging, Random Forest, and SVM are overall the three most accurate algorithms. However, when missing key attribute values are a concern, K-NN is recommended, as its performance is affected the least. Alternatively, when the training data may not be large enough, Naive Bayes is preferable, since it is the algorithm least sensitive to training data size. The algorithms are characterized on a two-dimensional chart based on prediction performance and computation time. This chart is expected to guide a novice user in choosing an appropriate method for his/her needs. Based on this chart, in general, Bagging and Random Forest are the two most recommended algorithms because of their high performance and speed.
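The evaluation protocol outlined in the abstract (the same set of classifiers, shrinking training sets, deliberately removed attribute values, and recorded prediction time) can be illustrated in code. The sketch below is not the authors' actual experiment: it assumes scikit-learn implementations of the seven algorithms, uses the bundled Iris data as a stand-in for the seven UCI benchmarks, picks accuracy as the single metric, and simulates missing key attributes by masking 20% of test-set values and mean-imputing them; all of these choices are assumptions, not details taken from the paper.

```python
# A minimal sketch of the kind of evaluation the abstract describes.
# Assumptions (not from the paper): scikit-learn classifiers, the Iris
# dataset as a stand-in for the UCI benchmarks, accuracy as the metric,
# and mean imputation for the values masked out to simulate missing
# key attributes.
import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              BaggingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "Boosting": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "K-NN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

rng = np.random.default_rng(0)

for frac in (0.25, 0.5, 1.0):                 # varied training data sizes
    n = int(frac * len(X_train_full))
    X_train, y_train = X_train_full[:n], y_train_full[:n]

    # Simulate missing key attribute values: mask 20% of the test entries
    # at random, then let the pipeline impute them with training means.
    X_test_missing = X_test.copy()
    mask = rng.random(X_test_missing.shape) < 0.2
    X_test_missing[mask] = np.nan

    for name, model in models.items():
        clf = make_pipeline(SimpleImputer(strategy="mean"), model)
        clf.fit(X_train, y_train)
        start = time.perf_counter()
        y_pred = clf.predict(X_test_missing)
        elapsed = time.perf_counter() - start
        acc = accuracy_score(y_test, y_pred)
        print(f"{frac:>4.0%} train  {name:<13} "
              f"accuracy={acc:.3f}  predict_time={elapsed * 1000:.2f} ms")
```

Averaging each model's accuracy and prediction time over the runs and plotting one against the other would give a rough analogue of the two-dimensional performance/computation-time chart the abstract describes.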
Journal

Engineering and Applied Science Research (Engineering, all)
CiteScore: 2.10
Self-citation rate: 0.00%
Articles published: 2
Review time: 11 weeks
About the journal: Publication of the journal started in 1974. Its original name was "KKU Engineering Journal", and both English and Thai manuscripts were accepted. The journal was originally aimed at publishing research conducted and implemented in the northeast of Thailand. It is regarded as a national journal and has been indexed in the Thai-Journal Citation Index (TCI) database since 2004. The journal now accepts only English-language manuscripts and became open access in 2015 to attract more international readers. It was renamed Engineering and Applied Science Research in 2017, as the editorial team agreed to publish more international papers and the new title is more appropriate. The journal focuses on engineering research that not only presents highly original ideas and advanced technology, but also practical applications of appropriate technology.