Performance of Machine Learning Algorithms with Different K Values in K-fold Cross-Validation

Isaac Kofi Nti, Owusu Nyarko-Boateng, J. Aning
International Journal of Information Technology and Computer Science · Published 2021-12-08 · DOI: 10.5815/ijitcs.2021.06.05 · Citations: 20

Abstract

The numerical value of k in the k-fold cross-validation technique used to train machine learning predictive models is an essential element that impacts model performance. A well-chosen k yields better accuracy, while a poorly chosen value can degrade the model's performance. In the literature, the most commonly used values of k are five (5) and ten (10), as these two values are believed to give test error rate estimates that suffer neither from extremely high bias nor from very high variance. However, there is no formal rule. To the best of our knowledge, few experimental studies have investigated the effect of diverse k values on the training of different machine learning models. This paper empirically analyses the prevalence and effect of distinct k values (3, 5, 7, 10, 15 and 20) on the validation performance of four well-known machine learning algorithms: Gradient Boosting Machine (GBM), Logistic Regression (LR), Decision Tree (DT) and K-Nearest Neighbours (KNN). We observed that, for the same classification task, the best value of k and the resulting model validation performance differ from one machine learning algorithm to another. However, our empirical results suggest that k = 7 offers a slight increase in validation accuracy and area under the curve (AUC), at lower computational cost than k = 10, across most of the algorithms studied. We discuss the study outcomes in detail and outline guidelines to help beginners in the machine learning field select the best k value and machine learning algorithm for a given task.
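The procedure the paper evaluates can be sketched in a few lines: shuffle the data, split it into k folds, train on k−1 folds, validate on the held-out fold, and average the fold accuracies. The following is a minimal pure-Python illustration (not the authors' code): the 1-nearest-neighbour classifier and the synthetic two-cluster dataset are illustrative assumptions, standing in for the four algorithms and real datasets used in the study.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the n sample indices and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def one_nn_predict(train_X, train_y, x):
    """Predict the label of the single nearest training point (1-NN)."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]

def cross_val_accuracy(X, y, k):
    """Mean validation accuracy over k folds: train on k-1 folds, test on the rest."""
    accs = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        tr_X = [X[i] for i in range(len(X)) if i not in held_out]
        tr_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct = sum(one_nn_predict(tr_X, tr_y, X[i]) == y[i] for i in fold)
        accs.append(correct / len(fold))
    return sum(accs) / len(accs)

# Toy binary task: two noisy clusters centred at (0, 0) and (3, 3).
rng = random.Random(42)
X = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0)) for c in (0, 3) for _ in range(50)]
y = [0] * 50 + [1] * 50

for k in (3, 5, 7, 10, 15, 20):  # the k values compared in the paper
    print(f"k={k:2d}  mean accuracy={cross_val_accuracy(X, y, k):.3f}")
```

Note the trade-off the paper measures: larger k means more training rounds (higher computational cost) and larger training sets per fold (lower bias but higher variance in the estimate), which is why intermediate values such as k = 7 can be attractive.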