Performance of Machine Learning Algorithms with Different K Values in K-fold Cross-Validation

Isaac Kofi Nti, Owusu Nyarko-Boateng, J. Aning
International Journal of Information Technology and Computer Science · Published 2021-12-08 · DOI: 10.5815/ijitcs.2021.06.05 · Citations: 20

Abstract

The numerical value of k in the k-fold cross-validation technique used to train machine learning predictive models is an essential element that impacts model performance. A well-chosen k yields better accuracy, while a poorly chosen value can degrade the model's performance. In the literature, the most commonly used values of k are five (5) and ten (10), as these two values are believed to give test error rate estimates that suffer neither from extremely high bias nor from very high variance. However, there is no formal rule. To the best of our knowledge, few experimental studies have investigated the effect of diverse k values on the training of different machine learning models. This paper empirically analyses the prevalence and effect of distinct k values (3, 5, 7, 10, 15 and 20) on the validation performance of four well-known machine learning algorithms: Gradient Boosting Machine (GBM), Logistic Regression (LR), Decision Tree (DT) and K-Nearest Neighbours (KNN). We observed that, for the same classification task, the best value of k and the resulting model validation performance differ from one machine learning algorithm to another. However, our empirical results suggest that k = 7 offers a slight increase in validation accuracy and area under the curve (AUC), at lower computational cost than k = 10, across most of the algorithms studied. We discuss the study outcomes in detail and outline guidelines to help beginners in the machine learning field select the best k value and machine learning algorithm for a given task.
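The procedure the paper evaluates can be sketched in a few lines: shuffle the data, split it into k folds, train on k−1 folds, validate on the held-out fold, and average the fold accuracies. The following is a minimal pure-Python illustration (not the authors' code): the 1-nearest-neighbour classifier and the synthetic two-cluster dataset are illustrative assumptions, standing in for the four algorithms and real datasets used in the study.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the n sample indices and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def one_nn_predict(train_X, train_y, x):
    """Predict the label of the single nearest training point (1-NN)."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]

def cross_val_accuracy(X, y, k):
    """Mean validation accuracy over k folds: train on k-1 folds, test on the rest."""
    accs = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        tr_X = [X[i] for i in range(len(X)) if i not in held_out]
        tr_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct = sum(one_nn_predict(tr_X, tr_y, X[i]) == y[i] for i in fold)
        accs.append(correct / len(fold))
    return sum(accs) / len(accs)

# Toy binary task: two noisy clusters centred at (0, 0) and (3, 3).
rng = random.Random(42)
X = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0)) for c in (0, 3) for _ in range(50)]
y = [0] * 50 + [1] * 50

for k in (3, 5, 7, 10, 15, 20):  # the k values compared in the paper
    print(f"k={k:2d}  mean accuracy={cross_val_accuracy(X, y, k):.3f}")
```

Note the trade-off the paper measures: larger k means more training rounds (higher computational cost) and larger training sets per fold (lower bias but higher variance in the estimate), which is why intermediate values such as k = 7 can be attractive.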