Handling Missing Values in Chronic Kidney Disease Datasets Using KNN, K-Means and K-Medoids Algorithms

2018 12th International Conference on Open Source Systems and Technologies (ICOSST) | Pub Date: 2018-12-01 | DOI: 10.1109/ICOSST.2018.8632179

Tahira Mahboob, A. Ijaz, Amber Shahzad, Muqadas Kalsoom


Citations: 14

Abstract

Handling missing values in large datasets has become a difficult task for researchers and industry practitioners. In medicine in particular, datasets contain missing values due to human error or non-availability of data. If such datasets are used for inference or predictive studies, the results are not very reliable. Discarding the affected instances is an option, but it affects overall accuracy, so it is preferable to apply a replacement or imputation technique. Imputation techniques estimate the missing values in a dataset by applying various algorithms. In this paper we present a framework that assists in imputing missing values in large Chronic Kidney Disease (CKD) datasets. We use three machine learning algorithms, namely K-Nearest Neighbors (KNN), K-Means and K-Medoids clustering, to impute the missing values. The proposed technique is evaluated by applying Decision Tree and Random Forest algorithms. Experimental results demonstrate that the KNN algorithm provides the most accurate results compared with the K-Means and K-Medoids clustering algorithms. KNN achieves an accuracy of 86.67% with the Decision Tree algorithm and 75.25% with the Random Forest algorithm. It also yields lower relative, absolute and root mean square errors. Consequently, KNN-imputed datasets are used in our research for future predictions.
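The abstract does not state the distance metric, the value of k, or any preprocessing, so the following is only a minimal sketch of the general idea behind KNN imputation, not the authors' implementation. The function name `knn_impute`, the choice of k, the distance definition and the toy values are all assumptions made for illustration; none of the numbers come from the CKD dataset.

```python
import math


def knn_impute(rows, k=3):
    """Return a copy of rows with each None replaced by a neighbor average.

    Assumed scheme (not taken from the paper): the distance between two
    rows is the root mean squared difference over the features that are
    observed in both rows, and a missing entry is filled with the mean of
    that feature over the k nearest rows in which it is observed.
    """
    imputed = [list(row) for row in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbor must be a different row that has feature j.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue  # no common features, distance undefined
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # stays None if no usable neighbor exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical toy data: three numeric features, one missing entry.
rows = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
]
filled = knn_impute(rows, k=2)
print(filled[1])  # third value is the mean of the two closest rows' values
```

In this toy example the two nearest rows to the incomplete one are the first and third, so the gap is filled with a value close to 2.9, while the distant fourth row is ignored. In practice, features measured on different scales would normally be normalized before computing distances, and categorical attributes would need a different treatment such as a majority vote.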


Source journal

2018 12th International Conference on Open Source Systems and Technologies (ICOSST)

Self-citation rate

0.00%

Publication count