Handling Missing Values in Chronic Kidney Disease Datasets Using KNN, K-Means and K-Medoids Algorithms

Tahira Mahboob, A. Ijaz, Amber Shahzad, Muqadas Kalsoom
2018 12th International Conference on Open Source Systems and Technologies (ICOSST), published 2018-12-01
DOI: 10.1109/ICOSST.2018.8632179 (https://doi.org/10.1109/ICOSST.2018.8632179)
Citations: 14

Abstract

Missing values in large datasets have become a difficult problem for researchers and industry practitioners. In medicine in particular, datasets contain missing values due to human error or unavailability of data. If such datasets are used for inference or predictive studies, the results are not reliable. Discarding the affected instances is an option, but it affects overall accuracy, so it is preferable to apply a replacement or imputation technique. Imputation techniques estimate the missing values in a dataset by applying various algorithms. In this paper we present a framework that assists in imputing missing values in large Chronic Kidney Disease (CKD) datasets. We use three machine learning algorithms, namely K-Nearest Neighbors (KNN), K-Means, and K-Medoids clustering, to impute the missing values. The proposed technique is evaluated by applying Decision Tree and Random Forest algorithms. Experimental results demonstrate that KNN provides the most accurate results compared with the K-Means and K-Medoids clustering algorithms. KNN achieves an accuracy of 86.67% with the Decision Tree algorithm and 75.25% with the Random Forest algorithm. It also has lower relative, absolute, and root mean square errors. Consequently, KNN-imputed datasets are used in our research for future predictions.
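
The abstract does not spell out how the KNN imputation is implemented, so the following is only a minimal sketch of one common variant, not the authors' method. It assumes numeric features, marks missing entries as `None`, measures distance over the features two rows both observe, and fills each gap with the mean of that feature over the `k` nearest rows that have it. The function names, the toy rows, and the column meanings (age, blood pressure, serum creatinine) are hypothetical and are not taken from the CKD dataset.

```python
import math


def nan_euclidean(a, b):
    """Euclidean distance over features observed in both rows.

    The squared sum is rescaled by (total features / shared features) so that
    rows with fewer shared features are not favoured. Returns infinity when
    the rows share no observed feature.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(squared * len(a) / len(pairs))


def knn_impute(rows, k=2):
    """Return a copy of rows with each None replaced by a k-neighbour mean.

    Distances are always computed on the original rows, so the result does
    not depend on the order in which gaps are filled.
    """
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors: other rows that actually observe feature j.
            donors = [
                (nan_euclidean(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if math.isfinite(d[0])]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap if no usable donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Hypothetical toy data: [age, blood pressure, serum creatinine].
rows = [
    [48.0, 80.0, 1.2],
    [50.0, 80.0, None],
    [62.0, 90.0, 3.8],
    [51.0, 80.0, 1.4],
    [60.0, None, 3.2],
]

filled = knn_impute(rows, k=2)
for row in filled:
    print(row)
```

In this toy run the missing creatinine value in the second row is filled from its two closest rows (the first and fourth), giving roughly 1.3, and the missing blood pressure in the last row is filled from the third and fourth rows, giving 85.0. The features here are left unscaled, so age dominates the distance; in practice one would likely standardise the columns first. K-Means or K-Medoids imputation would differ mainly in the donor set: each gap would be filled from the centroid or medoid of the row's cluster rather than from its individual neighbours.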