COMPARISON OF SIMPLE MISSING DATA IMPUTATION TECHNIQUES FOR NUMERICAL AND CATEGORICAL DATASETS

Ramu Gautam, Shahram Latifi
International Journal of Research in Engineering and Applied Sciences, published 2023-04-08
DOI: 10.46565/jreas.202381468-475 (https://doi.org/10.46565/jreas.202381468-475)

Abstract

Almost every dataset has missing data. Common causes are sensor error, equipment malfunction, human error, and translation loss. We study how accurately statistical (mean, median, mode) and machine-learning-based (k-nearest neighbors, kNN) imputation methods fill in missing values in numerical datasets with data missing not at random (MNAR) and data missing completely at random (MCAR), as well as in categorical datasets. The imputed datasets are used to make predictions on a test set, and the mean squared error (MSE) of those predictions is used as the measure of imputation performance. The mean absolute difference between the original and imputed data is also measured. When the data are MCAR, kNN imputation yields the lowest MSE on all datasets, making it the most accurate method. When less than 20% of the data is missing, mean and median imputation are effective in regression problems. kNN imputation is better at 20% missingness and significantly better when 50% or more of the data is missing. For the kNN method, k = 5 gives better results than k = 3, while k = 10 gives results similar to k = 5. For MNAR datasets, the statistical methods yield similar or lower MSE than kNN imputation when fewer than 25% of instances have a missing feature; at higher missingness levels, kNN imputation is superior. Given enough data points without missing features, deleting the instances with missing data may be a better choice at lower missingness levels. For categorical data imputation, kNN and mode imputation are both effective.
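
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or data: the synthetic three-feature dataset, the 20% MCAR rate, the helper names (`mcar_mask`, `hide`, `impute_statistic`, `knn_impute`, `mean_abs_diff`), and the choice of distance (mean squared difference over features observed in both rows) are all assumptions made for illustration. It covers only the mean absolute difference between original and imputed values, not the downstream prediction MSE or the MNAR case.

```python
import random
import statistics

def mcar_mask(n_rows, n_cols, rate, rng):
    """MCAR: every cell is dropped independently with the same probability."""
    return [[rng.random() < rate for _ in range(n_cols)] for _ in range(n_rows)]

def hide(data, mask):
    """Replace masked cells with None."""
    return [[None if m else v for v, m in zip(row, mrow)]
            for row, mrow in zip(data, mask)]

def column_fill_values(data, stat):
    """Apply stat (e.g. statistics.mean) to the observed values of each column.
    Assumes every column has at least one observed value."""
    return [stat([v for v in col if v is not None]) for col in zip(*data)]

def impute_statistic(data, stat):
    """Fill each missing cell with a per-column statistic."""
    fills = column_fill_values(data, stat)
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in data]

def knn_impute(data, k):
    """Fill each missing cell with the mean of that column over the k nearest
    rows in which the column is observed (hypothetical distance: mean squared
    difference over the features observed in both rows)."""
    fallback = column_fill_values(data, statistics.mean)
    result = [list(row) for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(data):
                if other_i == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    candidates.append((sum(shared) / len(shared), other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            # Rows with no usable neighbor fall back to the column mean.
            result[i][j] = statistics.mean(nearest) if nearest else fallback[j]
    return result

def mean_abs_diff(original, imputed, mask):
    """Mean absolute difference over the cells that were hidden."""
    diffs = [abs(o - p)
             for orow, prow, mrow in zip(original, imputed, mask)
             for o, p, m in zip(orow, prow, mrow) if m]
    return statistics.mean(diffs)

# Synthetic data with correlated features (an assumption, not the paper's data).
rng = random.Random(0)
original = []
for _ in range(200):
    x = rng.uniform(0, 10)
    original.append([x, 2 * x + rng.gauss(0, 0.5), -x + rng.gauss(0, 0.5)])

mask = mcar_mask(len(original), 3, 0.2, rng)
observed = hide(original, mask)

imputed = {
    "mean": impute_statistic(observed, statistics.mean),
    "median": impute_statistic(observed, statistics.median),
    "kNN (k=5)": knn_impute(observed, k=5),
}
scores = {name: mean_abs_diff(original, filled, mask)
          for name, filled in imputed.items()}
for name, score in scores.items():
    print(f"{name}: mean absolute difference = {score:.3f}")
```

Changing the `rate` argument or `k` gives a rough feel for the trends the abstract reports, although results on this toy data should not be read as a reproduction of the paper's findings. For categorical columns, `statistics.mode` can be passed to `impute_statistic` in place of the mean.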