Comparison of Simple Missing Data Imputation Techniques for Numerical and Categorical Datasets
Ramu Gautam, Shahram Latifi
International Journal of Research in Engineering and Applied Sciences, published 2023-04-08
DOI: https://doi.org/10.46565/jreas.202381468-475
Abstract
Almost every dataset has missing data. Common causes include sensor error, equipment malfunction, human error, and translation loss. We study how accurately statistical (mean, median, mode) and machine-learning-based (k-nearest neighbors, kNN) imputation methods fill in missing data in numerical datasets, with data missing not at random (MNAR) and missing completely at random (MCAR), as well as in categorical datasets. The imputed datasets are used to make predictions on a test set, and the mean squared error (MSE) of those predictions serves as the measure of imputation performance. The mean absolute difference between the original and imputed data is also reported. When the data are MCAR, kNN imputation yields the lowest MSE on all datasets, making it the most accurate method. When less than 20% of the data is missing, mean and median imputation are effective in regression problems. kNN imputation is better at 20% missingness and significantly better when 50% or more of the data is missing. For the kNN method, k = 5 gives better results than k = 3, while k = 10 gives results similar to k = 5. For MNAR datasets, statistical methods yield similar or lower MSE than kNN imputation when fewer than 25% of instances have a missing feature; at higher missingness levels, kNN imputation is superior. Given enough data points without missing features, deleting the instances with missing data may be a better choice at lower missingness levels. For categorical data imputation, kNN and mode imputation are both effective.
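
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the synthetic three-feature dataset, the helper names (`mask_mcar`, `impute_stat`, `impute_knn`, `mean_abs_diff`), and the distance rule for rows with missing features are all assumptions made for illustration. The sketch masks 20% of cells completely at random, imputes them with the mean, the median, and kNN (k = 5), and reports the mean absolute difference between original and imputed values. The downstream prediction step that produces the MSE is omitted.

```python
import math
import random
import statistics


def mask_mcar(rows, frac, rng):
    """Copy rows and set a fraction `frac` of cells to None, chosen uniformly at random (MCAR)."""
    masked = [list(r) for r in rows]
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    for i, j in rng.sample(cells, round(frac * len(cells))):
        masked[i][j] = None
    return masked


def impute_stat(rows, stat=statistics.mean):
    """Fill each missing cell with a column statistic (mean, median, or mode) of the observed values."""
    fill = [stat([r[j] for r in rows if r[j] is not None]) for j in range(len(rows[0]))]
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _dist(a, b):
    """Euclidean distance over features observed in both rows, rescaled by the share of usable features."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(len(a) / len(shared) * sum((x - y) ** 2 for x, y in shared))


def impute_knn(rows, k=5):
    """Fill each missing cell with the mean of that feature over the k nearest rows that observe it."""
    mean_filled = impute_stat(rows)  # fallback when a row has no comparable neighbor
    out = [list(r) for r in rows]
    for i, r in enumerate(rows):
        for j, v in enumerate(r):
            if v is not None:
                continue
            donors = []
            for m, o in enumerate(rows):
                d = _dist(r, o)
                if m != i and o[j] is not None and d < math.inf:
                    donors.append((d, o[j]))
            donors.sort()
            nearest = [val for _, val in donors[:k]]
            out[i][j] = statistics.mean(nearest) if nearest else mean_filled[i][j]
    return out


def mean_abs_diff(original, masked, imputed):
    """Mean absolute difference between original and imputed values, over the masked cells only."""
    diffs = [abs(original[i][j] - imputed[i][j])
             for i, r in enumerate(masked) for j, v in enumerate(r) if v is None]
    return statistics.mean(diffs)


# Synthetic data (an assumption): three strongly correlated numerical features.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)]
data = [[x, 2 * x + rng.gauss(0, 0.1), -x + rng.gauss(0, 0.1)] for x in xs]
masked = mask_mcar(data, 0.2, rng)

results = {
    "mean": impute_stat(masked, statistics.mean),
    "median": impute_stat(masked, statistics.median),
    "kNN (k=5)": impute_knn(masked, k=5),
}
for name, imputed in results.items():
    print(name, round(mean_abs_diff(data, masked, imputed), 3))
```

Because the synthetic features are strongly correlated, kNN can borrow information from the observed features of similar rows, which the column mean and median cannot do. On real data the gap depends on how much the features carry about each other. The same `impute_stat` helper covers mode imputation for a categorical column when called with `statistics.mode`.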