Sainath Reddy Sankepally, Nishoak Kosaraju, K. Mallikharjuna Rao
{"title":"Data Imputation Techniques: An Empirical Study using Chronic Kidney Disease and Life Expectancy Datasets","authors":"Sainath Reddy Sankepally, Nishoak Kosaraju, K. Mallikharjuna Rao","doi":"10.1109/ICITIIT54346.2022.9744211","DOIUrl":null,"url":null,"abstract":"Data is a collection of information from the activities of the real world. The file in which such data is stored after transforming into a form that machines can process is generally known as data set. In the real world, many data sets are not complete, and they contain various types of noise. Missing values is of one such kind. Thus, imputing data of these missing values is one of the significant task of data pre-processing. This paper deals with two real time health care data sets namely life expectancy (LE) dataset and chronic kidney disease (CKD) dataset, which are very different in their nature. This paper provides insights on various data imputation techniques to fill missing values by analyzing them. When coming to Data imputation, it is very common to impute the missing values with measure of central tendencies like mean, median, mode Which can represent the central value of distribution but choosing the apt choice is real challenge. In accordance with best of our knowledge this is the first and foremost paper which provides the complete analysis of impact of basic data imputation techniques on various data distributions which can be classified based on the size of data set, number of missing values, type of data (categorical/numerical), etc. This paper compared and analyzed the original data distribution with the data distribution after each imputation in terms of their skewness, outliers and by various descriptive statistic parameters.","PeriodicalId":184353,"journal":{"name":"2022 International Conference on Innovative Trends in Information Technology (ICITIIT)","volume":"205 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Innovative Trends in Information Technology (ICITIIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICITIIT54346.2022.9744211","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
Data is a collection of information from the activities of the real world. The file in which such data is stored after transforming into a form that machines can process is generally known as data set. In the real world, many data sets are not complete, and they contain various types of noise. Missing values is of one such kind. Thus, imputing data of these missing values is one of the significant task of data pre-processing. This paper deals with two real time health care data sets namely life expectancy (LE) dataset and chronic kidney disease (CKD) dataset, which are very different in their nature. This paper provides insights on various data imputation techniques to fill missing values by analyzing them. When coming to Data imputation, it is very common to impute the missing values with measure of central tendencies like mean, median, mode Which can represent the central value of distribution but choosing the apt choice is real challenge. In accordance with best of our knowledge this is the first and foremost paper which provides the complete analysis of impact of basic data imputation techniques on various data distributions which can be classified based on the size of data set, number of missing values, type of data (categorical/numerical), etc. This paper compared and analyzed the original data distribution with the data distribution after each imputation in terms of their skewness, outliers and by various descriptive statistic parameters.