Data Imputation Techniques: An Empirical Study using Chronic Kidney Disease and Life Expectancy Datasets

2022 International Conference on Innovative Trends in Information Technology (ICITIIT), Pub Date: 2022-02-12, DOI: 10.1109/ICITIIT54346.2022.9744211

Sainath Reddy Sankepally, Nishoak Kosaraju, K. Mallikharjuna Rao


Citations: 2

Abstract

Data is a collection of information gathered from real-world activities. Once transformed into a machine-processable form, it is stored in a file generally known as a data set. Many real-world data sets are incomplete and contain various types of noise, of which missing values are one kind. Imputing these missing values is therefore a significant task in data pre-processing. This paper works with two real-time health care data sets, the life expectancy (LE) dataset and the chronic kidney disease (CKD) dataset, which differ considerably in nature, and analyzes various data imputation techniques for filling missing values. It is very common to impute missing values with a measure of central tendency such as the mean, median, or mode, each of which can represent the central value of a distribution, but choosing the appropriate one is a real challenge. To the best of our knowledge, this is the first paper to provide a complete analysis of the impact of basic data imputation techniques on various data distributions, which can be classified by the size of the data set, the number of missing values, the type of data (categorical/numerical), etc. The paper compares the original data distribution with the distribution after each imputation in terms of skewness, outliers, and various descriptive statistics.
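The kind of comparison the abstract describes can be illustrated with a minimal sketch. This is not the authors' code or data: it assumes pandas and NumPy, uses a synthetic right-skewed column in place of the LE and CKD datasets, and counts outliers with the common 1.5 × IQR rule, which the abstract does not specify. The function name `impute_and_compare` is hypothetical.

```python
import numpy as np
import pandas as pd


def impute_and_compare(series: pd.Series) -> pd.DataFrame:
    """Fill missing values with the mean, median, and mode, then summarize
    each resulting distribution next to the observed (non-missing) values."""
    candidates = {
        "observed": series.dropna(),
        "mean": series.fillna(series.mean()),
        "median": series.fillna(series.median()),
        # mode() can return several values; take the first (smallest) one.
        "mode": series.fillna(series.mode().iloc[0]),
    }
    rows = {}
    for name, s in candidates.items():
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        # Assumption: outliers are points beyond 1.5 * IQR from the quartiles.
        is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
        rows[name] = {
            "count": int(s.count()),
            "mean": s.mean(),
            "std": s.std(),
            "skewness": s.skew(),
            "outliers_iqr": int(is_outlier.sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic stand-in data: 200 right-skewed values with 40 removed at random.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=200).round(1))
values.iloc[rng.choice(200, size=40, replace=False)] = np.nan

summary = impute_and_compare(values)
print(summary)
```

Each row of `summary` describes one variant of the column. Mean imputation leaves the mean unchanged but shrinks the standard deviation, since every filled value sits exactly at the center; comparing the skewness and outlier counts across rows shows how each choice reshapes the distribution.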

