Techniques to deal with missing data
Jadran Sessa, Dabeeruddin Syed
2016 5th International Conference on Electronic Devices, Systems and Applications (ICEDSA), published 2016-12-01
DOI: https://doi.org/10.1109/ICEDSA.2016.7818486
Citations: 35
Abstract
Data is available in enormous amounts in the real world, but none of it is of practical use unless it is converted into useful information. Knowledge discovery is hindered, however, because real data is often incomplete and noisy. As a result, recovering missing data has become an important problem in data mining. Filling in the missing data is a significant task: the given datasets are generally very small, so it is paramount to use all of the available data. In this paper, we work with real data that contains many missing values and process it in three phases. The first phase performs feature selection. The second phase iteratively fills in the missing values using a probabilistic approach, taking into account that features can be either nominal or numerical. The third phase corrects the values that were filled in. We compare two imputation methods for handling the missing data, namely k-NN imputation and mean and median imputation. We find that both imputation methods are efficient and yield roughly the same accuracy.
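The two imputation families named in the abstract can be illustrated with a minimal sketch. This is not the authors' three-phase procedure, which the abstract does not specify in detail. The function names (`simple_impute`, `knn_impute`), the use of `None` as the missing-value marker, the mode as the fill for nominal features, the distance definition, and the toy data are all assumptions made for illustration.

```python
import math
from statistics import mean, median, mode


def _is_number(v):
    return isinstance(v, (int, float))


def simple_impute(rows, strategy="mean"):
    """Fill missing entries (None) column by column.

    Numeric columns get the column mean or median; nominal columns get
    the most frequent value. Assumes each column has at least one
    observed value.
    """
    fills = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        if all(_is_number(v) for v in observed):
            fills.append(mean(observed) if strategy == "mean" else median(observed))
        else:
            fills.append(mode(observed))
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def _distance(a, b):
    """Root-mean-square distance over features observed in both rows.

    Numeric features contribute a squared difference, nominal features a
    0/1 mismatch. Returns infinity when the rows share no observed feature.
    Numeric features are not rescaled here, which a real pipeline would
    usually do first.
    """
    total, shared = 0.0, 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue
        shared += 1
        if _is_number(x) and _is_number(y):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total / shared) if shared else math.inf


def knn_impute(rows, k=2):
    """Fill each missing entry from the k nearest rows that observe it."""
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            # Candidate donors: other rows with a value in column j.
            donors = [(_distance(row, other), other[j])
                      for n, other in enumerate(rows)
                      if n != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            values = [val for _, val in donors[:k]]
            if not values:
                continue  # nothing to borrow from; leave the gap
            result[i][j] = mean(values) if all(_is_number(x) for x in values) else mode(values)
    return result


# Toy data: two numeric features and one nominal feature.
rows = [
    [1.0, 10.0, "a"],
    [2.0, None, "a"],
    [3.0, 30.0, None],
    [10.0, 100.0, "b"],
]

mean_filled = simple_impute(rows, strategy="mean")
median_filled = simple_impute(rows, strategy="median")
knn_filled = knn_impute(rows, k=2)
```

On this toy data the column-wide mean is pulled upward by the outlying fourth row, whereas the k-NN fill borrows only from the two most similar rows. Whether that difference matters on a real dataset is an empirical question, which is the kind of comparison the paper reports.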