SimFiller: Similarity-Based Missing Values Filling Algorithm
Fateh ur Rehman, M. Abbas, Sajjad Murtaza, Wasi Haider Butt, S. Rehman, Usman Qamar
2018 Thirteenth International Conference on Digital Information Management (ICDIM), published 2018-09-01
DOI: 10.1109/ICDIM.2018.8846983 (https://doi.org/10.1109/ICDIM.2018.8846983)
Citations: 1
Abstract
As heterogeneous data sources multiply, the volume of low-quality data grows daily. This research proposes SimFiller, a similarity-based algorithm for filling missing (null) values, to improve data quality for the data mining process. The algorithm calculates the similarity of record pairs from the input data, considering only pairs in which at least one member has a non-null value for the attribute under consideration. After finding similar pairs, it fills each missing value from the pair with the greatest similarity, subject to the specified similarity threshold. The quality of the resulting data is evaluated by analyzing classification accuracy on the Audiology dataset. Five other missing-value filling algorithms were selected for comparison, giving a total of six filled copies of the Audiology dataset, and all six copies were tested for classification accuracy. The results show a marked improvement in classification accuracy for the copy filled with the proposed algorithm, indicating that the quality of the dataset is enhanced. The proposed algorithm can also be tested on other datasets with missing (null) values and can be extended to remove other inconsistencies from datasets.
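The filling step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the similarity measure or how the threshold is applied, so the sketch assumes (1) similarity is the fraction of the remaining attributes on which both records hold the same non-null value, and (2) a donor record qualifies only if its similarity is at least the threshold. The names `similarity`, `fill_missing`, and `threshold` are hypothetical.

```python
from typing import List, Optional, Sequence

Record = Sequence[Optional[str]]


def similarity(a: Record, b: Record, skip: int) -> float:
    """Assumed measure: share of attributes (excluding `skip`) on which
    both records hold the same non-null value."""
    others = [i for i in range(len(a)) if i != skip]
    if not others:
        return 0.0
    matches = sum(1 for i in others if a[i] is not None and a[i] == b[i])
    return matches / len(others)


def fill_missing(records: Sequence[Record], threshold: float) -> List[List[Optional[str]]]:
    """Fill each null from the most similar record that has a non-null
    value for that attribute, provided the similarity reaches `threshold`
    (an assumed reading of the abstract)."""
    filled = [list(r) for r in records]
    for r_idx, rec in enumerate(records):
        for attr, value in enumerate(rec):
            if value is not None:
                continue
            best_score, best_value = -1.0, None
            for d_idx, donor in enumerate(records):
                # Only pair with records that can supply a value.
                if d_idx == r_idx or donor[attr] is None:
                    continue
                # Similarity uses the original records, so the order in
                # which nulls are filled does not affect the result.
                score = similarity(rec, donor, attr)
                if score >= threshold and score > best_score:
                    best_score, best_value = score, donor[attr]
            if best_value is not None:
                filled[r_idx][attr] = best_value
    return filled


if __name__ == "__main__":
    data = [
        ["a", "x", "yes"],
        ["a", "x", None],
        ["b", "y", "no"],
        ["b", None, None],
    ]
    for row in fill_missing(data, threshold=0.5):
        print(row)
```

Under these assumptions, a null is left unfilled when no candidate record reaches the threshold, so raising the threshold trades coverage for confidence in the filled values.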