Identifying and Correcting Mislabeled Training Instances
Authors: Jiangwen Sun, Feng-ying Zhao, Chong-Jun Wang, Shifu Chen
Venue: Future Generation Communication and Networking (FGCN 2007)
Publication date: 2007-12-06
DOI: 10.1109/FGCN.2007.146 (https://doi.org/10.1109/FGCN.2007.146)
Citations: 53
Abstract
A clean training dataset is important for forming a good generalization from a set of training instances. Unfortunately, real-world data is never as perfect as we would like it to be and often suffers from corruption. In this paper, a new approach is proposed to identify and correct mislabeled training instances. For a given instance, we employ a Bayesian classifier to estimate the probability that the instance belongs to each of the considered class labels. The information entropy of this probability distribution is then used to evaluate how typical the instance is of the considered class labels. Finally, an instance with low entropy whose predicted label nevertheless disagrees with its given label is identified as mislabeled. Experimental results indicate that our approach achieves comparable or better performance than previous techniques.
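The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the class probabilities have already been produced by some Bayesian classifier and are passed in as plain lists. The function names (`entropy`, `flag_mislabeled`, `correct_labels`), the entropy threshold of 0.5 bits, and the rule of relabeling a flagged instance to its most probable class are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def most_probable_class(probs):
    """Index of the class with the highest probability."""
    return max(range(len(probs)), key=lambda k: probs[k])


def flag_mislabeled(posteriors, labels, threshold=0.5):
    """Return indices of instances that look mislabeled.

    An instance is flagged when its class distribution has low entropy
    (the classifier is confident) but the most probable class differs
    from the given label. The threshold value is an assumed parameter.
    """
    flagged = []
    for i, (probs, label) in enumerate(zip(posteriors, labels)):
        if entropy(probs) < threshold and most_probable_class(probs) != label:
            flagged.append(i)
    return flagged


def correct_labels(posteriors, labels, threshold=0.5):
    """Return a copy of labels with flagged instances relabeled.

    Relabeling to the most probable class is an assumed correction rule.
    """
    corrected = list(labels)
    for i in flag_mislabeled(posteriors, labels, threshold):
        corrected[i] = most_probable_class(posteriors[i])
    return corrected


# Hypothetical class probabilities for three instances and three classes.
posteriors = [
    [0.97, 0.02, 0.01],  # confident, agrees with the given label
    [0.40, 0.35, 0.25],  # uncertain, so never flagged
    [0.02, 0.96, 0.02],  # confident, disagrees with the given label
]
labels = [0, 2, 0]

print(flag_mislabeled(posteriors, labels))
print(correct_labels(posteriors, labels))
```

The second instance shows why entropy matters here: its predicted class also disagrees with the given label, but the distribution is close to uniform, so the disagreement is treated as uncertainty rather than as evidence of a labeling error.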