An Efficient Heuristic for Discovering Multiple Ill-Defined Attributes in Datasets
Sylvain Hallé
2006 5th International Conference on Machine Learning and Applications (ICMLA'06), published 2006-12-14
DOI: 10.1109/ICMLA.2006.14 (https://doi.org/10.1109/ICMLA.2006.14)
Citation count: 0
Abstract
The accuracy of the rules produced by a concept learning system can be hindered by the presence of errors in the data, such as "ill-defined" attributes that are too general or too specific for the concept to be learned. In this paper, we devise a method that uses the Boolean differences computed by a program called Newton to identify multiple ill-defined attributes in a dataset in a single pass. The method is based on a compound heuristic that assigns a real-valued rank to each possible hypothesis based on its key characteristics. We show by extensive empirical testing on randomly generated classifiers that the hypothesis with the highest rank is the correct one, with an observed probability that quickly converges to 100%. Moreover, the monotonicity of the function enables us to use it as a rough estimator of its own likelihood.
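
For readers unfamiliar with the term, the Boolean difference of a function f with respect to a variable x_i is the function f(x_i = 0) XOR f(x_i = 1): it is true exactly where flipping x_i changes the value of f. The sketch below illustrates only this textbook definition. It is not Newton's implementation and not the ranking heuristic of the paper; the names `boolean_difference`, `sensitive_assignments`, and `concept` are hypothetical and chosen for illustration.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

BoolFn = Callable[[Sequence[bool]], bool]


def boolean_difference(f: BoolFn, i: int) -> BoolFn:
    """Return df/dx_i, i.e. f with x_i forced to 0 XOR f with x_i forced to 1."""
    def diff(x: Sequence[bool]) -> bool:
        lo = list(x)
        hi = list(x)
        lo[i] = False
        hi[i] = True
        # XOR on booleans: the difference is true when the two cofactors disagree.
        return bool(f(lo)) != bool(f(hi))
    return diff


def sensitive_assignments(f: BoolFn, i: int, n: int) -> List[Tuple[bool, ...]]:
    """List the assignments (with x_i fixed to False) where flipping x_i changes f."""
    d = boolean_difference(f, i)
    return [x for x in product([False, True], repeat=n) if not x[i] and d(x)]


# Hypothetical toy concept over three attributes: x0 AND (x1 OR x2).
def concept(x: Sequence[bool]) -> bool:
    return x[0] and (x[1] or x[2])


if __name__ == "__main__":
    for i in range(3):
        print(i, sensitive_assignments(concept, i, 3))
```

For this toy concept, the difference with respect to x0 reduces to (x1 OR x2), so x0 matters whenever at least one of the other two attributes is true. An attribute whose difference is identically false has no influence on the concept at all.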