J. Luengo, Alberto Fernández, S. García, F. Herrera
Addressing Data-Complexity for Imbalanced Data-Sets: A Preliminary Study on the Use of Preprocessing for C4.5
2009 Ninth International Conference on Intelligent Systems Design and Applications, 2009-11-30
DOI: 10.1109/ISDA.2009.233 (https://doi.org/10.1109/ISDA.2009.233)
Citations: 9
Abstract
In this work we analyse the behaviour of the C4.5 classification method on a collection of imbalanced data-sets. We use two data complexity metrics, the “maximum Fisher's discriminant ratio” and the “nonlinearity of the 1NN classifier”, to analyse the effect of preprocessing (oversampling in this case) as a way of dealing with the imbalance problem. To do so, we run C4.5 over a wide range of imbalanced data-sets built from real data and try to extract behaviour patterns from the results. We obtain rules that describe both good and bad behaviour of C4.5, first on the original data-sets (no preprocessing) and then with preprocessing applied. These rules allow us to determine the effect of preprocessing, to predict the response of C4.5 to preprocessing from a data-set's complexity metrics before it is applied, and thus to establish when preprocessing would be useful.
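To make the first of the two complexity metrics concrete, the following is a minimal sketch of the maximum Fisher's discriminant ratio for a two-class data-set, assuming the commonly used per-feature form (difference of class means squared, divided by the sum of class variances, maximised over features). The abstract does not give the formula, so the function name `max_fisher_ratio`, the use of population variance, and the toy data are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pvariance


def max_fisher_ratio(X, y):
    """Maximum Fisher's discriminant ratio for a two-class data-set.

    X: list of rows, each a list of numeric feature values.
    y: list of class labels with exactly two distinct values.
    Assumption: population variance is used for each class.
    """
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two classes")

    # Split the rows by class label.
    class_a = [row for row, lab in zip(X, y) if lab == labels[0]]
    class_b = [row for row, lab in zip(X, y) if lab == labels[1]]

    best = 0.0
    for j in range(len(X[0])):
        col_a = [row[j] for row in class_a]
        col_b = [row[j] for row in class_b]
        gap = (mean(col_a) - mean(col_b)) ** 2
        spread = pvariance(col_a) + pvariance(col_b)
        if spread == 0:
            # Zero variance in both classes: the feature either separates
            # the classes perfectly or carries no information at all.
            ratio = float("inf") if gap > 0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best


# Hypothetical imbalanced toy data: four majority rows, two minority rows.
X = [[0.0, 5.0], [2.0, 5.5], [0.0, 5.5], [2.0, 5.0],
     [10.0, 5.0], [12.0, 5.5]]
y = ["maj", "maj", "maj", "maj", "min", "min"]
f1 = max_fisher_ratio(X, y)
print(f1)
```

In this toy data the first feature separates the classes well and the second not at all, so the metric is driven entirely by the first feature. A high value suggests that at least one feature discriminates the classes nearly on its own, while a low value indicates heavy overlap along every individual feature.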