{"title":"Dimension reduction based on categorical fuzzy correlation degree for document categorization","authors":"Qiang Li, Liang He, Xin Lin","doi":"10.1109/GrC.2013.6740405","DOIUrl":null,"url":null,"abstract":"High dimensionality of the feature space is a common problem in document categorization. Most of the features obtained through conventional feature selection algorithms such as IG are relevant and redundant. In this paper, a two-step feature selection method is proposed. At the first step redundancy analysis among original features based on categorical fuzzy correlation degree is applied to filter the redundant features with the similar categorical term frequency distribution. In the second step, conventional IG feature selection algorithm is adopted to select the final feature set for document categorization. Experiments dealing with the well-known Reuters-21578 and 20news-18828 corpuses show that the proposed method can eliminate redundant features with high fuzzy correlation degree between each other and obtain a compressed feature space where the dimension of feature space is dramatically reduced. The document categorization results on two corpuses show that the conventional IG feature selection algorithm can achieve a better document categorization performance on the compressed feature space and demonstrate the effectiveness of the proposed method.","PeriodicalId":415445,"journal":{"name":"2013 IEEE International Conference on Granular Computing (GrC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE International Conference on Granular Computing (GrC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/GrC.2013.6740405","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3
Abstract
High dimensionality of the feature space is a common problem in document categorization. Most of the features obtained through conventional feature selection algorithms such as IG are relevant and redundant. In this paper, a two-step feature selection method is proposed. At the first step redundancy analysis among original features based on categorical fuzzy correlation degree is applied to filter the redundant features with the similar categorical term frequency distribution. In the second step, conventional IG feature selection algorithm is adopted to select the final feature set for document categorization. Experiments dealing with the well-known Reuters-21578 and 20news-18828 corpuses show that the proposed method can eliminate redundant features with high fuzzy correlation degree between each other and obtain a compressed feature space where the dimension of feature space is dramatically reduced. The document categorization results on two corpuses show that the conventional IG feature selection algorithm can achieve a better document categorization performance on the compressed feature space and demonstrate the effectiveness of the proposed method.