{"title":"A Frequency-Category Based Feature Selection in Big Data for Text Classification","authors":"Houda Amazal, M. Ramdani, M. Kissi","doi":"10.1145/3419604.3419620","DOIUrl":null,"url":null,"abstract":"In big data era, text classification is considered as one of the most important machine learning application domain. However, to build an efficient algorithm for classification, feature selection is a fundamental step to reduce dimensionality, achieve better accuracy and improve time execution. In the literature, most of the feature ranking techniques are document based. The major weakness of this approach is that it favours the terms occurring frequently in the documents and neglects the correlation between the terms and the categories. In this work, unlike the traditional approaches which deal with documents individually, we use mapreduce paradigm to process the documents of each category as a single document. Then, we introduce a parallel frequency-category feature selection method independently of any classifier to select the most relevant features. Experimental results on the 20-Newsgroups dataset showed that our approach improves the classification accuracy to 90.3%. Moreover, the system maintains the simplicity and lower execution time.","PeriodicalId":250715,"journal":{"name":"Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3419604.3419620","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In big data era, text classification is considered as one of the most important machine learning application domain. However, to build an efficient algorithm for classification, feature selection is a fundamental step to reduce dimensionality, achieve better accuracy and improve time execution. In the literature, most of the feature ranking techniques are document based. The major weakness of this approach is that it favours the terms occurring frequently in the documents and neglects the correlation between the terms and the categories. In this work, unlike the traditional approaches which deal with documents individually, we use mapreduce paradigm to process the documents of each category as a single document. Then, we introduce a parallel frequency-category feature selection method independently of any classifier to select the most relevant features. Experimental results on the 20-Newsgroups dataset showed that our approach improves the classification accuracy to 90.3%. Moreover, the system maintains the simplicity and lower execution time.