A Frequency-Category Based Feature Selection in Big Data for Text Classification

Houda Amazal, M. Ramdani, M. Kissi
{"title":"A Frequency-Category Based Feature Selection in Big Data for Text Classification","authors":"Houda Amazal, M. Ramdani, M. Kissi","doi":"10.1145/3419604.3419620","DOIUrl":null,"url":null,"abstract":"In big data era, text classification is considered as one of the most important machine learning application domain. However, to build an efficient algorithm for classification, feature selection is a fundamental step to reduce dimensionality, achieve better accuracy and improve time execution. In the literature, most of the feature ranking techniques are document based. The major weakness of this approach is that it favours the terms occurring frequently in the documents and neglects the correlation between the terms and the categories. In this work, unlike the traditional approaches which deal with documents individually, we use mapreduce paradigm to process the documents of each category as a single document. Then, we introduce a parallel frequency-category feature selection method independently of any classifier to select the most relevant features. Experimental results on the 20-Newsgroups dataset showed that our approach improves the classification accuracy to 90.3%. Moreover, the system maintains the simplicity and lower execution time.","PeriodicalId":250715,"journal":{"name":"Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3419604.3419620","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In big data era, text classification is considered as one of the most important machine learning application domain. However, to build an efficient algorithm for classification, feature selection is a fundamental step to reduce dimensionality, achieve better accuracy and improve time execution. In the literature, most of the feature ranking techniques are document based. The major weakness of this approach is that it favours the terms occurring frequently in the documents and neglects the correlation between the terms and the categories. In this work, unlike the traditional approaches which deal with documents individually, we use mapreduce paradigm to process the documents of each category as a single document. Then, we introduce a parallel frequency-category feature selection method independently of any classifier to select the most relevant features. Experimental results on the 20-Newsgroups dataset showed that our approach improves the classification accuracy to 90.3%. Moreover, the system maintains the simplicity and lower execution time.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于频率分类的大数据文本分类特征选择
在大数据时代,文本分类被认为是机器学习最重要的应用领域之一。然而,要构建高效的分类算法,特征选择是降低维数、提高准确率和提高执行时间的基本步骤。在文献中,大多数特征排序技术都是基于文档的。这种方法的主要缺点是,它偏爱在文件中经常出现的术语,而忽略了术语与类别之间的相关性。在这项工作中,与传统的单独处理文档的方法不同,我们使用mapreduce范式将每个类别的文档作为单个文档进行处理。然后,我们引入了一种独立于任何分类器的并行频率-类别特征选择方法来选择最相关的特征。在20个新闻组数据集上的实验结果表明,我们的方法将分类准确率提高到90.3%。并且保持了系统的简单性和较低的执行时间。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Towards Mining Semantically Enriched Configurable Process Models Optimized Switch-Controller Association For Data Center Test Generation Tool for Modified Condition/Decision Coverage: Model Based Testing SHAMan Use of formative assessment to improve the online teaching materials content quality
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1