Particle Swarm Optimization Based Two-Stage Feature Selection in Text Mining

Xiaohan Bai, Xiaoying Gao, Bing Xue
Published in: 2018 IEEE Congress on Evolutionary Computation (CEC), 2018-07-01
DOI: 10.1109/CEC.2018.8477773 (https://doi.org/10.1109/CEC.2018.8477773)
Citations: 29

Abstract

Text mining is an important and popular data mining topic whose fundamental objective is to enable users to extract informative data from text-based assets and to perform related operations on the text, such as retrieval, classification, and summarization. For text classification, one of the most important steps is feature selection, because not all features in a text dataset are useful for classification. Irrelevant and redundant features should be removed to increase accuracy and to decrease complexity and running time, but doing so is often expensive, and most existing methods use a simple filter to remove features, which may lose useful ones because of feature interactions. Furthermore, there is little research on using particle swarm optimization (PSO) to select informative features for text classification. This paper presents a novel two-stage method for text feature selection: in the first stage, features are selected by four different filter ranking methods; in the second stage, PSO removes further irrelevant features to compose the final feature subset. The proposed algorithm is compared with four traditional feature selection methods on the commonly used Reuters-21578 dataset. The experimental results show that the proposed two-stage method can substantially reduce the dimensionality of the feature space and improve classification accuracy.
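The two-stage idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify which filter measures, PSO variant, fitness function, or classifier the authors use, so everything below is an assumption made for illustration: a single document-frequency-difference filter stands in for the four ranking methods, a binary PSO with a sigmoid transfer function stands in for the paper's PSO, and a nearest-centroid classifier scored on the training data stands in for the evaluation. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random


def make_toy_data(n_docs=40, n_features=12, seed=1):
    """Hypothetical binary term-presence data; features 0 and 1 carry the signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_docs):
        label = i % 2
        row = [rng.randint(0, 1) for _ in range(n_features)]
        row[0] = label                                      # perfectly informative
        row[1] = label if rng.random() < 0.8 else 1 - label  # noisy copy of the label
        X.append(row)
        y.append(label)
    return X, y


def filter_rank(X, y, k):
    """Stage 1 (assumed filter): keep the k features whose document
    frequency differs most between the two classes."""
    pos = [r for r, l in zip(X, y) if l == 1]
    neg = [r for r, l in zip(X, y) if l == 0]

    def score(j):
        return abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def accuracy(X, y, features):
    """Nearest-centroid accuracy on the chosen features (training data, for brevity)."""
    if not features:
        return 0.0
    centroids = {}
    for label in (0, 1):
        rows = [r for r, l in zip(X, y) if l == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    correct = 0
    for row, label in zip(X, y):
        dist = {c: sum((row[j] - m) ** 2 for j, m in zip(features, mu))
                for c, mu in centroids.items()}
        correct += min(dist, key=dist.get) == label
    return correct / len(X)


def pso_select(X, y, candidates, n_particles=10, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, penalty=0.01, seed=0):
    """Stage 2 (assumed variant): binary PSO over the stage-1 candidates.
    Fitness = accuracy minus a small per-feature penalty."""
    rng = random.Random(seed)
    dim = len(candidates)

    def fitness(bits):
        chosen = [f for f, b in zip(candidates, bits) if b]
        return accuracy(X, y, chosen) - penalty * len(chosen)

    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, v))  # clamp velocity
                # Sigmoid of the velocity gives the probability that the bit is 1.
                pos[i][d] = 1 if rng.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return [f for f, b in zip(candidates, gbest) if b], gbest_fit


X, y = make_toy_data()
stage1 = filter_rank(X, y, k=6)           # stage 1: filter ranking
selected, fit = pso_select(X, y, stage1)  # stage 2: PSO prunes the candidates
print("stage 1 candidates:", stage1)
print("stage 2 selection:", selected, "fitness:", round(fit, 3))
```

The design point the sketch tries to capture is that the filter cheaply shrinks the search space, so the more expensive wrapper-style PSO search only has to explore subsets of the surviving candidates. A faithful reproduction would need the paper's actual filter measures, classifier, and a held-out evaluation split.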