Particle Swarm Optimization Based Two-Stage Feature Selection in Text Mining

2018 IEEE Congress on Evolutionary Computation (CEC). Pub Date: 2018-07-01. DOI: 10.1109/CEC.2018.8477773

Xiaohan Bai, Xiaoying Gao, Bing Xue


Citations: 29

Abstract

Text mining is an important and popular data mining topic, where a fundamental objective is to enable users to extract informative data from text-based assets and perform related operations on the text, such as retrieval, classification, and summarization. For text classification, one of the most important steps is feature selection, because not all the features in a text dataset are useful for classification. Irrelevant and redundant features should be removed to increase accuracy and to decrease complexity and running time, but doing so is often expensive. Most existing methods use a simple filter to remove features, which might lose some useful ones because of feature interactions. Furthermore, there is little research using particle swarm optimization (PSO) algorithms to select informative features for text classification. This paper presents a novel two-stage method for text feature selection: in the first stage, features are selected by four different filter ranking methods, and in the second stage, PSO removes more irrelevant features to compose the final feature subset. The proposed algorithm is compared with four traditional feature selection methods on the commonly used Reuters-21578 dataset. The experimental results show that the proposed two-stage method can substantially reduce the dimensionality of the feature space and improve the classification accuracy.
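The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not name the four filter methods, the PSO variant, or the classifier, so everything specific here is an assumption made for illustration: a chi-square score as the only filter, a sigmoid-based binary PSO, leave-one-out 1-NN accuracy as the fitness, a small subset-size penalty, commonly used PSO constants, and a toy binary document-term matrix.

```python
import math
import random

def chi2_score(column, labels):
    """Chi-square statistic between a binary feature and a binary class label."""
    n = len(labels)
    a = sum(1 for x, c in zip(column, labels) if x and c)      # term present, class 1
    b = sum(1 for x, c in zip(column, labels) if x and not c)  # term present, class 0
    c_ = sum(1 for x, c in zip(column, labels) if not x and c) # term absent, class 1
    d = n - a - b - c_                                         # term absent, class 0
    denom = (a + b) * (c_ + d) * (a + c_) * (b + d)
    return n * (a * d - b * c_) ** 2 / denom if denom else 0.0

def filter_stage(X, y, k):
    """Stage 1 (assumed filter): keep the indices of the k top-ranked features."""
    scores = [chi2_score([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def loo_accuracy(X, y, features):
    """Leave-one-out 1-NN accuracy (Hamming distance) on the chosen features."""
    if not features:
        return 0.0
    correct = 0
    for i, row in enumerate(X):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: sum(row[f] != X[j][f] for f in features))
        correct += y[nearest] == y[i]
    return correct / len(X)

def pso_stage(X, y, candidates, n_particles=10, n_iter=20, seed=0):
    """Stage 2: binary PSO searching subsets of the stage-1 candidates."""
    rng = random.Random(seed)
    dim = len(candidates)

    def fitness(bits):
        chosen = [f for f, bit in zip(candidates, bits) if bit]
        # Size penalty is an assumption: it makes smaller subsets win ties.
        return loo_accuracy(X, y, chosen) - 0.01 * len(chosen) / dim

    pos = [[rng.random() < 0.5 for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    w, c1, c2 = 0.7298, 1.49618, 1.49618  # commonly used constants (assumed)
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))  # clamp the velocity
                # Sigmoid transfer: velocity sets the probability of selecting the feature.
                pos[i][d] = rng.random() < 1.0 / (1.0 + math.exp(-vel[i][d]))
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return [f for f, bit in zip(candidates, gbest) if bit]

# Toy data (hypothetical): rows are documents, columns are binary term indicators.
X = [
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = filter_stage(X, y, k=4)     # stage 1: filter ranking
selected = pso_stage(X, y, candidates)   # stage 2: PSO on the survivors
print("after filter:", candidates, "after PSO:", selected)
```

In this sketch the filter cheaply shrinks the search space, so the more expensive wrapper-style PSO search only has to explore subsets of the surviving candidates. A real text corpus would need a sparse representation and a faster fitness function than leave-one-out 1-NN.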


