Feature redundancy removal for text classification using correlated feature subsets

Computational Intelligence, Vol. 40, No. 1 (IF 1.8; Zone 4, Computer Science; JCR Q3, Computer Science, Artificial Intelligence) Pub Date: 2023-12-21 DOI: 10.1111/coin.12621

Lazhar Farek, Amira Benaidja



Abstract

High dimensionality in text classification is a persistent problem that calls for efficient feature selection (FS) methods to improve classification accuracy and reduce learning time. Existing filter-based FS methods evaluate each feature independently of related ones, which can lead to selecting a large number of redundant features, especially in high-dimensional datasets, and in turn to longer learning time and lower classification performance. Information theory-based methods instead aim to maximize each feature's dependency on the class variable while minimizing its redundancy with respect to all selected features, which becomes increasingly impractical as the feature space grows. To overcome the time complexity of information theory-based methods while still addressing redundancy, we propose a new feature selection method for text classification, termed correlation-based redundancy removal, which minimizes redundancy within subsets of features that have close mutual information scores, without sequentially scanning the already selected features. The idea is that there is little value in assessing the redundancy between a dominant feature carrying high classification information and an irrelevant feature carrying low classification information, or vice versa, since such features are implicitly weakly correlated. Tested on seven datasets with both traditional classifiers (Naive Bayes and support vector machines) and deep learning models (long short-term memory and convolutional neural networks), our method reduced redundancy and improved classification performance compared to ten competing metrics.
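
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes binary term-presence features, ranks them by mutual information with the class, cuts the ranking into fixed-size subsets of neighbouring (hence similarly scored) features, and checks redundancy only inside each subset. The names `select_features`, `subset_size`, and `threshold`, the use of absolute Pearson correlation as the redundancy measure, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def abs_pearson(a, b):
    """Absolute Pearson correlation; 0.0 if either column is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_a * var_b)


def select_features(columns, labels, subset_size=3, threshold=0.9):
    """Hypothetical sketch: drop redundant features within MI-ranked subsets.

    columns: dict mapping feature name -> list of 0/1 values per document.
    subset_size and threshold are illustrative parameters (assumptions).
    """
    scores = {name: mutual_information(col, labels)
              for name, col in columns.items()}
    # Rank by MI with the class; neighbours in this ranking have close scores.
    ranked = sorted(columns, key=lambda name: scores[name], reverse=True)
    kept = []
    for start in range(0, len(ranked), subset_size):
        survivors = []
        for name in ranked[start:start + subset_size]:
            # Compare only against survivors of the SAME subset, never
            # against the full list of previously selected features.
            if all(abs_pearson(columns[name], columns[s]) < threshold
                   for s in survivors):
                survivors.append(name)
        kept.extend(survivors)
    return kept


# Toy corpus of 8 documents (made-up data): 1 = term present.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "goal":  [1, 1, 1, 1, 0, 0, 0, 0],
    "match": [1, 1, 1, 1, 0, 0, 0, 0],  # duplicate of "goal"
    "team":  [1, 1, 1, 0, 0, 0, 0, 0],
    "the":   [1, 0, 1, 0, 1, 0, 1, 0],  # uninformative
    "and":   [1, 0, 1, 0, 1, 0, 1, 0],  # duplicate of "the"
}
print(select_features(columns, labels))  # ['goal', 'team', 'the']
```

In this sketch each feature is compared with at most `subset_size - 1` others, so the number of pairwise checks grows linearly with the number of features rather than quadratically. Note that it only removes redundancy: a low-scoring feature such as "the" survives unless a separate top-k cut on the mutual information ranking is applied.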


Source journal

Computational Intelligence (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 6.90

Self-citation rate: 3.60%

Review time: >12 weeks

Journal description: This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues, from the tools and languages of AI to its philosophical implications, Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.