Semantic similarity-aware feature selection and redundancy removal for text classification using joint mutual information

IF 2.5; Zone 4 (Computer Science); JCR Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Knowledge and Information Systems; Pub Date: 2024-06-13; DOI: 10.1007/s10115-024-02143-1
Farek Lazhar, Benaidja Amira
Citations: 0

Abstract

The high dimensionality of text data is a challenging issue that requires efficient methods to reduce the vector space and improve classification accuracy. Existing filter-based methods fail to address redundancy, resulting in the selection of irrelevant and redundant features. Information theory-based methods solve this problem effectively but are impractical for large amounts of data because of their high time complexity. The proposed method, termed semantic similarity-aware feature selection and redundancy removal (SS-FSRR), uses the joint mutual information between pairs of semantically related terms and the class label to capture redundant features. It rests on the assumption that semantically related terms are potentially redundant, which avoids sequential search strategies and can significantly reduce execution time. In this work, we use Word2Vec’s CBOW model to obtain the semantic similarity between terms. The performance of SS-FSRR is compared with six state-of-the-art selection methods for categorical data using two traditional classifiers (SVM and NB) and a robust deep learning model (LSTM) on seven datasets with 10-fold cross-validation. Experimental results show that SS-FSRR outperforms the other methods on most of the tested datasets and is highly stable as measured by the Jaccard index.
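
The two quantities named in the abstract, joint mutual information between a term pair and the class label, and the Jaccard index used to measure selection stability, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the algorithm, so the function names, the toy binary term-presence vectors, and the "redundancy gain" rule below are assumptions made for illustration only. The Word2Vec CBOW step is not shown; the sketch assumes semantically related term pairs have already been identified.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired discrete observations."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts.
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def joint_mutual_information(xi, xj, ys):
    """Estimate I(Xi,Xj;Y) by treating the pair (Xi,Xj) as one variable."""
    return mutual_information(list(zip(xi, xj)), ys)


def redundancy_gain(xi, xj, ys):
    """Hypothetical rule: information the pair adds over its better member.

    A gain near zero suggests one term of the pair is redundant.
    """
    best_single = max(mutual_information(xi, ys), mutual_information(xj, ys))
    return joint_mutual_information(xi, xj, ys) - best_single


def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(subsets):
    """Mean pairwise Jaccard index over subsets (e.g. one per CV fold)."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard_index(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy data (assumed): term presence in 8 documents and their labels.
    y = [1, 1, 1, 1, 0, 0, 0, 0]
    car = [1, 1, 1, 0, 0, 0, 0, 0]
    auto = [1, 1, 1, 0, 0, 0, 0, 0]    # co-occurs exactly with "car"
    engine = [0, 0, 0, 1, 0, 0, 0, 0]  # covers the document "car" misses
    print("gain(car, auto):  ", redundancy_gain(car, auto, y))
    print("gain(car, engine):", redundancy_gain(car, engine, y))
    folds = [{"car", "engine"}, {"car", "engine"}, {"car", "auto"}]
    print("stability:", selection_stability(folds))
```

In this toy setting the pair (car, auto) adds nothing over car alone, while (car, engine) jointly determines the label, which is the kind of distinction a joint-mutual-information criterion is meant to expose. Restricting the computation to semantically related pairs, rather than searching over all pairs sequentially, is where the abstract locates the time savings.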

Source journal
Knowledge and Information Systems (Engineering and Technology: Computer Science, Artificial Intelligence)
CiteScore: 5.70
Self-citation rate: 7.40%
Articles published: 152
Review time: 7.2 months
About the journal: Knowledge and Information Systems (KAIS) provides an international forum for researchers and professionals to share their knowledge and report new advances on all topics related to knowledge systems and advanced information systems. This monthly peer-reviewed archival journal publishes state-of-the-art research reports on emerging topics in KAIS, reviews of important techniques in related areas, and application papers of interest to a general readership.