Online feature subset selection for mining feature streams in big data via incremental learning and evolutionary computation

Swarm and Evolutionary Computation (IF 8.5; Zone 1, Computer Science; JCR Q1, Computer Science, Artificial Intelligence). Pub Date: 2025-04-01. Epub Date: 2025-02-26. DOI: 10.1016/j.swevo.2025.101896

Yelleti Vivek, Vadlamani Ravi, P. Radha Krishna

Volume 94, Article 101896. Full text: https://www.sciencedirect.com/science/article/pii/S2210650225000549

Citations: 0

Abstract

Online streaming feature subset selection (OSFSS) is challenging when data samples arrive rapidly and in a time-dependent manner, and the problem becomes harder still when the features themselves arrive as a stream. Despite several attempts to solve OSFSS over feature streams, existing methods lack scalability, cannot handle interaction effects among features, and do not handle high-velocity feature streams efficiently. To address these challenges, we propose a novel wrapper for OSFSS, named OSFSS-W, designed to mine feature streams within the Apache Spark environment. The proposed method incorporates (i) two vigilance tests, which remove (a) irrelevant features and (b) redundant features; (ii) incremental learning; and (iii) a tolerance-based feedback mechanism that retains and reuses previous knowledge while adhering to predefined tolerance thresholds. For optimization, we introduce a Bare Bones Particle Swarm Optimization algorithm driven by the logistic distribution (BBPSO-L), which is parallelized within Apache Spark following an island-based approach. We evaluated the proposed algorithm on datasets from the cybersecurity, bioinformatics, and finance domains. The results demonstrate that combining the two vigilance tests with the tolerance-based feedback mechanism significantly improved the median area under the receiver operating characteristic curve (AUC) and the median cardinality across all datasets.
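To make the optimizer named in the abstract more concrete, here is a minimal single-process sketch of a bare-bones particle swarm whose position update samples from a logistic distribution. It is an illustration under stated assumptions, not the authors' BBPSO-L: the abstract gives no update rule, so the location and scale used here follow the usual bare-bones convention (midpoint of, and distance between, personal and global bests). The 0.5 threshold for turning positions into a feature mask, the `toy_fitness` objective, and all function names are hypothetical. The Spark island model, the vigilance tests, and the feedback mechanism are not shown.

```python
import math
import random


def sample_logistic(loc, scale, rng):
    """Draw one value from Logistic(loc, scale) by inverse-CDF sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
    return loc + scale * math.log(u / (1.0 - u))


def to_mask(position):
    """Map a continuous position to a binary feature mask (assumed 0.5 threshold)."""
    return [1 if x > 0.5 else 0 for x in position]


def toy_fitness(mask, relevant):
    """Hypothetical stand-in objective: reward relevant features, penalise subset size.

    A wrapper method would instead train a classifier on the selected
    features and score it (for example by AUC).
    """
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in relevant)
    return hits - 0.1 * sum(mask)


def bbpso_logistic(n_features, relevant, n_particles=20, n_iters=50, seed=0):
    """Return (best_mask, best_fitness) found by the sketch optimizer."""
    rng = random.Random(seed)
    pbest = [[rng.random() for _ in range(n_features)] for _ in range(n_particles)]
    pbest_fit = [toy_fitness(to_mask(p), relevant) for p in pbest]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            # Bare-bones update: no velocity term. Each coordinate is sampled
            # around the midpoint of the personal and global bests, with the
            # spread set by the distance between them (assumed convention).
            cand = [sample_logistic((p + q) / 2.0, abs(p - q), rng)
                    for p, q in zip(pbest[i], gbest)]
            fit = toy_fitness(to_mask(cand), relevant)
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = cand, fit
                if fit > gbest_fit:
                    gbest, gbest_fit = cand[:], fit
    return to_mask(gbest), gbest_fit
```

Calling `bbpso_logistic(8, {0, 3, 5})` returns a binary mask over eight features together with its toy fitness. The logistic distribution has heavier tails than the Gaussian used in classic bare-bones PSO, so occasional longer jumps are more likely; whether that is the authors' motivation is not stated in the abstract.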




Source journal

Swarm and Evolutionary Computation (Computer Science, Artificial Intelligence; Computer Science, Theory & Methods)

CiteScore: 16.00

Self-citation rate: 12.00%

Articles published: 169

About the journal: Swarm and Evolutionary Computation is a peer-reviewed journal focused on the latest research and advancements in nature-inspired intelligent computation using swarm and evolutionary algorithms. It covers theoretical, experimental, and practical aspects of these paradigms and their hybrids, promoting interdisciplinary research. The journal prioritizes the publication of high-quality, original articles that push the boundaries of evolutionary computation and swarm intelligence. It also welcomes survey papers on current topics and novel applications. Topics of interest include but are not limited to: Genetic Algorithms and Genetic Programming, Evolution Strategies and Evolutionary Programming, Differential Evolution, Artificial Immune Systems, Particle Swarms, Ant Colony, Bacterial Foraging, Artificial Bees, Firefly Algorithm, Harmony Search, Artificial Life, Digital Organisms, Estimation of Distribution Algorithms, Stochastic Diffusion Search, Quantum Computing, Nano Computing, Membrane Computing, Human-centric Computing, Hybridization of Algorithms, Memetic Computing, Autonomic Computing, Self-organizing Systems, and Combinatorial, Discrete, Binary, Constrained, Multi-objective, Multi-modal, Dynamic, and Large-scale Optimization.