Online imbalance learning with unpredictable feature evolution and label scarcity

Neurocomputing · IF 5.5 · JCR Q1 (Computer Science, Artificial Intelligence) · CAS Zone 2 (Computer Science) · Publication date: 2024-09-03 · DOI: 10.1016/j.neucom.2024.128476
{"title":"Online imbalance learning with unpredictable feature evolution and label scarcity","authors":"","doi":"10.1016/j.neucom.2024.128476","DOIUrl":null,"url":null,"abstract":"<div><p>Recently, online learning with imbalanced data streams has aroused wide concern, which reflects an uneven distribution of different classes in data streams. Existing approaches have conventionally been conducted on stationary feature space and they assume that we can obtain the entire labels of data streams in the case of supervised learning. However, in many real scenarios, e.g., the environment monitoring task, new features flood in, and old features are partially lost during the changing environment as the different lifespans of different sensors. Besides, each instance needs to be labeled by experts, resulting in expensive costs and scarcity of labels. To address the above problems, this paper proposes a novel Online Imbalance learning with unpredictable Feature evolution and Label scarcity (OIFL) algorithm. First, we utilize margin-based online active learning to selectively label valuable instances. After obtaining the labels, we handle imbalanced class distribution by optimizing F-measure and transforming F-measure optimization into a weighted surrogate loss minimization. When data streams arrive with augmented features, we combine the online passive-aggressive algorithm and structural risk minimization to update the classifier in the divided feature space. When data streams arrive with incomplete features, we leverage variance to identify the most informative features following the empirical risk minimization principle and continue to update the existing classifier as before. Finally, we obtain a sparse but reliable learner by the strategy of projecting truncation. We derive theoretical analyses of OIFL. Also, experiments on the synthetic datasets and real-world data streams to validate the effectiveness of our method.</p></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":null,"pages":null},"PeriodicalIF":5.5000,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224012475","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Online learning from imbalanced data streams, in which the classes are unevenly distributed, has recently attracted wide attention. Existing approaches typically operate on a stationary feature space and assume that all labels of the data stream are available for supervised learning. However, in many real scenarios, such as environmental monitoring, new features flood in and old features are partially lost as the environment changes, because different sensors have different lifespans. Moreover, each instance must be labeled by experts, which is costly and leads to label scarcity. To address these problems, this paper proposes a novel Online Imbalance learning with unpredictable Feature evolution and Label scarcity (OIFL) algorithm. First, we use margin-based online active learning to selectively label valuable instances. After obtaining the labels, we handle the imbalanced class distribution by optimizing the F-measure, transforming F-measure optimization into weighted surrogate loss minimization. When data streams arrive with augmented features, we combine the online passive-aggressive algorithm and structural risk minimization to update the classifier in the divided feature space. When data streams arrive with incomplete features, we leverage variance to identify the most informative features following the empirical risk minimization principle and continue to update the existing classifier as before. Finally, we obtain a sparse but reliable learner through a projected truncation strategy. We derive theoretical analyses of OIFL and conduct experiments on synthetic datasets and real-world data streams to validate the effectiveness of our method.
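To make the active-labeling, imbalance-handling, and truncation steps described above concrete, the following is a minimal, hypothetical sketch in Python. It is not the authors' implementation: the margin threshold, the class-weighted hinge loss standing in for the weighted F-measure surrogate, the PA-I style update, and the truncation size (margin_delta, C, pos_weight, trunc_k) are all assumptions made for illustration, and the feature-evolution handling (augmented or incomplete feature spaces) is omitted.

```python
import numpy as np

# Hypothetical illustration of three ideas named in the abstract:
# (1) margin-based active querying, (2) a class-weighted hinge surrogate
# updated in passive-aggressive (PA-I) style, (3) truncation for sparsity.
# All parameter names and values are assumptions, not taken from the paper.

class MarginActivePAClassifier:
    def __init__(self, n_features, margin_delta=1.0, C=1.0,
                 pos_weight=5.0, trunc_k=50):
        self.w = np.zeros(n_features)
        self.margin_delta = margin_delta   # query a label when |score| is small
        self.C = C                         # PA-I aggressiveness cap
        self.pos_weight = pos_weight       # up-weight the minority (positive) class
        self.trunc_k = trunc_k             # keep only the k largest weights

    def should_query(self, x):
        # Margin-based active learning: request a label only when the
        # current prediction is uncertain (small absolute margin).
        return abs(self.w @ x) <= self.margin_delta

    def update(self, x, y):
        # Class-weighted hinge loss as a stand-in for the weighted
        # surrogate of the F-measure described in the abstract (y in {-1, +1}).
        weight = self.pos_weight if y > 0 else 1.0
        loss = weight * max(0.0, 1.0 - y * (self.w @ x))
        if loss > 0.0:
            # PA-I style closed-form step size, capped by C.
            tau = min(self.C, loss / (np.dot(x, x) + 1e-12))
            self.w += tau * y * x
        self._truncate()

    def _truncate(self):
        # Projected truncation: zero out all but the k largest-magnitude
        # weights so the learner stays sparse.
        if np.count_nonzero(self.w) > self.trunc_k:
            idx = np.argsort(np.abs(self.w))[:-self.trunc_k]
            self.w[idx] = 0.0

    def predict(self, x):
        return 1 if self.w @ x >= 0 else -1
```

In such a sketch, each incoming instance x_t is first scored; if should_query(x_t) returns True, the label is requested and update(x_t, y_t) is called, otherwise the model only predicts and remains unchanged. The pos_weight knob plays the role of the per-class weights that a weighted surrogate loss assigns to the minority class.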

Source journal
Neurocomputing (Engineering & Technology: Computer Science, Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles per year: 1382
Review time: 70 days
Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.