Classifying Imbalanced Data Streams via Dynamic Feature Group Weighting with Importance Sampling.

Proceedings of the ... SIAM International Conference on Data Mining. Pub Date: 2014-04-01. DOI: 10.1137/1.9781611973440.83

Ke Wu, Andrea Edwards, Wei Fan, Jing Gao, Kun Zhang

Volume 2014, pages 722-730.

Citations: 28

Abstract

Data stream classification and imbalanced data learning are two important areas of data mining research. Each has been studied extensively, and many algorithms have been developed. However, because of their complex interplay, only a few approaches reported in the literature address the intersection of the two fields. In this work, we propose an importance-sampling-driven, dynamic feature group weighting framework (DFGW-IS) for classifying data streams with imbalanced class distributions. Two components are tightly integrated into the proposed approach to address the intrinsic characteristics of concept-drifting, imbalanced streaming data. Specifically, the ever-evolving concepts are handled by a weighted ensemble trained on a set of feature groups, with each sub-classifier (i.e., a single classifier or an ensemble) weighted by its discriminative power and stability. The uneven class distribution, on the other hand, is handled by the sub-classifier built on a specific feature group, with the underlying distribution rebalanced by importance sampling. We derive a theoretical upper bound on the generalization error of the proposed algorithm. We also study the empirical performance of our method on a set of benchmark synthetic and real-world data sets, where it achieves significant improvement over competing algorithms in terms of standard evaluation metrics and parallel running time. Algorithm implementations and datasets are available upon request.
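
The two components named in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual weighting or sampling formulas, so everything below is an assumption made for illustration: the importance weights use a uniform target class distribution, the ensemble is combined by a plain weighted vote, and the function names (`importance_weights`, `rebalanced_sample`, `weighted_vote`) are hypothetical rather than taken from DFGW-IS.

```python
import random
from collections import Counter


def importance_weights(labels):
    """Per-example weight w(y) = q(y) / p(y).

    Assumption: the target distribution q is uniform over the observed
    classes, and p is the empirical class frequency in the current chunk.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [(1.0 / k) / (counts[y] / n) for y in labels]


def rebalanced_sample(rows, labels, size, rng):
    """Resample a chunk so that classes are balanced in expectation."""
    weights = importance_weights(labels)
    idx = rng.choices(range(len(rows)), weights=weights, k=size)
    return [rows[i] for i in idx], [labels[i] for i in idx]


def weighted_vote(predictions, group_weights):
    """Combine one prediction per feature group by a weighted vote.

    Assumption: each group weight is a non-negative score standing in for
    the sub-classifier's discriminative power and stability.
    """
    score = {}
    for pred, w in zip(predictions, group_weights):
        score[pred] = score.get(pred, 0.0) + w
    return max(score, key=score.get)


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [0] * 90 + [1] * 10          # 9:1 imbalance
    rows = [[float(i)] for i in range(100)]
    _, resampled = rebalanced_sample(rows, labels, 1000, rng)
    print(Counter(resampled))             # counts should be roughly equal
    print(weighted_vote([1, 0, 0], [0.7, 0.2, 0.3]))
```

In a streaming setting, a sketch like this would be applied per data chunk: each feature group's sub-classifier is trained on a rebalanced resample, and the group weights are refreshed as new chunks arrive. How DFGW-IS actually computes those weights is not stated in the abstract.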


