Extending k-Means-Based Algorithms for Evolving Data Streams with Variable Number of Clusters

2011 10th International Conference on Machine Learning and Applications and Workshops Pub Date : 2011-12-18 DOI:10.1109/ICMLA.2011.67

J. Silva, Eduardo R. Hruschka


Citations: 20

Abstract

Many algorithms for clustering data streams based on the widely used k-Means have been proposed in the literature. Most of them assume that the number of clusters, k, is known and fixed a priori by the user. To relax this assumption, which is often unrealistic in practical applications, we describe an algorithmic framework that allows k to be estimated automatically from the data. We illustrate the potential of the proposed framework by using three state-of-the-art algorithms for clustering data streams - Stream LSearch, CluStream, and Stream KM++ - combined with two well-known algorithms for estimating the number of clusters, namely Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). As an additional contribution, we experimentally compare the resulting algorithmic instantiations on both synthetic and real-world data streams. Analyses of statistical significance suggest that OMRk yields the best data partitions, while BkM is more computationally efficient. Also, the combination of Stream KM++ with OMRk leads to the best trade-off between accuracy and efficiency.
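The idea of estimating k by ordered multiple runs of k-Means can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so the sketch assumes that candidate partitions are scored with a simplified (centroid-based) silhouette, that k-means++-style seeding is used, and that the input is a small in-memory batch (for example, the prototypes a stream summary would hand over). The names `estimate_k`, `kmeans`, `kmeanspp_init`, and `simplified_silhouette` are hypothetical.

```python
import numpy as np


def sq_dists(X, C):
    """Squared Euclidean distance from every row of X to every row of C."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)


def kmeanspp_init(X, k, rng):
    """k-means++-style seeding (an assumption of this sketch)."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = sq_dists(X, np.array(centroids)).min(axis=1)
        total = d2.sum()
        # Fall back to uniform sampling if all points coincide with a centroid.
        probs = d2 / total if total > 0 else np.full(len(X), 1.0 / len(X))
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids, dtype=float)


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    centroids = kmeanspp_init(X, k, rng)
    for _ in range(n_iter):
        labels = sq_dists(X, centroids).argmin(axis=1)
        new = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, sq_dists(X, centroids).argmin(axis=1)


def simplified_silhouette(X, centroids, labels):
    """Mean of (b - a) / max(a, b), with a and b measured to centroids."""
    d = np.sqrt(sq_dists(X, centroids))
    idx = np.arange(len(X))
    a = d[idx, labels]              # distance to own centroid
    d_other = d.copy()
    d_other[idx, labels] = np.inf
    b = d_other.min(axis=1)         # distance to nearest other centroid
    denom = np.maximum(a, b)
    safe = np.where(denom > 0, denom, 1.0)
    return float(np.where(denom > 0, (b - a) / safe, 0.0).mean())


def estimate_k(X, k_max, n_runs=5, seed=0):
    """Try k = 2..k_max in order, several runs each; keep the best partition."""
    rng = np.random.default_rng(seed)
    best_score, best_k, best_c, best_l = -np.inf, None, None, None
    for k in range(2, k_max + 1):
        for _ in range(n_runs):
            c, l = kmeans(X, k, rng)
            score = simplified_silhouette(X, c, l)
            if score > best_score:
                best_score, best_k, best_c, best_l = score, k, c, l
    return best_k, best_c, best_l
```

Calling `estimate_k(X, k_max=6)` on a batch `X` of shape `(n_points, n_features)` returns the selected k together with the centroids and labels of the best-scoring partition. The cost grows with both `k_max` and `n_runs`, which is consistent with the abstract's remark that a bisecting strategy is computationally cheaper.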


