DistStream: An Order-Aware Distributed Framework for Online-Offline Stream Clustering Algorithms

2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS) Pub Date : 2020-11-01 DOI:10.1109/ICDCS47774.2020.00075

Lijie Xu, Xingtong Ye, Kai Kang, Tian Guo, Wensheng Dou, Wei Wang, Jun Wei

{"title":"DistStream: An Order-Aware Distributed Framework for Online-Offline Stream Clustering Algorithms","authors":"Lijie Xu, Xingtong Ye, Kai Kang, Tian Guo, Wensheng Dou, Wei Wang, Jun Wei","doi":"10.1109/ICDCS47774.2020.00075","DOIUrl":null,"url":null,"abstract":"Stream clustering is an important data mining technique to capture the evolving patterns in real-time data streams. Today’s data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically-changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed for running in a single machine. These stream clustering algorithms, with this sequential update model, cannot be efficiently parallelized and fail to deliver the required high throughput for stream clustering.In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the computation steps during parallel execution, which can reflect the recent changes for dynamically-changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and comparable (99%) clustering quality with their single-machine counterparts.","PeriodicalId":158630,"journal":{"name":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS47774.2020.00075","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Stream clustering is an important data mining technique to capture the evolving patterns in real-time data streams. Today’s data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically-changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed for running in a single machine. These stream clustering algorithms, with this sequential update model, cannot be efficiently parallelized and fail to deliver the required high throughput for stream clustering.In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the computation steps during parallel execution, which can reflect the recent changes for dynamically-changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and comparable (99%) clustering quality with their single-machine counterparts.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

DistStream:在线-离线流聚类算法的顺序感知分布式框架

流聚类是一种重要的数据挖掘技术，用于捕捉实时数据流中不断变化的模式。今天的数据流，例如物联网事件和网页点击，通常是高速的，并且包含动态变化的模式。现有的流聚类算法通常采用在线-离线模式，采用一次一条记录的更新模型，该模型是为在一台机器上运行而设计的。使用这种顺序更新模型的流聚类算法不能有效地并行化，不能提供流聚类所需的高吞吐量。在本文中，我们提出了DistStream，一个分布式框架，可以有效地扩展在线-离线流聚类算法。为了并行化这些算法以获得高吞吐量，我们开发了一个具有高效并行化方法的小批量更新模型。为了保持高集群质量，DistStream的小批量更新模型在并行执行过程中保留了所有计算步骤的更新顺序，可以反映动态变化的流数据的最新变化。我们在Spark Streaming之上实现了DistStream，以及基于DistStream的四种代表性流聚类算法。我们对三个真实数据集的评估表明，基于diststream的流聚类算法可以实现亚线性吞吐量增益，并且与单机同类算法的聚类质量相当(99%)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊