FastJoin: A Skewness-Aware Distributed Stream Join System

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Pub Date: 2019-05-20, DOI: 10.1109/IPDPS.2019.00111

Shunjie Zhou, Fan Zhang, Hanhua Chen, Hai Jin, B. Zhou


Citations: 12

Abstract

In the big data era, many applications, such as stock trading and online advertisement analysis, must perform fast and accurate join operations on large-scale real-time data streams. To achieve high throughput and low latency, distributed stream join systems rely on stream partitioning strategies to execute the complex stream join procedure in parallel. Existing systems mainly deploy two kinds of partitioning strategies: random partitioning and hash partitioning. The random partitioning strategy partitions one data stream uniformly while broadcasting all the tuples of the other data stream. This simple strategy may incur many unnecessary computations for low-selectivity stream joins. The hash partitioning strategy maps the tuples of both data streams to join instances according to their join attributes. However, hash partitioning suffers from a serious load imbalance problem caused by the skewed distribution of the attributes, which is common in real-world data. The skewed load may seriously degrade system performance. In this paper, we carefully model the load skewness problem in distributed join systems. We explore the key tuples that lead to heavy load skewness and propose an efficient key selection algorithm, GreedyFit, to identify these key tuples. We design a lightweight tuple migration strategy to solve the load imbalance problem in real time and implement a new distributed stream join system, FastJoin. Experimental results using real-world data show that FastJoin significantly improves system performance in terms of throughput and latency compared to state-of-the-art stream join systems.
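To make the trade-off between the two partitioning strategies concrete, the following is a minimal Python sketch and not the paper's implementation. The function names (`hash_partition`, `worker_loads`, `imbalance`, `broadcast_copies`), the synthetic key distribution, and the max-over-mean imbalance measure are all assumptions chosen for illustration; the abstract does not specify how FastJoin models load or how GreedyFit selects keys.

```python
from collections import Counter


def hash_partition(key: int, n_workers: int) -> int:
    """Hash partitioning: every tuple with the same join key goes to one worker."""
    return hash(key) % n_workers


def worker_loads(keys, n_workers):
    """Count how many tuples each worker receives under hash partitioning."""
    loads = Counter()
    for k in keys:
        loads[hash_partition(k, n_workers)] += 1
    return [loads[w] for w in range(n_workers)]


def imbalance(loads):
    """Illustrative skew measure (an assumption): heaviest load over mean load."""
    return max(loads) / (sum(loads) / len(loads))


def broadcast_copies(n_tuples: int, n_workers: int) -> int:
    """Random partitioning broadcasts the second stream: one copy per worker."""
    return n_tuples * n_workers


# Synthetic skewed stream: key 0 accounts for half of all 1000 tuples,
# and keys 1..100 appear 5 times each.
skewed_keys = [0] * 500 + [k for k in range(1, 101) for _ in range(5)]

loads = worker_loads(skewed_keys, 4)
print(loads)             # [625, 125, 125, 125]
print(imbalance(loads))  # 2.5
print(broadcast_copies(len(skewed_keys), 4))  # 4000
```

In this toy setting, one hot key makes a single worker carry 2.5 times the mean load under hash partitioning, while broadcasting the same stream balances the load but quadruples the number of tuples that have to be shipped and probed. This is the tension the abstract describes; migrating a few heavy keys away from the overloaded worker is one way to relieve it.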
