Popularity-aware differentiated distributed stream processing on skewed streams
Hanhua Chen, Fan Zhang, Hai Jin
2017 IEEE 25th International Conference on Network Protocols (ICNP), pp. 1-10, published 2017-10-01
DOI: 10.1109/ICNP.2017.8117551 (https://doi.org/10.1109/ICNP.2017.8117551)
Citations: 14
Abstract
Real-world stream data with skewed distributions raise unique challenges for distributed stream processing systems. Existing stream workload partitioning schemes usually adopt a “one size fits all” design that uses either shuffle grouping or key grouping to partition the workload among multiple processing units, which leads to unsatisfactory system throughput and processing latency. In this paper, we show that key-grouping-based schemes suffer serious load imbalance and low computation efficiency in the presence of data skew, while shuffle grouping schemes do not scale in terms of memory space. We argue that the key to efficient stream scheduling is the popularity of the stream data. We propose and implement a differentiated distributed stream processing system, called DStream, which assigns popular keys using shuffle grouping and unpopular ones using key grouping. We design a novel, efficient, and lightweight probabilistic counting scheme for identifying the current hot keys in dynamic real-time streams. Two factors contribute to the power of this design: 1) the probabilistic counting scheme is extremely efficient in both computation and memory, so it can be integrated into the processing instances of the system; 2) the scheme adapts to popularity changes in the dynamic stream processing environment. We implement DStream on top of Apache Storm. Experimental results using large-scale traces from real-world systems show that DStream achieves a 2.3× improvement in processing throughput and reduces processing latency by 64% compared to state-of-the-art designs.
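The differentiated routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the DStream implementation and does not use Apache Storm: the class name `DifferentiatedPartitioner`, the `hot_threshold` cutoff, and the exact `Counter` are all assumptions made for illustration. In particular, the exact counter is only a stand-in for the paper's probabilistic counting scheme, whose details the abstract does not give.

```python
import hashlib
from collections import Counter


class DifferentiatedPartitioner:
    """Hypothetical router: hot keys are spread round-robin (shuffle
    grouping); all other keys go to a fixed worker chosen by hash
    (key grouping)."""

    def __init__(self, num_workers, hot_threshold):
        self.num_workers = num_workers
        # Assumed cutoff for calling a key "hot"; not taken from the paper.
        self.hot_threshold = hot_threshold
        # Exact counts, used here in place of a probabilistic counter.
        self.counts = Counter()
        # Round-robin cursor used for hot keys.
        self._next = 0

    def _hash_worker(self, key):
        # Stable hash so that a cold key always maps to the same worker.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_workers

    def route(self, key):
        """Return the index of the worker that should process this tuple."""
        self.counts[key] += 1
        if self.counts[key] > self.hot_threshold:
            worker = self._next
            self._next = (self._next + 1) % self.num_workers
            return worker
        return self._hash_worker(key)


if __name__ == "__main__":
    partitioner = DifferentiatedPartitioner(num_workers=4, hot_threshold=10)
    # Skewed toy stream: one very popular key plus many rare keys.
    stream = ["hot"] * 1000 + [f"cold-{i}" for i in range(100)] * 2
    load = Counter(partitioner.route(key) for key in stream)
    print(dict(load))
```

With pure key grouping, every tuple of the popular key would land on one worker. In this sketch only the first `hot_threshold` occurrences do, and the rest are spread evenly, while rare keys keep a single owner so their per-key state is not replicated across workers. The sketch never demotes a key once it is hot; adapting to popularity changes, as the abstract describes, would need some form of count decay or windowing.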