Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pub Date: 2020-06-11

DOI: 10.1145/3318464.3389713

A. S. Abdelhamid, Ahmed R. Mahmood, Anas Daghistani, Walid G. Aref


Citations: 10

Abstract

Advances in real-world applications require high-throughput processing over large data streams. Micro-batching has been proposed to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved, where the incoming data tuples are first buffered as data blocks, and then are processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that is to conform to the application's service-level agreement. In contrast to tuple-at-a-time data stream processing, micro-batching has the potential to sustain higher data rates. However, existing micro-batch stream processing systems use basic data-partitioning techniques that do not account for data skew and variable data rates. Load-awareness is necessary to maintain performance and to enhance resource utilization. A new data partitioning scheme termed Prompt is presented that leverages the characteristics of the micro-batch processing model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics, and provides online key-based sorting as data tuples arrive. Because achieving optimal data partitioning is NP-Hard in this context, a workload-aware greedy algorithm is introduced that partitions the buffered data tuples efficiently for the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes. Experimental results using real and synthetic data sets demonstrate that Prompt is robust against fluctuations in data distribution and arrival rates. Furthermore, Prompt achieves up to 200% improvement in system throughput over state-of-the-art techniques without degradation in latency.
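To make the partitioning idea concrete, the following is a minimal sketch of a frequency-aware greedy assignment for one buffered micro-batch. It is not the algorithm from the paper, whose details the abstract does not give. It is the classic "heaviest key first, least-loaded partition next" heuristic, used here only to illustrate how per-key counts gathered during batching can drive a load-balanced split. The function name `greedy_partition`, its parameters, and the simplification that all tuples of a key stay in one partition are assumptions made for this example.

```python
from collections import Counter
import heapq


def greedy_partition(keys, num_partitions):
    """Hypothetical helper: assign each distinct key to one partition.

    Keys are visited from most to least frequent, and each is placed on
    the partition that currently holds the fewest tuples. Returns the
    key -> partition mapping and the resulting tuple count per partition.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")

    # Per-key frequencies, standing in for the run-time statistics
    # that would be maintained while the batch is being buffered.
    freq = Counter(keys)

    # Min-heap of (current load, partition index).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)

    assignment = {}
    # Sort by descending frequency; break ties by key text for determinism.
    for key, count in sorted(freq.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        load, p = heapq.heappop(heap)
        assignment[key] = p
        heapq.heappush(heap, (load + count, p))

    loads = [0] * num_partitions
    for key, p in assignment.items():
        loads[p] += freq[key]
    return assignment, loads


if __name__ == "__main__":
    # A skewed batch: key "a" accounts for half of the tuples.
    batch = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
    assignment, loads = greedy_partition(batch, 2)
    print(assignment)
    print(loads)
```

In this skewed batch the hot key ends up alone on one partition while the three lighter keys share the other, so both partitions receive six tuples. A plain hash partitioner has no such guarantee, which is the kind of imbalance under data skew that the abstract attributes to basic partitioning techniques.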


