Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems

A. S. Abdelhamid, Ahmed R. Mahmood, Anas Daghistani, Walid G. Aref
{"title":"Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems","authors":"A. S. Abdelhamid, Ahmed R. Mahmood, Anas Daghistani, Walid G. Aref","doi":"10.1145/3318464.3389713","DOIUrl":null,"url":null,"abstract":"Advances in real-world applications require high-throughput processing over large data streams. Micro-batching has been proposed to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved, where the incoming data tuples are first buffered as data blocks, and then are processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that is to conform to the application's service-level agreement. In contrast to tuple-at-a-time data stream processing, micro-batching has the potential to sustain higher data rates. However, existing micro-batch stream processing systems use basic data-partitioning techniques that do not account for data skew and variable data rates. Load-awareness is necessary to maintain performance and to enhance resource utilization. A new data partitioning scheme termed Prompt is presented that leverages the characteristics of the micro-batch processing model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics, and provides online key-based sorting as data tuples arrive. Because achieving optimal data partitioning is NP-Hard in this context, a workload-aware greedy algorithm is introduced that partitions the buffered data tuples efficiently for the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes. Experimental results using real and synthetic data sets demonstrate that Prompt is robust against fluctuations in data distribution and arrival rates. Furthermore, Prompt achieves up to 200% improvement in system throughput over state-of-the-art techniques without degradation in latency.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"219 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3318464.3389713","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Advances in real-world applications require high-throughput processing over large data streams. Micro-batching has been proposed to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved: incoming data tuples are first buffered as data blocks, and are then processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a response-time latency that conforms to the application's service-level agreement. In contrast to tuple-at-a-time data stream processing, micro-batching has the potential to sustain higher data rates. However, existing micro-batch stream processing systems use basic data-partitioning techniques that do not account for data skew and variable data rates. Load-awareness is necessary to maintain performance and to enhance resource utilization. A new data-partitioning scheme, termed Prompt, is presented that leverages the characteristics of the micro-batch processing model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics and provides online key-based sorting as data tuples arrive. Because achieving optimal data partitioning is NP-Hard in this context, a workload-aware greedy algorithm is introduced that efficiently partitions the buffered data tuples for the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes. Experimental results using real and synthetic data sets demonstrate that Prompt is robust against fluctuations in data distribution and arrival rates. Furthermore, Prompt achieves up to 200% improvement in system throughput over state-of-the-art techniques without degradation in latency.
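
To make the partitioning idea concrete, the sketch below illustrates a frequency-aware greedy key-to-partition assignment of the kind the abstract describes for the Map stage: keys are ranked by their observed frequency within the buffered micro-batch, and each key is placed on the currently least-loaded partition. This is a minimal illustrative sketch, not the paper's actual algorithm; the function name greedy_partition, the use of Python's Counter and a min-heap, and the longest-processing-time-style heuristic are assumptions made here for exposition.

```python
from collections import Counter
from heapq import heapify, heappush, heappop

def greedy_partition(batch, num_partitions):
    """Greedily assign the keys of a buffered micro-batch to partitions.

    batch is an iterable of (key, value) tuples buffered during the
    batching phase. Keys are considered from most to least frequent,
    and each key is assigned to the currently least-loaded partition,
    which balances partition sizes under skewed key distributions.
    Returns a dict mapping key -> partition id.
    """
    freq = Counter(key for key, _ in batch)          # run-time key statistics
    loads = [(0, p) for p in range(num_partitions)]  # (current load, partition id)
    heapify(loads)
    assignment = {}
    for key, count in freq.most_common():            # heaviest keys first
        load, pid = heappop(loads)                   # least-loaded partition
        assignment[key] = pid
        heappush(loads, (load + count, pid))
    return assignment

# A skewed micro-batch: key 'a' dominates the arrivals.
batch = [('a', 1)] * 6 + [('b', 1)] * 3 + [('c', 1)] * 2 + [('d', 1)]
print(greedy_partition(batch, num_partitions=2))
# e.g. {'a': 0, 'b': 1, 'c': 1, 'd': 1} -> partition loads of 6 and 6
```

Placing the heaviest keys first means a single hot key occupies one partition while the remaining keys fill the others, which captures the intuition of skew-aware partitioning described above.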