System-aware dynamic partitioning for batch and streaming workloads

Zoltan Zvara, Péter G. N. Szabó, Balázs Barnabás Lóránt, András A. Benczúr
Venue: Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing
Publication date: 2021-05-31
DOI: 10.1145/3468737.3494087 (https://doi.org/10.1145/3468737.3494087)
Citations: 1

Abstract

When processing data streams with highly skewed and nonstationary key distributions, we often observe overloaded partitions because hash partitioning fails to balance the data. To avoid slow tasks that delay the completion of a whole stage of computation, it is necessary to apply adaptive, on-the-fly partitioning that continuously recomputes an optimal partitioner from the observed key distribution. While such solutions exist for batch processing of static data sets and for stateless stream processing, the task is difficult for long-running stateful streaming jobs in which the key distribution changes over time. Careful checkpointing and operator state migration are necessary to change the partitioning while the operation is running. Our key result is a lightweight, on-the-fly Dynamic Repartitioning (DR) module for distributed data processing systems (DDPS), including Apache Spark and Flink, which improves performance with negligible overhead. DR can adaptively repartition data during execution using our Key Isolator Partitioner (KIP). In our experiments with real workloads and power-law distributions, we reach speedups of 1.5x to 6x for a variety of Spark and Flink jobs.
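The abstract does not spell out how the Key Isolator Partitioner works, so the following is only an illustrative sketch of the general idea of isolating heavy keys, not the authors' algorithm. It assumes that key frequencies have already been observed, pins the heaviest keys to the currently least-loaded partition with a simple greedy rule, and lets all remaining keys fall back to plain hash partitioning. All function names and the `num_heavy` parameter are hypothetical.

```python
import zlib
from collections import Counter


def hash_partition(key: str, num_partitions: int) -> int:
    """Plain hash partitioning; crc32 is used so results are stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_isolating_partitioner(key_counts: Counter, num_partitions: int, num_heavy: int):
    """Return a partition function that pins the heaviest keys explicitly.

    Assumption: a greedy placement that only tracks the load of pinned keys.
    The load of the hashed tail is ignored here for simplicity.
    """
    loads = [0] * num_partitions
    explicit = {}
    for key, count in key_counts.most_common(num_heavy):
        # Place each heavy key on the partition with the smallest pinned load.
        target = min(range(num_partitions), key=loads.__getitem__)
        explicit[key] = target
        loads[target] += count

    def partition(key: str) -> int:
        if key in explicit:
            return explicit[key]
        return hash_partition(key, num_partitions)

    return partition


def partition_loads(key_counts: Counter, partitioner, num_partitions: int) -> list:
    """Total number of records each partition receives under a partitioner."""
    loads = [0] * num_partitions
    for key, count in key_counts.items():
        loads[partitioner(key)] += count
    return loads


if __name__ == "__main__":
    # Synthetic power-law-like key frequencies (made up for illustration).
    counts = Counter({f"key{i}": 1000 // (i + 1) for i in range(50)})
    n = 4
    hashed = partition_loads(counts, lambda k: hash_partition(k, n), n)
    isolated = partition_loads(counts, build_isolating_partitioner(counts, n, 8), n)
    print("hash partitioning loads:   ", hashed)
    print("heavy-key isolation loads: ", isolated)
```

In a streaming setting the observed counts would change over time, so a partitioner like this would have to be rebuilt periodically, and any state held for a key that moves would have to be migrated, which is the part the abstract identifies as difficult for stateful jobs.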