ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks

arXiv publication date: 2024-03-07 | DOI: 10.1145/3629526.3645036
Sören Henning, Adriano Vogel, Michael Leichtfried, Otmar Ertl, Rick Rabiser
{"title":"ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks","authors":"Sören Henning, Adriano Vogel, Michael Leichtfried, Otmar Ertl, Rick Rabiser","doi":"10.1145/3629526.3645036","DOIUrl":null,"url":null,"abstract":"Distributed stream processing frameworks help building scalable and reliable applications that perform transformations and aggregations on continuous data streams. This paper introduces ShuffleBench, a novel benchmark to evaluate the performance of modern stream processing frameworks. In contrast to other benchmarks, it focuses on use cases where stream processing frameworks are mainly employed for shuffling (i.e., re-distributing) data records to perform state-local aggregations, while the actual aggregation logic is considered as black-box software components. ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform and takes up benchmarking metrics and methods for latency, throughput, and scalability established in the performance engineering research community. Although inspired by a real-world observability use case, it is highly configurable to allow domain-independent evaluations. ShuffleBench comes as a ready-to-use open-source software utilizing existing Kubernetes tooling and providing implementations for four state-of-the-art frameworks. Therefore, we expect ShuffleBench to be a valuable contribution to both industrial practitioners building stream processing applications and researchers working on new stream processing approaches. We complement this paper with an experimental performance evaluation that employs ShuffleBench with various configurations on Flink, Hazelcast, Kafka Streams, and Spark in a cloud-native environment. Our results show that Flink achieves the highest throughput while Hazelcast processes data streams with the lowest latency.","PeriodicalId":513202,"journal":{"name":"ArXiv","volume":"22 47","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ArXiv","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3629526.3645036","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Distributed stream processing frameworks help build scalable and reliable applications that perform transformations and aggregations on continuous data streams. This paper introduces ShuffleBench, a novel benchmark for evaluating the performance of modern stream processing frameworks. In contrast to other benchmarks, it focuses on use cases where stream processing frameworks are mainly employed for shuffling (i.e., re-distributing) data records to perform state-local aggregations, while the actual aggregation logic is treated as a black-box software component. ShuffleBench is inspired by requirements for near real-time analytics in a large cloud observability platform and adopts benchmarking metrics and methods for latency, throughput, and scalability established in the performance engineering research community. Although inspired by a real-world observability use case, it is highly configurable to allow domain-independent evaluations. ShuffleBench comes as ready-to-use open-source software that utilizes existing Kubernetes tooling and provides implementations for four state-of-the-art frameworks. We therefore expect ShuffleBench to be a valuable contribution both to industrial practitioners building stream processing applications and to researchers working on new stream processing approaches. We complement this paper with an experimental performance evaluation that employs ShuffleBench with various configurations on Flink, Hazelcast, Kafka Streams, and Spark in a cloud-native environment. Our results show that Flink achieves the highest throughput, while Hazelcast processes data streams with the lowest latency.
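
The abstract describes the core ShuffleBench workload: records are shuffled (re-keyed and re-distributed) across processing instances so that a state-local aggregation can run on each instance, while the aggregation logic itself is treated as a black box. The following sketch illustrates that pattern with the Kafka Streams DSL, one of the four benchmarked frameworks. It is a minimal, illustrative assumption of the shuffle-then-aggregate structure, not the actual ShuffleBench implementation; the topic names, the key-extraction step, and the blackBoxAggregate placeholder are invented for this example.

```java
// Illustrative sketch of a shuffle + state-local aggregation topology in the
// Kafka Streams DSL. Topic names, key extraction, and the black-box aggregation
// are placeholders assumed for this example, not taken from the paper.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class ShuffleAggregateSketch {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("records", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
            // Re-key each record: the new key determines the partition the record
            // is shuffled to, so all records with the same key meet on one instance.
            .selectKey((ignoredKey, value) -> extractAggregationKey(value))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.ByteArray()))
            // State-local aggregation; the aggregation logic itself is a black box
            // from the benchmark's point of view.
            .aggregate(
                () -> 0L,
                (key, value, state) -> blackBoxAggregate(state, value),
                Materialized.with(Serdes.String(), Serdes.Long()))
            .toStream()
            .to("results", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "shuffle-aggregate-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder key extraction (hypothetical): derives an aggregation key from the record.
    private static String extractAggregationKey(byte[] value) {
        return Integer.toString(value.length % 1024);
    }

    // Placeholder black-box aggregation (hypothetical): folds the record into per-key state.
    private static Long blackBoxAggregate(Long state, byte[] value) {
        return state + value.length;
    }
}
```

The re-keying step is what forces the framework to shuffle records over the network; the per-key aggregation state then stays local to the instance that owns the key, which is the behavior ShuffleBench measures for latency, throughput, and scalability.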