StreamOps: Cloud-Native Runtime Management for Streaming Services in ByteDance

Proceedings of the VLDB Endowment (IF 3.3; Zone 3, Computer Science; JCR Q2, Computer Science, Information Systems) Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611543

Yancan Mao, Zhanghao Chen, Yifan Zhang, Meng Wang, Yong Fang, Guanghui Zhang, Rui Shi, Richard T. B. Ma


Citation count: 0

Abstract

Stream processing is widely used for real-time data processing and decision-making, and tens of thousands of streaming jobs are deployed in the ByteDance cloud. Because these jobs usually run for several days or longer and their input workloads vary over time, they face diverse runtime issues such as processing lag and various failures. Resolving such issues automatically requires runtime management. However, designing a runtime management service at ByteDance scale is challenging. In particular, the service has to manage cluster-wide streaming jobs concurrently in a scalable and extensible manner, and it must manage diverse streaming jobs effectively. To this end, we propose StreamOps, which enables cloud-native runtime management for streaming jobs at ByteDance. StreamOps has three main design features that address these challenges. 1) For scalability, StreamOps runs as a standalone, lightweight control plane that manages cluster-wide streaming jobs. 2) For extensible runtime management, StreamOps abstracts control policies that identify and resolve runtime issues. New control policies can be implemented with a detect-diagnose-resolve programming paradigm, and each control policy can be configured for different streaming jobs according to their performance requirements. 3) To mitigate processing lag and handle failures effectively, StreamOps features three control policies, namely an auto-scaler, a straggler detector, and a job doctor, which are inspired by state-of-the-art research and production experience at ByteDance. In this paper, we present the design decisions we made and the lessons we learned from building StreamOps. We evaluate StreamOps in our production environment, and the experimental results further validate our system design.
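The abstract does not show what a control policy looks like in code. As a rough illustration only, the detect-diagnose-resolve paradigm could be sketched as follows. Every name here (`JobMetrics`, `ControlPolicy`, `LagAutoScaler`, the thresholds, and the returned action dictionary) is a hypothetical assumption for this sketch and is not taken from StreamOps itself.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobMetrics:
    """Hypothetical snapshot of one streaming job's runtime metrics."""
    job_id: str
    lag_seconds: float      # how far processing trails the input stream
    cpu_utilization: float  # average task utilization in [0, 1]
    parallelism: int        # current number of parallel tasks


class ControlPolicy(ABC):
    """Hypothetical base class: one policy = detect + diagnose + resolve."""

    @abstractmethod
    def detect(self, m: JobMetrics) -> bool:
        """Return True if the job shows a symptom this policy handles."""

    @abstractmethod
    def diagnose(self, m: JobMetrics) -> Optional[str]:
        """Return a root-cause label, or None if no cause is identified."""

    @abstractmethod
    def resolve(self, m: JobMetrics, cause: str) -> dict:
        """Return an action description for the diagnosed cause."""

    def run(self, m: JobMetrics) -> Optional[dict]:
        # Chain the three phases; stop early when a phase finds nothing.
        if not self.detect(m):
            return None
        cause = self.diagnose(m)
        if cause is None:
            return None
        return self.resolve(m, cause)


class LagAutoScaler(ControlPolicy):
    """Toy auto-scaler (assumed logic): double parallelism when a lagging
    job is also CPU-bound. Thresholds are per-instance, so each job can
    get its own configuration."""

    def __init__(self, max_lag_seconds: float = 60.0,
                 busy_cpu: float = 0.8, scale_factor: int = 2):
        self.max_lag_seconds = max_lag_seconds
        self.busy_cpu = busy_cpu
        self.scale_factor = scale_factor

    def detect(self, m: JobMetrics) -> bool:
        return m.lag_seconds > self.max_lag_seconds

    def diagnose(self, m: JobMetrics) -> Optional[str]:
        # Lag with low CPU points elsewhere (e.g. a straggler), so skip.
        return "under_provisioned" if m.cpu_utilization >= self.busy_cpu else None

    def resolve(self, m: JobMetrics, cause: str) -> dict:
        return {"job_id": m.job_id, "action": "rescale",
                "parallelism": m.parallelism * self.scale_factor}


if __name__ == "__main__":
    policy = LagAutoScaler(max_lag_seconds=30.0)
    print(policy.run(JobMetrics("job-a", 120.0, 0.95, 4)))
```

Keeping the three phases as separate methods is one plausible way to make policies extensible: a new policy overrides only the phases it needs, while the control plane calls `run` uniformly for every job.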


Source journal

Proceedings of the VLDB Endowment (Computer Science: General Computer Science)

CiteScore: 7.70

Self-citation rate: 0.00%

Journal description: The Proceedings of the VLDB Endowment (PVLDB) welcomes original research papers on a broad range of research topics related to all aspects of data management, where systems issues play a significant role, such as data management system technology and information management infrastructures, including their very large scale of experimentation, novel architectures, and demanding applications as well as their underpinning theory. The scope of a submission for PVLDB is also described by the subject areas given below. Moreover, the scope of PVLDB is restricted to scientific areas that are covered by the combined expertise on the submission’s topic of the journal’s editorial board. Finally, the submission’s contributions should build on work already published in data management outlets, e.g., PVLDB, VLDBJ, ACM SIGMOD, IEEE ICDE, EDBT, ACM TODS, IEEE TKDE, and go beyond a syntactic citation.
