
Proceedings of the VLDB Endowment: Latest Articles

StreamOps: Cloud-Native Runtime Management for Streaming Services in ByteDance
Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611543
Yancan Mao, Zhanghao Chen, Yifan Zhang, Meng Wang, Yong Fang, Guanghui Zhang, Rui Shi, Richard T. B. Ma
Stream processing is widely used for real-time data processing and decision-making, and tens of thousands of streaming jobs are deployed in the ByteDance cloud. Because these jobs usually run for several days or longer and their input workloads vary over time, they face diverse runtime issues such as processing lag and various failures. Resolving such issues automatically requires runtime management. However, designing a runtime management service at ByteDance scale is challenging. In particular, the service has to concurrently manage cluster-wide streaming jobs in a scalable and extensible manner. Furthermore, it should manage diverse streaming jobs effectively. To this end, we propose StreamOps to enable cloud-native runtime management for streaming jobs in ByteDance. StreamOps has three main designs to address the challenges. 1) For scalability, StreamOps runs as a standalone lightweight control plane that manages cluster-wide streaming jobs. 2) To enable extensible runtime management, StreamOps abstracts control policies to identify and resolve runtime issues. New control policies can be implemented with a detect-diagnose-resolve programming paradigm. Each control policy is also configurable for different streaming jobs according to their performance requirements. 3) To mitigate processing lag and handle failures effectively, StreamOps features three control policies, i.e., auto-scaler, straggler detector, and job doctor, that are inspired by state-of-the-art research and production experience at ByteDance. In this paper, we introduce the design decisions we made and the lessons we learned from building StreamOps. We evaluate StreamOps in our production environment, and the experimental results further validate our system design.
Citations: 0
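The detect-diagnose-resolve paradigm named in the StreamOps abstract above might look roughly like the following. This is a minimal sketch, not the StreamOps API: the class names, metric fields, thresholds, and the doubling rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobMetrics:
    """Hypothetical per-job metrics snapshot (field names are assumptions)."""
    job_id: str
    lag_seconds: float      # how far processing trails the input
    cpu_utilization: float  # 0.0 to 1.0, averaged over tasks
    parallelism: int


class ControlPolicy:
    """Base class for the three phases; subclasses fill in each one."""

    def detect(self, m: JobMetrics) -> bool:
        raise NotImplementedError

    def diagnose(self, m: JobMetrics) -> Optional[str]:
        raise NotImplementedError

    def resolve(self, m: JobMetrics, cause: str) -> dict:
        raise NotImplementedError

    def run(self, m: JobMetrics) -> Optional[dict]:
        # Stop early when no symptom is detected or no cause is found.
        if not self.detect(m):
            return None
        cause = self.diagnose(m)
        if cause is None:
            return None
        return self.resolve(m, cause)


class LagAutoScaler(ControlPolicy):
    """Toy auto-scaler: scale out when lag is high and CPU is saturated."""

    def __init__(self, max_lag_seconds: float = 60.0, cpu_threshold: float = 0.8):
        # Per-job configuration, mirroring the "configurable policy" idea.
        self.max_lag_seconds = max_lag_seconds
        self.cpu_threshold = cpu_threshold

    def detect(self, m: JobMetrics) -> bool:
        return m.lag_seconds > self.max_lag_seconds

    def diagnose(self, m: JobMetrics) -> Optional[str]:
        return "cpu_bottleneck" if m.cpu_utilization >= self.cpu_threshold else None

    def resolve(self, m: JobMetrics, cause: str) -> dict:
        # Illustrative action only: double the parallelism.
        return {"job_id": m.job_id, "action": "rescale", "parallelism": m.parallelism * 2}
```

Calling `LagAutoScaler().run(metrics)` returns an action dictionary only when all three phases agree. Keeping the phases separate is what would let a new policy reuse the same control loop.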
Demonstrating Waffle: A Self-Driving Grid Index
Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611594
Dalsu Choi, Hyunsik Yoon, Hyubjin Lee, Yon Dohn Chung
This paper demonstrates Waffle, a self-driving grid indexing system for moving objects. We introduce the system architecture, system workflow, and user scenarios. Waffle enables the management of moving objects with less human effort while automatically improving performance.
Citations: 0
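For readers unfamiliar with grid indexes over moving objects, the basic structure that a system such as Waffle would tune can be sketched as below. This is a plain uniform grid with a hand-picked `cell_size`. It is an assumption-laden illustration and does not reflect Waffle's actual design or its self-driving logic.

```python
from collections import defaultdict


class GridIndex:
    """Uniform 2-D grid over moving objects; cell_size is fixed by the caller."""

    def __init__(self, cell_size: float):
        self.cell_size = cell_size
        self.cells = defaultdict(set)   # (cx, cy) -> ids of objects in that cell
        self.positions = {}             # object id -> (x, y)

    def _cell(self, x, y):
        # Floor division maps a coordinate to its cell, including negatives.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def upsert(self, obj_id, x, y):
        """Insert a new object or move an existing one."""
        old = self.positions.get(obj_id)
        if old is not None:
            self.cells[self._cell(*old)].discard(obj_id)
        self.positions[obj_id] = (x, y)
        self.cells[self._cell(x, y)].add(obj_id)

    def range_query(self, x_min, y_min, x_max, y_max):
        """Return ids of objects inside the closed rectangle, sorted."""
        cx0, cy0 = self._cell(x_min, y_min)
        cx1, cy1 = self._cell(x_max, y_max)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for obj_id in self.cells.get((cx, cy), ()):
                    x, y = self.positions[obj_id]
                    # Cells only narrow the search; check the exact position.
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        hits.append(obj_id)
        return sorted(hits)
```

The cell size trades update cost against query cost. Small cells mean more cell changes as objects move, and large cells mean more candidates per query. That trade-off is the kind of knob a self-driving index would plausibly adjust without human input.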
PikePlace: Generating Intelligence for Marketplace Datasets
Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611632
Shi Qiao, Alekh Jindal
There is renewed interest in data marketplaces, with cloud data warehouses making it extremely easy to share and access data on demand. However, analyzing marketplace datasets is challenging because current tools for creating data models are manual and slow. In this paper, we propose to demonstrate a learning-based approach to discover, deploy, and optimize data models. We present the resulting system, PikePlace, show an evaluation over Snowflake marketplace and TPC-H datasets, and describe several demonstration scenarios that the audience can play with.
Citations: 0
KG-Roar: Interactive Datalog-Based Reasoning on Virtual Knowledge Graphs
Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611609
Luigi Bellomarini, Marco Benedetti, Andrea Gentili, Davide Magnanimi, Emanuel Sallinger
Logic-based Knowledge Graphs (KGs) are gaining momentum in academia and industry thanks to the rise of expressive and efficient languages for Knowledge Representation and Reasoning (KRR). These languages accurately express business rules, through which valuable new knowledge is derived. A versatile and scalable backend reasoner such as Vadalog, a state-of-the-art system for logic-based KGs based on an extension of Datalog, executes the reasoning. In this demo, we present KG-Roar, a web-based interactive development and navigation environment for logical KGs. The system lets the user augment an input graph database with intensional definitions of new nodes and edges and turn it into a KG, via the metaphor of reasoning widgets: user-defined or off-the-shelf code snippets that capture business definitions in the Vadalog language. Then, the user can seamlessly browse the original and the derived nodes and edges within a "Virtual Knowledge Graph", which is reasoned upon and generated interactively at runtime, thanks to the scalability and responsiveness of Vadalog. KG-Roar is domain-independent but domain-aware, as exploration controls are contextually generated based on the intensional definitions. We walk the audience through KG-Roar, showcasing the construction of certain business definitions and putting them into action on a real-world financial KG from our work with the Bank of Italy.
Citations: 0
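To make "intensional definitions of new edges" in the KG-Roar abstract above concrete, here is a small sketch of Datalog-style derivation written in Python. The two rules, the `owns` and `controls` predicates, and the naive fixpoint loop are illustrative assumptions. They are plain Datalog in spirit, not Vadalog syntax and not how Vadalog evaluates rules.

```python
def derive_controls(owns):
    """Naive fixpoint evaluation of two hypothetical rules:

        controls(X, Y) :- owns(X, Y).
        controls(X, Z) :- controls(X, Y), owns(Y, Z).

    `owns` is a set of (owner, owned) pairs: the stored (extensional) edges.
    The result holds every controls edge, stored or derived.
    """
    controls = set(owns)
    while True:
        # Apply the recursive rule once to everything derived so far.
        new = {(x, z) for (x, y) in controls for (y2, z) in owns if y == y2}
        if new <= controls:      # nothing new: the fixpoint is reached
            return controls
        controls |= new
```

Subtracting the input edges from the result leaves exactly the derived edges, which correspond to what the abstract calls the virtual part of the graph. For example, the chain A owns B, B owns C, C owns D yields the derived edges (A, C), (B, D), and (A, D).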
DoveDB: A Declarative and Low-Latency Video Database
Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611582
Ziyang Xiao, Dongxiang Zhang, Zepeng Li, Sai Wu, Kian-Lee Tan, Gang Chen
To improve the usability and efficiency of managing video data generated by large-scale camera deployments, we demonstrate DoveDB, a declarative and low-latency video database. We devise a more comprehensive video query language called VMQL, which improves on the expressiveness of previous SQL-like languages by adding functionality for model-oriented management and deployment. We also propose a lightweight ingestion scheme to extract tracklets of all the moving objects and build semantic indexes to facilitate efficient query processing. For user interaction, we construct a simulation environment with 120 cameras deployed in a road network and demonstrate three interesting scenarios. Using VMQL, users can 1) train a visual model using a SQL-like statement and deploy it on dozens of target cameras simultaneously for online inference; 2) submit multi-object tracking (MOT) requests on target cameras, store the ingested results, and build semantic indexes; and 3) issue an aggregation or top-k query on the ingested cameras and obtain the response within milliseconds. A preliminary video introduction of DoveDB is available at https://www.youtube.com/watch?v=N139dEyvAJk
Citations: 0
Building a Collaborative Data Analytics System: Opportunities and Challenges
Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611580
Zuozhi Wang, Chen Li
Real-time collaboration has become increasingly important in various applications, from document creation to data analytics. Although collaboration features are prevalent in editing applications, they remain rare in data-analytics applications, where the need for collaboration is even more crucial. This tutorial aims to provide attendees with a comprehensive understanding of the challenges and design decisions associated with supporting real-time collaboration and user interactions in data analytics systems. We will discuss popular conflict resolution technologies, the unique challenges of facilitating collaborative experiences during the workflow construction and execution phases, and the complexities of supporting responsive user interactions during job execution.
Citations: 0
Eigen: End-to-End Resource Optimization for Large-Scale Databases on the Cloud
Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611565
Ji You Li, Jiachi Zhang, Wenchao Zhou, Yuhang Liu, Shuai Zhang, Zhuoming Xue, Ding Xu, Hua Fan, Fangyuan Zhou, Feifei Li
Increasingly, cloud database vendors host large-scale geographically distributed clusters to provide cloud database services. When managing the clusters, we observe that it is challenging to simultaneously maximize the resource allocation ratio and resource availability. This problem becomes more severe in modern cloud database clusters, where resource allocations occur more frequently and on a greater scale. To improve the resource allocation ratio without hurting resource availability, we introduce Eigen, a large-scale cloud-native cluster management system for large-scale databases on the cloud. Based on a resource flow model, we propose a hierarchical resource management system and three resource optimization algorithms that enable end-to-end resource optimization. Furthermore, we demonstrate system optimizations that improve user experience by reducing scheduling latencies and improving scheduling throughput. Eigen has been deployed in a large-scale public-cloud production environment for 30+ months and has served 30+ regions (100+ availability zones) globally. Based on the evaluation of real-world clusters and simulated experiments, Eigen can improve the allocation ratio by over 27% (from 60% to 87.0%) on average, while the ratio of delayed resource provisions is under 0.1%.
Citations: 2
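As a reading aid for the Eigen figures above: the reported improvement can be read as an absolute or a relative change, and the two differ considerably. The arithmetic below uses only the two rounded numbers quoted in the abstract (60% and 87.0%). Which reading the authors intend is not stated there.

```latex
\[
\text{absolute gain} = 87.0\% - 60\% = 27 \text{ percentage points},
\qquad
\text{relative gain} = \frac{87.0 - 60}{60} = 45\%.
\]
```

The quoted "over 27%" matches the absolute difference, so it most plausibly refers to percentage points.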
OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance from Database Query Event Logs
Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611555
Fotis Psallidas, Ashvin Agrawal, Chandru Sugunan, Khaled Ibrahim, Konstantinos Karanasos, Jesús Camacho-Rodríguez, Avrilia Floratou, Carlo Curino, Raghu Ramakrishnan
Provenance encodes information that connects datasets, their generation workflows, and associated metadata (e.g., who executed a query and when). As such, it is instrumental for a wide range of critical governance applications (e.g., observability and auditing). Unfortunately, in the context of database systems, extracting coarse-grained provenance is a long-standing problem due to the complexity and sheer volume of database workflows. Provenance extraction from query event logs has recently been proposed as favorable because, in principle, it can result in meaningful provenance graphs for provenance applications. Current approaches, however, (a) add substantial overhead to the database and provenance extraction workflows and (b) extract provenance that is noisy, omits query execution dependencies, and is not rich enough for upstream applications. To address these problems, we introduce OneProvenance: an efficient system for extracting provenance from query event logs. OneProvenance addresses the unique challenges of log-based extraction by (a) identifying query execution dependencies through efficient log analysis, (b) extracting provenance through novel event transformations that account for query dependencies, and (c) introducing effective filtering optimizations. Our thorough experimental analysis shows that OneProvenance can improve extraction by up to ~18X compared to state-of-the-art baselines; our optimizations reduce the extraction noise and optimize performance even further. OneProvenance is deployed at scale by Microsoft Purview and actively supports customer provenance extraction needs (https://bit.ly/3N2JVGF).
Citations: 1
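The idea of coarse-grained provenance from query event logs, as described in the OneProvenance abstract above, can be illustrated with a toy example. The event format here (dictionaries with `query_id`, `inputs`, and `output` keys) and both function names are hypothetical. Real database event logs and the OneProvenance transformations are considerably richer.

```python
from collections import defaultdict


def coarse_provenance(events):
    """Fold query events into a table-level graph: output table -> input tables.

    Each event is a dict with hypothetical keys 'query_id', 'inputs', 'output'.
    Read-only queries (output is None) add no edges, a simple form of filtering.
    """
    graph = defaultdict(set)
    for ev in events:
        if ev["output"] is None:
            continue
        graph[ev["output"]].update(ev["inputs"])
    return {table: sorted(sources) for table, sources in graph.items()}


def upstream(graph, table):
    """All tables that `table` transitively depends on, sorted by name."""
    seen, stack = set(), [table]
    while stack:
        for src in graph.get(stack.pop(), ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return sorted(seen)
```

With events that write `clean_orders` from `raw_orders` and then `report` from `clean_orders` and `customers`, `upstream(graph, "report")` returns all three source tables. That transitive lookup is the kind of lineage question an auditing application would ask.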
Big Data Analytic Toolkit: A General-Purpose, Modular, and Heterogeneous Acceleration Toolkit for Data Analytical Engines
Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611558
Jiang Li, Qi Xie, Yan Ma, Jian Ma, Kunshang Ji, Yizhong Zhang, Chaojun Zhang, Yixiu Chen, Gangsheng Wu, Jie Zhang, Kaidi Yang, Xinyi He, Qiuyang Shen, Yanting Tao, Haiwei Zhao, Penghui Jiao, Chengfei Zhu, David Qian, Cheng Xu
Query compilation and hardware acceleration are important technologies for optimizing the performance of data processing engines. Many works in recent years have explored and adopted these techniques. However, a number of engines still refrain from adopting them, for several reasons. One common reason is that the intricacies of these techniques make engines too complex to maintain. Another major barrier is the lack of widely accepted architectures and libraries for these techniques, so adoption often starts from scratch and requires considerable effort. In this paper, we propose Intel Big Data Analytic Toolkit (BDTK), an open-source C++ acceleration toolkit library for analytical data processing engines. BDTK provides lightweight, easy-to-connect, reusable components with interoperable interfaces to support query compilation and hardware accelerators. The query compilation in BDTK leverages vectorized execution and data-centric code generation to achieve high performance. BDTK can be integrated into different engines and helps them adopt query compilation and hardware accelerators to address performance bottlenecks with less engineering effort.
查询编译和硬件加速是优化数据处理引擎性能的重要技术。近年来,有许多关于这些技术的探索和采用的工作。然而,由于某些原因,许多引擎仍然不采用它们。其中一个常见的原因是,这些技术的复杂性使得引擎过于复杂,难以维护。另一个主要障碍是缺乏被广泛接受的这些技术的体系结构和库,这导致采用这些技术往往需要付出大量的努力。在本文中,我们提出了英特尔大数据分析工具包(BDTK),一个开源的c++加速工具包库,用于分析数据处理引擎。BDTK提供了轻量级、易于连接、可重用的组件和可互操作的接口,以支持查询编译和硬件加速器。BDTK中的查询编译利用向量化执行和以数据为中心的代码生成来实现高性能。BDTK可以集成到不同的引擎中,并帮助它们调整查询编译和硬件加速器,以更少的工程工作优化性能瓶颈。
Citations: 0
Time Series Data Mining: A Unifying View
Zone 3, Computer Science Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611570
Eamonn Keogh
Time series data are ubiquitous; large volumes of such data are routinely created in scientific, industrial, entertainment, medical and biological domains. Examples include ECG data, gait analysis, stock market quotes, machine health telemetry, search engine throughput volumes, etc. VLDB has traditionally been home to much of the community's best research on time series, with three to eight papers on time series appearing in the conference each year. What do we want to do with such time series? Everything! Classification, clustering, joins, anomaly detection, motif discovery, similarity search, visualization, summarization, compression, segmentation, rule discovery, etc. Rather than a deep dive into just one of these subtopics, in this tutorial I will show that a surprisingly small set of high-level representations, definitions, distance measures and primitives can be combined to solve the first 90% to 99.9% of the problems listed above. The tutorial will be illustrated with numerous real-world examples created just for this tutorial, including examples from robotics, wearables, medical telemetry, astronomy, and (especially) animal behavior. Moreover, all sample datasets and code snippets will be released so that the tutorial attendees (and later, readers) can first reproduce the results demonstrated before attempting similar analysis on their own data.
Citations: 0