
Proceedings of the VLDB Endowment: Latest Articles

DuckPGQ: Bringing SQL/PGQ to DuckDB
Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611614
Daniel ten Wolde, Gábor Szárnyas, Peter Boncz
We demonstrate the most important new feature of SQL:2023, namely SQL/PGQ, which eases querying graphs using SQL by introducing new syntax for pattern matching and (shortest) path-finding. We show how support for SQL/PGQ can be integrated into an RDBMS, specifically in the DuckDB system, using an extension module called DuckPGQ. As such, we also demonstrate the use of the DuckDB extensibility mechanism, which allows us to add new functions, data types, operators, optimizer rules, storage systems, and even parsers to DuckDB. We also describe the new data structures and algorithms that the DuckPGQ module is based on, and how they are injected into SQL plans. While the demonstrated DuckPGQ extension module is lean and efficient, we sketch a roadmap to (i) improve its performance through new algorithms (factorized and WCOJ) and better parallelism and (ii) extend its functionality to scenarios beyond SQL, e.g., building and analyzing Graph Neural Networks.
Citations: 0
Automatic SQL Error Mitigation in Oracle
Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611568
Krishna Kantikiran Pasupuleti, Jiakun Li, Hong Su, Mohamed Ziauddin
Despite best coding practices, software bugs are inevitable in a large codebase. In traditional databases, when errors occur during query processing, they disrupt user workflow until workarounds are found and applied. Manual identification of workarounds often relies on a trial-and-error method. The process is not only time-consuming but also requires domain expertise that users are often lacking. In this paper, we propose a framework to automatically mitigate errors that occur during query compilation (including optimization and code generation) without any user intervention. An error is intercepted by the database internally, a workaround is identified for it, and the query is recompiled using the workaround. The entire process remains transparent to the user with the query being executed seamlessly. The proposed technique handles SQL errors during query compilation and provides three types of mitigation strategies: (i) quickly fail over to one of the readily available historical plans for the statement; (ii) apply targeted error-correcting directives (hints) identified from the optimizer context at the time of the error; and (iii) modify the global configuration of the optimizer using hints. This feature has been implemented and will be released in an upcoming version of Oracle Autonomous Database.
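The intercept-and-recompile flow described in this abstract can be illustrated with a minimal Python sketch. It is not Oracle's implementation: the names `CompileError`, `compile_with_mitigation`, and the three strategy functions are hypothetical, and trying the strategies in the listed order is an assumption made only for illustration.

```python
from typing import Callable, Optional, Sequence


class CompileError(Exception):
    """Hypothetical stand-in for an internal error raised during query compilation."""


def compile_with_mitigation(
    compile_query: Callable[[], str],
    strategies: Sequence[Callable[[], Optional[str]]],
) -> str:
    """Compile normally; if that fails, try each mitigation strategy in order."""
    try:
        return compile_query()
    except CompileError as original:
        for strategy in strategies:
            try:
                plan = strategy()
            except CompileError:
                continue  # this workaround failed as well; try the next one
            if plan is not None:
                return plan  # the caller never sees the original error
        raise original  # no workaround applied; surface the original error


# Illustrative stand-ins for the three strategy types named in the abstract.
def failing_compile() -> str:
    raise CompileError("internal optimizer error")


def historical_plan() -> Optional[str]:
    return None  # assumption: no stored plan exists for this statement


def targeted_hint() -> Optional[str]:
    return "plan compiled with targeted hint"


def global_hint() -> Optional[str]:
    return "plan compiled with global optimizer hint"


chosen = compile_with_mitigation(
    failing_compile, [historical_plan, targeted_hint, global_hint]
)
print(chosen)
```

In this sketch a strategy returns `None` when it has nothing to offer and raises `CompileError` when its recompilation attempt fails too; both cases fall through to the next strategy.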
Citations: 0
Natural Language Interfaces for Databases with Deep Learning
Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611575
George Katsogiannis-Meimarakis, Mike Xydas, Georgia Koutrika
In the age of the Digital Revolution, almost all human activities, from industrial and business operations to medical and academic research, are reliant on the constant integration and utilisation of ever-increasing volumes of data. However, the explosive volume and complexity of data makes data querying and exploration challenging even for experts, and makes the need to democratise the access to data, even for non-technical users, all the more evident. It is time to lift all technical barriers, by empowering users to access relational databases through conversation. We consider 3 main research areas that a natural language data interface is based on: Text-to-SQL, SQL-to-Text, and Data-to-Text. The purpose of this tutorial is a deep dive into these areas, covering state-of-the-art techniques and models, and explaining how the progress in the deep learning field has led to impressive advancements. We will present benchmarks that sparked research and competition, and discuss open problems and research opportunities with one of the most important challenges being the integration of these 3 research areas into one conversational system.
Citations: 0
Interactive Demonstration of EVA
Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611626
Gaurav Tarlok Kakkar, Aryan Rajoria, Myna Prasanna Kalluraya, Ashmita Raju, Jiashen Cao, Kexin Rong, Joy Arulraj
In this demonstration, we will present EVA, an end-to-end AI-Relational database management system. We will demonstrate the capabilities and utility of EVA using three usage scenarios: (1) EVA serves as a backend for an exploratory video analytics interface developed using Streamlit and React, (2) EVA seamlessly integrates with the Python and Data Science ecosystems by allowing users to access EVA in a Python notebook alongside other popular libraries such as Pandas and Matplotlib, and (3) EVA facilitates bulk labeling with Label Studio, a widely-used labeling framework. By optimizing complex vision queries, we illustrate how EVA allows a wide range of application developers to harness the recent advances in computer vision.
Citations: 1
ScalarDB: Universal Transaction Manager for Polystores
Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611563
Hiroyuki Yamada, Toshihiro Suzuki, Yuji Ito, Jun Nemoto
This paper presents ScalarDB, a universal transaction manager that achieves distributed transactions across multiple disparate databases. ScalarDB provides a database-agnostic transaction manager on top of its database abstraction; thus, it achieves transactions spanning various databases without depending on the transactional capability of underlying databases. ScalarDB is based on several research works and extended to provide a strong correctness guarantee (i.e., strict serializability), further performance optimizations, and several critical mechanisms for productization. In this paper, we describe the design and implementation of ScalarDB. We also present evaluation results showing that ScalarDB achieves database-spanning transactions with reasonable performance and near-linear scalability without sacrificing correctness. Finally, we share some case studies and lessons learned while building and running ScalarDB.
Citations: 1
Lindorm TSDB: A Cloud-Native Time-Series Database for Large-Scale Monitoring Systems
Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611559
Chunhui Shen, Qianyu Ouyang, Feibo Li, Zhipeng Liu, Longcheng Zhu, Yujie Zou, Qing Su, Tianhuan Yu, Yi Yi, Jianhong Hu, Cen Zheng, Bo Wen, Hanbang Zheng, Lunfan Xu, Sicheng Pan, Bin Wu, Xiao He, Ye Li, Jian Tan, Sheng Wang, Dan Pei, Wei Zhang, Feifei Li
Internet services supported by large-scale distributed systems have become essential for our daily life. To ensure the stability and high quality of services, diverse metric data are constantly collected and managed in a time-series database to monitor the service status. However, when the number of metrics becomes massive, existing time-series databases are inefficient in handling high-rate data ingestion and queries hitting multiple metrics. Besides, they all lack the support of machine learning functions, which are crucial for sophisticated analysis of large-scale time series. In this paper, we present Lindorm TSDB, a distributed time-series database designed for handling monitoring metrics at scale. It sustains high write throughput and low query latency with massive active metrics. It also allows users to analyze data with anomaly detection and time series forecasting algorithms directly through SQL. Furthermore, Lindorm TSDB retains stable performance even during node scaling. We evaluate Lindorm TSDB under different data scales, and the results show that it outperforms two popular open-source time-series databases on both writing and query, while executing time-series machine learning tasks efficiently.
Citations: 0
PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent Reads
Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611562
Xinjun Yang, Yingqiang Zhang, Hao Chen, Chuan Sun, Feifei Li, Wenchao Zhou
A classic design of cloud-native databases adopts an architecture that consists of one read/write (RW) node and one or more read-only (RO) nodes. In such a design, the propagation of write-ahead logs (WALs) from the RW node to the RO node(s) is typically performed asynchronously. Consequently, system designers either have to accept a loose consistency guarantee, where a read from the RO node may return stale data, or tolerate significant performance degradation in terms of read latency, as it then needs to wait for the log to be propagated and applied. Most commercial cloud-native databases, such as Amazon Aurora, choose performance over strong consistency. As a result, it makes RO nodes useless for many applications requiring read-after-write consistency (a form of strong consistency), and the support for serverless databases (i.e., allowing the RO nodes to be scaled out automatically) is impossible as they require a single endpoint. This paper proposes PolarDB-SCC (PolarDB-Strongly Consistent Cluster), a cloud-native database architecture that guarantees strongly consistent reads with very low latency. The core idea is to eliminate unnecessary waits and reduce the necessary wait time on RO nodes while still supporting strong consistency. To achieve this, it tracks the RW node's modification timestamp at three progressively finer-grained levels. We further design a Linear Lamport timestamp to reduce the RO node's timestamp fetching operations and leverage the RDMA network for all the data transferring (e.g., timestamp fetching and log shipment) to minimize network overhead and extra CPU usage. Our evaluation shows that PolarDB-SCC does not incur any noticeable overhead for ensuring strongly consistent reads compared with the eventually consistent (stale) read policy. To the best of our knowledge, PolarDB-SCC is the first "read-write splitting" cloud-native database that supports strongly consistent read with negligible overhead. Compared with a straightforward read-wait design, PolarDB-SCC improves throughput by up to 4.51× and reduces median latency by up to 3.66× in SysBench's read-write workload. PolarDB-SCC is already commercially available at Alibaba Cloud.
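The idea of skipping unnecessary waits by checking modification timestamps from coarse to fine can be sketched as follows. This is an illustrative sketch only, not the PolarDB-SCC implementation: the abstract does not name the three levels, so the global/table/page split, the `RwTimestamps` layout, and the `must_wait` function are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class RwTimestamps:
    """Last-modification timestamps published by the RW node (hypothetical layout)."""

    global_ts: int = 0
    table_ts: Dict[str, int] = field(default_factory=dict)
    page_ts: Dict[Tuple[str, int], int] = field(default_factory=dict)

    def record_write(self, table: str, page: int, ts: int) -> None:
        """Record a write at all three assumed levels."""
        self.global_ts = max(self.global_ts, ts)
        self.table_ts[table] = max(self.table_ts.get(table, 0), ts)
        key = (table, page)
        self.page_ts[key] = max(self.page_ts.get(key, 0), ts)


def must_wait(rw: RwTimestamps, ro_applied_ts: int, table: str, page: int) -> bool:
    """Return True only if the RO node has not yet applied a write to this page."""
    if rw.global_ts <= ro_applied_ts:
        return False  # RO node has applied everything the RW node wrote
    if rw.table_ts.get(table, 0) <= ro_applied_ts:
        return False  # newer writes exist, but none touch this table
    if rw.page_ts.get((table, page), 0) <= ro_applied_ts:
        return False  # the table changed, but not this page
    return True  # this page has a write the RO node has not applied yet


rw = RwTimestamps()
rw.record_write("orders", page=7, ts=105)

ro_applied_ts = 100  # the RO node has applied the log up to timestamp 100
print(must_wait(rw, ro_applied_ts, "users", 1))   # False: different table
print(must_wait(rw, ro_applied_ts, "orders", 3))  # False: different page
print(must_wait(rw, ro_applied_ts, "orders", 7))  # True: stale page
```

Each coarser check is cheaper than the next, so in this sketch most reads return without consulting the finest level at all.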
Citations: 1
OceanBase Paetica: A Hybrid Shared-Nothing/Shared-Everything Database for Supporting Single Machine and Distributed Cluster
Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611560
Zhifeng Yang, Quanqing Xu, Shanyan Gao, Chuanhui Yang, Guoping Wang, Yuzhong Zhao, Fanyu Kong, Hao Liu, Wanhong Wang, Jinliang Xiao
In the ongoing evolution of the OceanBase database system, it is essential to enhance its adaptability to small-scale enterprises. The OceanBase database system has demonstrated its stability and effectiveness within Ant Group and other commercial organizations, as well as in the TPC-C and TPC-H tests. In this paper, we have designed a stand-alone and distributed integrated architecture named Paetica to address the overhead caused by the distributed components of the OceanBase system in stand-alone mode. Paetica enables adaptive configuration of the database that allows OceanBase to support both serial and parallel executions in stand-alone and distributed scenarios, thus providing efficiency and economy. This design has been implemented in version 4.0 of the OceanBase system, and the experiments show that Paetica exhibits notable scalability and outperforms alternative stand-alone or distributed databases. Furthermore, it enables the transition of OceanBase from primarily serving large enterprises to truly catering to small and medium enterprises, by employing a single OceanBase database for the successive stages of enterprise or business development, without the requirement for migration. Our experiments confirm that Paetica has achieved linear scalability with the increasing CPU core number within the stand-alone mode. It also outperforms MySQL and Greenplum in the Sysbench and TPC-H evaluations.
Citations: 0
The Story of AWS Glue
Zone 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611547
Mohit Saxena, Benjamin Sowell, Daiyan Alamgir, Nitin Bahadur, Bijay Bisht, Santosh Chandrachood, Chitti Keswani, G. Krishnamoorthy, Austin Lee, Bohou Li, Zach Mitchell, Vaibhav Porwal, Maheedhar Reddy Chappidi, Brian Ross, Noritaka Sekiyama, Omer Zaki, Linchi Zhang, Mehul A. Shah
AWS Glue is Amazon's serverless data integration cloud service that makes it simple and cost-effective to extract, clean, enrich, load, and organize data. Originally launched in August 2017, AWS Glue began as an extract-transform-load (ETL) service designed to relieve developers and data engineers of the undifferentiated heavy lifting needed to load databases and data warehouses and to build data lakes on Amazon S3. Since then, it has evolved to serve a larger audience, including ETL specialists and data scientists, and now includes a broader suite of data integration capabilities. Today, hundreds of thousands of customers use AWS Glue every month. In this paper, we describe the use cases and challenges cloud customers face in preparing data for analytics and the tenets we chose to drive Glue's design. We chose early on to focus on ease of use, scale, and extensibility. At its core, Glue offers serverless Apache Spark and Python engines backed by a purpose-built resource manager for fast startup and auto-scaling. In Spark, it offers a new data structure, DynamicFrames, for manipulating messy schema-free semi-structured data such as event logs; a variety of transformations and tooling to simplify data preparation; and a new shuffle plugin that offloads to cloud storage. It also includes a Hive Metastore-compatible Data Catalog with Glue crawlers to build and manage metadata, e.g., for data lakes on Amazon S3. Finally, Glue Studio is its visual interface for authoring Spark- and Python-based ETL jobs. We describe the innovations that differentiate AWS Glue and drive its popularity, and how it has evolved over the years.
Citations: 0
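The AWS Glue abstract above mentions a data structure for "messy schema-free semi-structured data such as event logs". As a rough illustration of what that problem looks like, the sketch below scans records that disagree on field types and then casts one ambiguous field to a single type. It is plain Python and does not use or reproduce Glue's API; the function names `infer_schema` and `resolve_choice` and the sample `events` are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def infer_schema(records):
    """Map each field name to the sorted list of type names observed for it."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def resolve_choice(records, field, cast):
    """Return copies of the records with `field` cast to one type.

    Values that cannot be cast become None; records lacking the field
    are copied unchanged. (Hypothetical helper, not a Glue function.)
    """
    resolved = []
    for record in records:
        out = dict(record)
        if field in out:
            try:
                out[field] = cast(out[field])
            except (TypeError, ValueError):
                out[field] = None
        resolved.append(out)
    return resolved


# Hypothetical event-log records with an inconsistent `latency_ms` field.
events = [
    {"user": "a", "latency_ms": 12},
    {"user": "b", "latency_ms": "15"},
    {"user": "c", "region": "eu"},
]

schema = infer_schema(events)
clean = resolve_choice(events, "latency_ms", int)
```

Here `schema` records that `latency_ms` was seen as both an integer and a string, which is the kind of ambiguity a fixed up-front schema cannot express; `clean` holds the same records with that field normalized.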
CDSBen: Benchmarking the Performance of Storage Services in Cloud-Native Database System at ByteDance
Zone 3, Computer Science; Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611549
Jiashu Zhang, Wen Jiang, Bo Tang, Haoxiang Ma, Lixun Cao, Zhongbin Jiang, Yuanyuan Nie, Fan Wang, Lei Zhang, Yuming Liang
In this work, we focus on the performance benchmarking problem of storage services in cloud-native database systems, which are widely used in various cloud applications. The core idea of these systems is to separate the computation and storage that are combined in traditional monolithic OLTP databases. Specifically, we first present the characteristics of two representative real I/O workloads at the storage tier of ByteDance's cloud-native database veDB. We then elaborate on the limitations of using standard benchmarks such as TPC-C and YCSB to emulate these workloads. To overcome these limitations, we devise a learning-based I/O workload benchmark called CDSBen. We demonstrate the superiority of CDSBen by deploying it at ByteDance and showing that its generated I/O traces accurately resemble the real I/O traces in production. Additionally, we verify the accuracy and flexibility of CDSBen by generating a wide range of I/O workloads with different I/O characteristics.
Citations: 0
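The CDSBen abstract above describes learning from real I/O traces in order to generate synthetic traces that resemble them. The abstract does not say which model CDSBen uses, so the sketch below swaps in a deliberately simple stand-in: a first-order Markov chain over operation types plus empirical request sizes per operation. The functions `fit` and `generate`, the `(op, size)` trace format, and the sample `real` trace are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def fit(trace):
    """Learn op-to-op transition counts and per-op request sizes.

    `trace` is a list of (op, size_in_bytes) pairs (assumed format).
    """
    transitions = defaultdict(Counter)
    sizes = defaultdict(list)
    for (op, _), (next_op, _) in zip(trace, trace[1:]):
        transitions[op][next_op] += 1
    for op, size in trace:
        sizes[op].append(size)
    return transitions, sizes


def generate(model, start_op, length, seed=0):
    """Sample a synthetic trace of `length` requests from the fitted model."""
    transitions, sizes = model
    rng = random.Random(seed)  # seeded so runs are repeatable
    op, out = start_op, []
    for _ in range(length):
        out.append((op, rng.choice(sizes[op])))
        nxt = transitions[op]
        if nxt:  # stay on the same op if no outgoing transition was observed
            op = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return out


# Hypothetical "real" trace; a production trace would be far longer.
real = [
    ("read", 4096),
    ("read", 4096),
    ("write", 16384),
    ("read", 8192),
    ("write", 16384),
]

synthetic = generate(fit(real), "read", 10)
```

A model this small only preserves the operation mix and size distribution of the input; capturing burstiness, address locality, or timing would need richer features than this stand-in uses.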