Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611614
Daniel ten Wolde, Gábor Szárnyas, Peter Boncz
We demonstrate the most important new feature of SQL:2023, namely SQL/PGQ, which eases querying graphs using SQL by introducing new syntax for pattern matching and (shortest) path-finding. We show how support for SQL/PGQ can be integrated into an RDBMS, specifically in the DuckDB system, using an extension module called DuckPGQ. As such, we also demonstrate the use of the DuckDB extensibility mechanism, which allows us to add new functions, data types, operators, optimizer rules, storage systems, and even parsers to DuckDB. We also describe the new data structures and algorithms that the DuckPGQ module is based on, and how they are injected into SQL plans. While the demonstrated DuckPGQ extension module is lean and efficient, we sketch a roadmap to (i) improve its performance through new algorithms (factorized and WCOJ) and better parallelism and (ii) extend its functionality to scenarios beyond SQL, e.g., building and analyzing Graph Neural Networks.
{"title":"DuckPGQ: Bringing SQL/PGQ to DuckDB","authors":"Daniel ten Wolde, Gábor Szárnyas, Peter Boncz","doi":"10.14778/3611540.3611614","DOIUrl":"https://doi.org/10.14778/3611540.3611614","journal":{"name":"Proceedings of the VLDB Endowment"},"publicationDate":"2023-08-01"}
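To make the "(shortest) path-finding" part of the DuckPGQ abstract concrete, the following is a minimal Python sketch of what an unweighted shortest-path query computes over a relational edge table. It is not DuckPGQ's implementation or SQL/PGQ syntax; the `KNOWS` table contents and the function names `build_adjacency` and `shortest_path_length` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical edge table: one (source, destination) row per directed edge.
KNOWS = [(1, 2), (2, 3), (3, 4), (1, 5), (5, 4)]


def build_adjacency(edge_rows):
    """Group edge rows by source vertex, similar in spirit to an adjacency index."""
    adjacency = defaultdict(list)
    for src, dst in edge_rows:
        adjacency[src].append(dst)
    return adjacency


def shortest_path_length(edge_rows, start, goal):
    """Return the number of edges on a shortest directed path, or None if unreachable."""
    adjacency = build_adjacency(edge_rows)
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, hops = frontier.popleft()
        if vertex == goal:
            return hops
        for neighbour in adjacency[vertex]:
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return None
```

Breadth-first search is used here because it is the simplest correct method for unweighted graphs; the abstract does not say which algorithm the extension uses.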
Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611568
Krishna Kantikiran Pasupuleti, Jiakun Li, Hong Su, Mohamed Ziauddin
Despite best coding practices, software bugs are inevitable in a large codebase. In traditional databases, when errors occur during query processing, they disrupt user workflow until workarounds are found and applied. Manual identification of workarounds often relies on trial and error. The process is not only time-consuming but also requires domain expertise that users often lack. In this paper, we propose a framework to automatically mitigate errors that occur during query compilation (including optimization and code generation) without any user intervention. An error is intercepted by the database internally, a workaround is identified for it, and the query is recompiled using the workaround. The entire process remains transparent to the user, with the query being executed seamlessly. The proposed technique handles SQL errors during query compilation and provides three types of mitigation strategies: (i) quickly fail over to one of the readily available historical plans for the statement, (ii) apply targeted error-correcting directives (hints) identified from the optimizer context at the time of the error, and (iii) modify the global configuration of the optimizer using hints. This feature has been implemented and will be released in an upcoming version of Oracle Autonomous Database.
{"title":"Automatic SQL Error Mitigation in Oracle","authors":"Krishna Kantikiran Pasupuleti, Jiakun Li, Hong Su, Mohamed Ziauddin","doi":"10.14778/3611540.3611568","DOIUrl":"https://doi.org/10.14778/3611540.3611568","journal":{"name":"Proceedings of the VLDB Endowment"},"publicationDate":"2023-08-01"}
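The three mitigation strategies listed in the Oracle abstract can be read as a fallback chain. The sketch below is a hedged illustration of that idea, not Oracle's implementation: the order in which strategies are tried, the `compile_fn(sql, hints)` interface, the hint name `disable_feature_x`, and the `fake_compile` stand-in are all assumptions made for this example.

```python
class CompilationError(Exception):
    """Stand-in for an internal error raised during query compilation."""


def compile_with_mitigation(compile_fn, sql, historical_plans,
                            targeted_hints, global_hints):
    """Return (plan, strategy_name), trying workarounds if compilation fails."""
    try:
        return compile_fn(sql, ()), "original"
    except CompilationError:
        pass  # intercept the error instead of surfacing it to the user

    # (i) Fail over to a previously stored plan for the statement, if any.
    if historical_plans:
        return historical_plans[0], "historical_plan"

    # (ii) Recompile with targeted hints, then (iii) with global hints.
    for strategy, hints in (("targeted_hint", targeted_hints),
                            ("global_hint", global_hints)):
        for hint in hints:
            try:
                return compile_fn(sql, (hint,)), strategy
            except CompilationError:
                continue
    raise CompilationError("no workaround found for: " + sql)


def fake_compile(sql, hints):
    """Hypothetical compiler that only succeeds when one feature is disabled."""
    if "disable_feature_x" in hints:
        return "plan_without_feature_x"
    raise CompilationError("simulated optimizer bug")
```

For example, calling `compile_with_mitigation(fake_compile, "SELECT 1", [], ["disable_feature_x"], [])` recovers through the targeted-hint path, while supplying a stored plan short-circuits recompilation entirely.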
Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611575
George Katsogiannis-Meimarakis, Mike Xydas, Georgia Koutrika
In the age of the Digital Revolution, almost all human activities, from industrial and business operations to medical and academic research, are reliant on the constant integration and utilisation of ever-increasing volumes of data. However, the explosive volume and complexity of data make data querying and exploration challenging even for experts, and make the need to democratise access to data, even for non-technical users, all the more evident. It is time to lift all technical barriers by empowering users to access relational databases through conversation. We consider three main research areas that a natural language data interface is based on: Text-to-SQL, SQL-to-Text, and Data-to-Text. The purpose of this tutorial is a deep dive into these areas, covering state-of-the-art techniques and models, and explaining how progress in the deep learning field has led to impressive advancements. We will present benchmarks that sparked research and competition, and discuss open problems and research opportunities, with one of the most important challenges being the integration of these three research areas into one conversational system.
{"title":"Natural Language Interfaces for Databases with Deep Learning","authors":"George Katsogiannis-Meimarakis, Mike Xydas, Georgia Koutrika","doi":"10.14778/3611540.3611575","DOIUrl":"https://doi.org/10.14778/3611540.3611575","journal":{"name":"Proceedings of the VLDB Endowment"},"publicationDate":"2023-08-01"}
In this demonstration, we will present EVA, an end-to-end AI-Relational database management system. We will demonstrate the capabilities and utility of EVA using three usage scenarios: (1) EVA serves as a backend for an exploratory video analytics interface developed using Streamlit and React, (2) EVA seamlessly integrates with the Python and Data Science ecosystems by allowing users to access EVA in a Python notebook alongside other popular libraries such as Pandas and Matplotlib, and (3) EVA facilitates bulk labeling with Label Studio, a widely-used labeling framework. By optimizing complex vision queries, we illustrate how EVA allows a wide range of application developers to harness the recent advances in computer vision.
{"title":"Interactive Demonstration of EVA","authors":"Gaurav Tarlok Kakkar, Aryan Rajoria, Myna Prasanna Kalluraya, Ashmita Raju, Jiashen Cao, Kexin Rong, Joy Arulraj","doi":"10.14778/3611540.3611626","DOIUrl":"https://doi.org/10.14778/3611540.3611626","journal":{"name":"Proceedings of the VLDB Endowment"},"publicationDate":"2023-08-01"}
Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611563
Hiroyuki Yamada, Toshihiro Suzuki, Yuji Ito, Jun Nemoto
This paper presents ScalarDB, a universal transaction manager that achieves distributed transactions across multiple disparate databases. ScalarDB provides a database-agnostic transaction manager on top of its database abstraction; thus, it achieves transactions spanning various databases without depending on the transactional capability of underlying databases. ScalarDB is based on several research works and extended to provide a strong correctness guarantee (i.e., strict serializability), further performance optimizations, and several critical mechanisms for productization. In this paper, we describe the design and implementation of ScalarDB. We also present evaluation results showing that ScalarDB achieves database-spanning transactions with reasonable performance and near-linear scalability without sacrificing correctness. Finally, we share some case studies and lessons learned while building and running ScalarDB.
{"title":"ScalarDB: Universal Transaction Manager for Polystores","authors":"Hiroyuki Yamada, Toshihiro Suzuki, Yuji Ito, Jun Nemoto","doi":"10.14778/3611540.3611563","DOIUrl":"https://doi.org/10.14778/3611540.3611563","journal":{"name":"Proceedings of the VLDB Endowment"},"publicationDate":"2023-08-01"}
Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611559
Chunhui Shen, Qianyu Ouyang, Feibo Li, Zhipeng Liu, Longcheng Zhu, Yujie Zou, Qing Su, Tianhuan Yu, Yi Yi, Jianhong Hu, Cen Zheng, Bo Wen, Hanbang Zheng, Lunfan Xu, Sicheng Pan, Bin Wu, Xiao He, Ye Li, Jian Tan, Sheng Wang, Dan Pei, Wei Zhang, Feifei Li
Internet services supported by large-scale distributed systems have become essential for our daily life. To ensure the stability and high quality of services, diverse metric data are constantly collected and managed in a time-series database to monitor the service status. However, when the number of metrics becomes massive, existing time-series databases are inefficient in handling high-rate data ingestion and queries hitting multiple metrics. They also all lack support for machine learning functions, which are crucial for sophisticated analysis of large-scale time series. In this paper, we present Lindorm TSDB, a distributed time-series database designed for handling monitoring metrics at scale. It sustains high write throughput and low query latency with a massive number of active metrics. It also allows users to analyze data with anomaly detection and time series forecasting algorithms directly through SQL. Furthermore, Lindorm TSDB retains stable performance even during node scaling. We evaluate Lindorm TSDB under different data scales, and the results show that it outperforms two popular open-source time-series databases on both writes and queries, while executing time-series machine learning tasks efficiently.
{"title":"Lindorm TSDB: A Cloud-Native Time-Series Database for Large-Scale Monitoring Systems","authors":"Chunhui Shen, Qianyu Ouyang, Feibo Li, Zhipeng Liu, Longcheng Zhu, Yujie Zou, Qing Su, Tianhuan Yu, Yi Yi, Jianhong Hu, Cen Zheng, Bo Wen, Hanbang Zheng, Lunfan Xu, Sicheng Pan, Bin Wu, Xiao He, Ye Li, Jian Tan, Sheng Wang, Dan Pei, Wei Zhang, Feifei Li","doi":"10.14778/3611540.3611559","DOIUrl":"https://doi.org/10.14778/3611540.3611559","journal":{"name":"Proceedings of the VLDB Endowment"},"publicationDate":"2023-08-01"}
A classic design of cloud-native databases adopts an architecture that consists of one read/write (RW) node and one or more read-only (RO) nodes. In such a design, the propagation of write-ahead logs (WALs) from the RW node to the RO node(s) is typically performed asynchronously. Consequently, system designers either have to accept a loose consistency guarantee, where a read from the RO node may return stale data, or tolerate significant performance degradation in terms of read latency, as the read then needs to wait for the log to be propagated and applied. Most commercial cloud-native databases, such as Amazon Aurora, choose performance over strong consistency. This makes RO nodes useless for many applications requiring read-after-write consistency (a form of strong consistency), and makes support for serverless databases (i.e., allowing the RO nodes to be scaled out automatically) impossible, as they require a single endpoint. This paper proposes PolarDB-SCC (PolarDB-Strongly Consistent Cluster), a cloud-native database architecture that guarantees strongly consistent reads with very low latency. The core idea is to eliminate unnecessary waits and reduce the necessary wait time on RO nodes while still supporting strong consistency. To achieve this, it tracks the RW node's modification timestamp at three progressively finer-grained levels. We further design a Linear Lamport timestamp to reduce the RO node's timestamp fetching operations and leverage the RDMA network for all data transfers (e.g., timestamp fetching and log shipment) to minimize network overhead and extra CPU usage. Our evaluation shows that PolarDB-SCC does not incur any noticeable overhead for ensuring strongly consistent reads compared with the eventually consistent (stale) read policy. To the best of our knowledge, PolarDB-SCC is the first "read-write splitting" cloud-native database that supports strongly consistent reads with negligible overhead. 
Compared with a straightforward read-wait design, PolarDB-SCC improves throughput by up to 4.51× and reduces median latency by up to 3.66× in SysBench's read-write workload. PolarDB-SCC is already commercially available at Alibaba Cloud.
{"title":"PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent Reads","authors":"Xinjun Yang, Yingqiang Zhang, Hao Chen, Chuan Sun, Feifei Li, Wenchao Zhou","doi":"10.14778/3611540.3611562","DOIUrl":"https://doi.org/10.14778/3611540.3611562","journal":{"name":"Proceedings of the VLDB Endowment"},"publicationDate":"2023-08-01"}
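The PolarDB-SCC abstract says the RW node's modification timestamp is tracked at three progressively finer-grained levels so that RO nodes can skip unnecessary waits. The sketch below illustrates how such a check could work. It is an assumption-laden illustration rather than the paper's design: the abstract does not name the three levels, so "global", "table", and "page" are guesses, and `ModificationTimestamps`, `record_write`, and `must_wait` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ModificationTimestamps:
    """Last-modification timestamps kept by the RW node (levels are assumed)."""
    global_ts: int = 0
    table_ts: dict = field(default_factory=dict)
    page_ts: dict = field(default_factory=dict)

    def record_write(self, table, page, ts):
        """Raise the timestamp at every level touched by a write."""
        self.global_ts = max(self.global_ts, ts)
        self.table_ts[table] = max(self.table_ts.get(table, 0), ts)
        key = (table, page)
        self.page_ts[key] = max(self.page_ts.get(key, 0), ts)


def must_wait(rw, applied_ts, table, page):
    """Decide whether an RO node that has applied the log up to applied_ts
    must wait before serving a strongly consistent read of (table, page)."""
    if rw.global_ts <= applied_ts:
        return False  # nothing anywhere is newer than what the RO node has
    if rw.table_ts.get(table, 0) <= applied_ts:
        return False  # newer writes exist, but not in this table
    if rw.page_ts.get((table, page), 0) <= applied_ts:
        return False  # newer writes in this table, but not on this page
    return True       # this page has changes the RO node has not applied yet
```

Checking the coarse level first keeps the common case cheap, and the finer levels only matter when the coarse check cannot rule out staleness.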
In the ongoing evolution of the OceanBase database system, it is essential to enhance its adaptability to small-scale enterprises. The OceanBase database system has demonstrated its stability and effectiveness within the Ant Group and other commercial organizations, as well as in the TPC-C and TPC-H benchmarks. In this paper, we design an integrated stand-alone and distributed architecture named Paetica to address the overhead caused by the distributed components of OceanBase in stand-alone mode. Paetica enables adaptive configuration of the database that allows OceanBase to support both serial and parallel execution in stand-alone and distributed scenarios, thus providing efficiency and economy. This design has been implemented in version 4.0 of the OceanBase system, and the experiments show that Paetica exhibits notable scalability and outperforms alternative stand-alone or distributed databases. Furthermore, it enables the transition of OceanBase from primarily serving large enterprises to truly catering to small and medium enterprises, by employing a single OceanBase database for the successive stages of enterprise or business development, without the need for migration. Our experiments confirm that Paetica achieves linear scalability as the number of CPU cores increases in stand-alone mode. It also outperforms MySQL and Greenplum in the Sysbench and TPC-H evaluations.
{"title":"OceanBase Paetica: A Hybrid Shared-Nothing/Shared-Everything Database for Supporting Single Machine and Distributed Cluster","authors":"Zhifeng Yang, Quanqing Xu, Shanyan Gao, Chuanhui Yang, Guoping Wang, Yuzhong Zhao, Fanyu Kong, Hao Liu, Wanhong Wang, Jinliang Xiao","doi":"10.14778/3611540.3611560","DOIUrl":"https://doi.org/10.14778/3611540.3611560","journal":{"name":"Proceedings of the VLDB Endowment"},"publicationDate":"2023-08-01"}
Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611547
Mohit Saxena, Benjamin Sowell, Daiyan Alamgir, Nitin Bahadur, Bijay Bisht, Santosh Chandrachood, Chitti Keswani, G. Krishnamoorthy, Austin Lee, Bohou Li, Zach Mitchell, Vaibhav Porwal, Maheedhar Reddy Chappidi, Brian Ross, Noritaka Sekiyama, Omer Zaki, Linchi Zhang, Mehul A. Shah
AWS Glue is Amazon's serverless data integration cloud service that makes it simple and cost-effective to extract, clean, enrich, load, and organize data. Originally launched in August 2017, AWS Glue began as an extract-transform-load (ETL) service designed to relieve developers and data engineers of the undifferentiated heavy lifting needed to load databases and data warehouses and to build data lakes on Amazon S3. Since then, it has evolved to serve a larger audience, including ETL specialists and data scientists, and includes a broader suite of data integration capabilities. Today, hundreds of thousands of customers use AWS Glue every month. In this paper, we describe the use cases and challenges cloud customers face in preparing data for analytics and the tenets we chose to drive Glue's design. We chose early on to focus on ease of use, scale, and extensibility. At its core, Glue offers serverless Apache Spark and Python engines backed by a purpose-built resource manager for fast startup and auto-scaling. In Spark, it offers a new data structure, DynamicFrames, for manipulating messy schema-free semi-structured data such as event logs, a variety of transformations and tooling to simplify data preparation, and a new shuffle plugin to offload to cloud storage. It also includes a Hive Metastore-compatible Data Catalog with Glue crawlers to build and manage metadata, e.g., for data lakes on Amazon S3. Finally, Glue Studio is its visual interface for authoring Spark and Python-based ETL jobs. We describe the innovations that differentiate AWS Glue and drive its popularity and how it has evolved over the years.
{"title":"The Story of AWS Glue","authors":"Mohit Saxena, Benjamin Sowell, Daiyan Alamgir, Nitin Bahadur, Bijay Bisht, Santosh Chandrachood, Chitti Keswani, G. Krishnamoorthy, Austin Lee, Bohou Li, Zach Mitchell, Vaibhav Porwal, Maheedhar Reddy Chappidi, Brian Ross, Noritaka Sekiyama, Omer Zaki, Linchi Zhang, Mehul A. Shah","doi":"10.14778/3611540.3611547","DOIUrl":"https://doi.org/10.14778/3611540.3611547","journal":{"name":"Proceedings of the VLDB Endowment"},"publicationDate":"2023-08-01"}
Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611549
Jiashu Zhang, Wen Jiang, Bo Tang, Haoxiang Ma, Lixun Cao, Zhongbin Jiang, Yuanyuan Nie, Fan Wang, Lei Zhang, Yuming Liang
In this work, we focus on the performance benchmarking problem of storage services in cloud-native database systems, which are widely used in various cloud applications. The core idea of these systems is to separate computation and storage in traditional monolithic OLTP databases. Specifically, we first present the characteristics of two representative real I/O workloads at the storage tier of ByteDance's cloud-native database veDB. We then elaborate on the limitations of using standard benchmarks such as TPC-C and YCSB to resemble these workloads. To overcome these limitations, we devise a learning-based I/O workload benchmark called CDSBen. We demonstrate the superiority of CDSBen by deploying it at ByteDance and showing that its generated I/O traces accurately resemble the real I/O traces in production. Additionally, we verify the accuracy and flexibility of CDSBen by generating a wide range of I/O workloads with different I/O characteristics.
{"title":"CDSBen: Benchmarking the Performance of Storage Services in Cloud-Native Database System at ByteDance","authors":"Jiashu Zhang, Wen Jiang, Bo Tang, Haoxiang Ma, Lixun Cao, Zhongbin Jiang, Yuanyuan Nie, Fan Wang, Lei Zhang, Yuming Liang","doi":"10.14778/3611540.3611549","DOIUrl":"https://doi.org/10.14778/3611540.3611549","journal":{"name":"Proceedings of the VLDB Endowment"},"publicationDate":"2023-08-01"}