Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583154
Zhiguo Jiang, Hanhua Chen, Hai Jin
A graph stream refers to a continuous stream of edges, forming a huge and fast-evolving graph. The vast volume and high update speed of a graph stream place stringent requirements on the data management structure, including sublinear space cost, computation-efficient operation support, and scalability of the structure. Existing designs summarize a graph stream by leveraging a hash-based compressed matrix and representing an edge by its fingerprint, which achieves practical storage for a graph stream with a known upper bound on data volume. However, they fail to support dynamically extending graph streams. In this paper, we propose Auxo, a scalable structure that supports space/time-efficient summarization of dynamic graph streams. Auxo is built on a novel prefix embedded tree (PET), which leverages binary logarithmic search and common binary prefix embedding to provide an efficient and scalable tree structure. PET reduces the item insert/query time from O(|E|) to O(log |E|) and reduces the total storage cost by a factor of log |E|, where |E| is the size of the edge set in the graph stream. To further improve the memory utilization of PET during scaling, we propose a proportional PET structure that extends a higher level in a proportionally incremental style. We conduct comprehensive experiments on large-scale real-world datasets to evaluate the performance of this design. Results show that Auxo significantly reduces the insert and query time by one to two orders of magnitude compared to state-of-the-art designs. Meanwhile, Auxo achieves efficient and economical structure scaling, with an average memory utilization of over 80%.
{"title":"Auxo: A Scalable and Efficient Graph Stream Summarization Structure","authors":"Zhiguo Jiang, Hanhua Chen, Hai Jin","doi":"10.14778/3583140.3583154","DOIUrl":"https://doi.org/10.14778/3583140.3583154","url":null,"abstract":"A graph stream refers to a continuous stream of edges, forming a huge and fast-evolving graph. The vast volume and high update speed of a graph stream bring stringent requirements for the data management structure, including sublinear space cost, computation-efficient operation support, and scalability of the structure. Existing designs summarize a graph stream by leveraging a hash-based compressed matrix and representing an edge using its fingerprint to achieve practical storage for a graph stream with a known upper bound of data volume. However, they fail to support the dynamically extending of graph streams.\u0000 \u0000 In this paper, we propose Auxo, a scalable structure to support space/time efficient summarization of dynamic graph streams. Auxo is built on a proposed novel\u0000 prefix embedded tree\u0000 (PET) which leverages binary logarithmic search and common binary prefixes embedding to provide an efficient and scalable tree structure. PET reduces the item insert/query time from\u0000 O\u0000 (|\u0000 E\u0000 |) to\u0000 O\u0000 (\u0000 log\u0000 |\u0000 E\u0000 |) as well as reducing the total storage cost by a\u0000 log\u0000 |\u0000 E\u0000 | scale, where |\u0000 E\u0000 | is the size of the edge set in a graph stream. To further improve the memory utilization of PET during scaling, we propose a proportional PET structure that extends a higher level in a proportionally incremental style. We conduct comprehensive experiments on large-scale real-world datasets to evaluate the performance of this design. Results show that Auxo significantly reduces the insert and query time by one to two orders of magnitude compared to the state of the arts. Meanwhile, Auxo achieves efficiently and economically structure scaling with an average memory utilization of over 80%.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"68 1","pages":"1386-1398"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78220165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583161
Kitaek Lee, Insoon Jo, Jaechan Ahn, Hyuk Lee, Hwang Lee, Woong Sul, Hyungsoo Jung
Hybrid transactional/analytical processing (HTAP) can overload database systems. To alleviate performance interference between transactions and analytics, recent research pursues the potential of in-storage processing (ISP) using commodity computational storage devices (CSDs). However, in-storage query processing faces technical challenges in HTAP environments. Continuously updated data versions pose two hurdles: (1) data items keep changing, and (2) finding visible data versions incurs excessive data access in CSDs. Such access patterns dominate the cost of query processing, which may hinder the active deployment of CSDs. This paper addresses the core issues by proposing an analytic offload engine (AIDE) that transforms engine-specific query execution logic into vendor-neutral computation through a canonical interface. At the core of AIDE are the canonical representation of vendor-specific data and the separate management of data locators. It enables any CSD to execute vendor-neutral operations on canonical tuples with separate indexes, regardless of host databases. To eliminate excessive data access, we prescreen the indexes before offloading; thus, host-side prescreening can obviate the need for running costly version searching in CSDs and boost analytics. We implemented our prototype for PostgreSQL and MyRocks, demonstrating that AIDE supports efficient ISP for two databases using the same FPGA logic. Evaluation results show that AIDE improves query latency by up to 42× on PostgreSQL and 34× on MyRocks.
Deploying Computational Storage for HTAP DBMSs Takes More Than Just Computation Offloading. Proc. VLDB Endow., pages 1480-1493.
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583158
Xiang Li, Fabing Li, Mingyu Gao
As big data processing in the cloud becomes prevalent today, data privacy on such public platforms raises critical concerns. Hardware-based trusted execution environments (TEEs) provide promising and practical platforms for low-cost privacy-preserving data processing. However, using TEEs to enhance the security of data analytics frameworks like Apache Spark involves challenging issues when separating various framework components into trusted and untrusted domains, demanding meticulous considerations for programmability, performance, and security. Based on Intel SGX, we build Flare, a fast, secure, and memory-efficient data analytics framework with a familiar user programming interface and useful functionalities similar to Apache Spark. Flare ensures confidentiality and integrity by keeping sensitive data and computations encrypted and authenticated. It also supports oblivious processing to protect against access pattern side channels. The main innovations of Flare include a novel abstraction paradigm of shadow operators and shadow tasks to minimize trusted components and reduce domain switch overheads, memory-efficient data processing with proper granularities for different operators, and adaptive parallelization based on memory allocation intensity for better scalability. Flare outperforms the state-of-the-art secure framework by 3.0× to 176.1×, and is also 2.8× to 28.3× faster than a monolithic libOS-based integration approach.
{"title":"FLARE: A Fast, Secure, and Memory-Efficient Distributed Analytics Framework (Flavor: Systems)","authors":"Xiang Li, Fabing Li, Mingyu Gao","doi":"10.14778/3583140.3583158","DOIUrl":"https://doi.org/10.14778/3583140.3583158","url":null,"abstract":"As big data processing in the cloud becomes prevalent today, data privacy on such public platforms raises critical concerns. Hardware-based trusted execution environments (TEEs) provide promising and practical platforms for low-cost privacy-preserving data processing. However, using TEEs to enhance the security of data analytics frameworks like Apache Spark involves challenging issues when separating various framework components into trusted and untrusted domains, demanding meticulous considerations for programmability, performance, and security.\u0000 Based on Intel SGX, we build Flare, a fast, secure, and memory-efficient data analytics framework with a familiar user programming interface and useful functionalities similar to Apache Spark. Flare ensures confidentiality and integrity by keeping sensitive data and computations encrypted and authenticated. It also supports oblivious processing to protect against access pattern side channels. The main innovations of Flare include a novel abstraction paradigm of shadow operators and shadow tasks to minimize trusted components and reduce domain switch overheads, memory-efficient data processing with proper granularities for different operators, and adaptive parallelization based on memory allocation intensity for better scalability. Flare outperforms the state-of-the-art secure framework by 3.0× to 176.1×, and is also 2.8× to 28.3× faster than a monolithic libOS-based integration approach.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"65 1","pages":"1439-1452"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75833064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583149
Yushi Sun, Hao Xin, Lei Chen
Understanding the semantics of tabular data is of great importance in various downstream applications, such as schema matching, data cleaning, and data integration. Column semantic type annotation is a critical task in the semantic understanding of tabular data. Although various approaches have been proposed, they struggle to handle wide tables and to incorporate complex inter-table context information. Failure to handle wide tables limits the usage of column type annotation approaches, while failure to incorporate inter-table context harms the annotation quality. Existing methods either completely ignore these problems or propose ad-hoc solutions. In this paper, we propose the Related tables Enhanced Column semantic type Annotation framework (RECA), which incorporates inter-table context information by finding and aligning schema-similar and topic-relevant tables based on a novel named entity schema. The design of RECA can naturally handle wide tables and incorporate useful inter-table context information to enhance the annotation quality. We conduct extensive experiments on two web table datasets to comprehensively evaluate the performance of RECA. Our results show that RECA achieves support-weighted F1 scores of 0.853 and 0.937 with macro average F1 scores of 0.674 and 0.783 on the two datasets respectively, outperforming the state-of-the-art methods.
{"title":"RECA: Related Tables Enhanced Column Semantic Type Annotation Framework","authors":"Yushi Sun, Hao Xin, Lei Chen","doi":"10.14778/3583140.3583149","DOIUrl":"https://doi.org/10.14778/3583140.3583149","url":null,"abstract":"Understanding the semantics of tabular data is of great importance in various downstream applications, such as schema matching, data cleaning, and data integration. Column semantic type annotation is a critical task in the semantic understanding of tabular data. Despite the fact that various approaches have been proposed, they are challenged by the difficulties of handling wide tables and incorporating complex inter-table context information. Failure to handle wide tables limits the usage of column type annotation approaches, while failure to incorporate inter-table context harms the annotation quality. Existing methods either completely ignore these problems or propose ad-hoc solutions. In this paper, we propose Related tables Enhanced Column semantic type Annotation framework (RECA), which incorporates inter-table context information by finding and aligning schema-similar and topic-relevant tables based on a novel named entity schema. The design of RECA can naturally handle wide tables and incorporate useful inter-table context information to enhance the annotation quality. We conduct extensive experiments on two web table datasets to comprehensively evaluate the performance of RECA. Our results show that RECA achieves support-weighted F1 scores of 0.853 and 0.937 with macro average F1 scores of 0.674 and 0.783 on the two datasets respectively, which outperform the state-of-the-art methods.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"54 1","pages":"1319-1331"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77607388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583147
Fuheng Zhao, Punnal Ismail Khan, D. Agrawal, A. E. Abbadi, Arpit Gupta, Zaoxing Liu
System operators are often interested in extracting different feature streams from multi-dimensional data streams and reporting their distributions at regular intervals, including the heavy hitters that contribute to the tail portion of the feature distribution. Satisfying these requirements at increasing data rates with limited resources is challenging. This paper presents the design and implementation of Panakos, which makes the best use of available resources to accurately report a given feature's distribution, its tail contributors, and other stream statistics (e.g., cardinality, entropy, etc.). Our key idea is to leverage the skewness inherent to most feature streams in the real world. We leverage this skewness by disentangling the feature stream into hot, warm, and cold items based on their feature values. We then use different data structures for tracking objects in each category. Panakos provides solid theoretical guarantees and achieves high performance for various tasks. We have implemented Panakos in both software and hardware and compared it to other state-of-the-art sketches using synthetic and real-world datasets. The experimental results demonstrate that Panakos often achieves one order of magnitude better accuracy than state-of-the-art solutions for a given memory budget.
{"title":"Panakos: Chasing the Tails for Multidimensional Data Streams","authors":"Fuheng Zhao, Punnal Ismail Khan, D. Agrawal, A. E. Abbadi, Arpit Gupta, Zaoxing Liu","doi":"10.14778/3583140.3583147","DOIUrl":"https://doi.org/10.14778/3583140.3583147","url":null,"abstract":"System operators are often interested in extracting different feature streams from multi-dimensional data streams; and reporting their distributions at regular intervals, including the heavy hitters that contribute to the tail portion of the feature distribution. Satisfying these requirements to increase data rates with limited resources is challenging. This paper presents the design and implementation of Panakos that makes the best use of available resources to report a given feature's distribution accurately, its tail contributors, and other stream statistics (e.g., cardinality, entropy, etc.). Our key idea is to leverage the skewness inherent to most feature streams in the real world. We leverage this skewness by disentangling the feature stream into hot, warm, and cold items based on their feature values. We then use different data structures for tracking objects in each category. Panakos provides solid theoretical guarantees and achieves high performance for various tasks. We have implemented Panakos on both software and hardware and compared Panakos to other state-of-the-art sketches using synthetic and real-world datasets. The experimental results demonstrate that Panakos often achieves one order of magnitude better accuracy than the state-of-the-art solutions for a given memory budget.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"51 1","pages":"1291-1304"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86150305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583156
Alexander van Renen, Viktor Leis
The cloud facilitates the transition to a service-oriented perspective. This affects cloud-native data management in general, and data analytics in particular. Instead of managing a multi-node database cluster on-premise, end users simply send queries to a managed cloud data warehouse and receive results. While this is obviously very attractive for end users, database system architects still have to engineer systems for this new service model. There are currently many competing architectures, ranging from self-hosted (Presto, PostgreSQL) through managed (Snowflake, Amazon Redshift) to query-as-a-service (Amazon Athena, Google BigQuery) offerings. Benchmarking these architectural approaches is currently difficult, and it is not even clear what the metrics for a comparison should be. To overcome these challenges, we first analyze a real-world query trace from Snowflake and compare its properties to those of TPC-H and TPC-DS. In doing so, we identify important differences that distinguish traditional benchmarks from real-world cloud data warehouse workloads. Based on this analysis, we propose the Cloud Analytics Benchmark (CAB). By incorporating workload fluctuations and multi-tenancy, CAB allows evaluating different designs in terms of user-centered metrics such as cost and performance.
{"title":"Cloud Analytics Benchmark","authors":"Alexander van Renen, Viktor Leis","doi":"10.14778/3583140.3583156","DOIUrl":"https://doi.org/10.14778/3583140.3583156","url":null,"abstract":"The cloud facilitates the transition to a service-oriented perspective. This affects cloud-native data management in general, and data analytics in particular. Instead of managing a multi-node database cluster on-premise, end users simply send queries to a managed cloud data warehouse and receive results. While this is obviously very attractive for end users, database system architects still have to engineer systems for this new service model. There are currently many competing architectures ranging from self-hosted (Presto, PostgreSQL), over managed (Snowflake, Amazon Redshift) to query-as-a-service (Amazon Athena, Google BigQuery) offerings. Benchmarking these architectural approaches is currently difficult, and it is not even clear what the metrics for a comparison should be.\u0000 To overcome these challenges, we first analyze a real-world query trace from Snowflake and compare its properties to that of TPC-H and TPC-DS. Doing so, we identify important differences that distinguish traditional benchmarks from real-world cloud data warehouse workloads. Based on this analysis, we propose the Cloud Analytics Benchmark (CAB). By incorporating workload fluctuations and multi-tenancy, CAB allows evaluating different designs in terms of user-centered metrics such as cost and performance.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"14 1","pages":"1413-1425"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85801530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583165
Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, Jianling Sun
Natural language to SQL (NL2SQL) techniques provide a convenient interface to access databases, especially for non-expert users, to conduct various data analytics. Existing methods often employ either a rule-based approach or a deep learning based solution. The former is hard to generalize across different domains. Though the latter generalizes well, it often results in queries with syntactic or semantic errors and thus may not even be executable. In this work, we bridge the gap between the two and design a new framework to significantly improve both accuracy and runtime. In particular, we develop a novel CatSQL sketch, which constructs a template with slots that initially serve as placeholders, and tightly integrates with a deep learning model to fill in these slots with meaningful contents based on the database schema. Compared with the widely used sequence-to-sequence-based approaches, our sketch-based method does not need to generate keywords that are boilerplate in the template, and can achieve better accuracy and run much faster. Compared with the existing sketch-based approaches, our CatSQL sketch is more general and versatile, and can leverage the values already filled in on certain slots to derive the remaining ones for improved performance. In addition, we propose the Semantics Correction technique, which is the first to leverage database domain knowledge in a deep learning based NL2SQL solution. Semantics Correction is a post-processing routine that checks the initially generated SQL queries by applying rules to identify and correct semantic errors. This technique significantly improves NL2SQL accuracy. We conduct extensive evaluations on both single-domain and cross-domain benchmarks and demonstrate that our approach significantly outperforms previous ones in terms of both accuracy and throughput. In particular, on the state-of-the-art NL2SQL benchmark Spider, our CatSQL prototype outperforms the best of the previous solutions by 4 points in accuracy, while still achieving a throughput up to 63 times higher.
{"title":"CatSQL: Towards Real World Natural Language to SQL Applications","authors":"Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, Jianling Sun","doi":"10.14778/3583140.3583165","DOIUrl":"https://doi.org/10.14778/3583140.3583165","url":null,"abstract":"\u0000 Natural language to SQL (NL2SQL) techniques provide a convenient interface to access databases, especially for non-expert users, to conduct various data analytics. Existing methods often employ either a rule-base approach or a deep learning based solution. The former is hard to generalize across different domains. Though the latter generalizes well, it often results in queries with syntactic or semantic errors, thus may be even not executable. In this work, we bridge the gap between the two and design a new framework to significantly improve both accuracy and runtime. In particular, we develop a novel\u0000 CatSQL\u0000 sketch, which constructs a template with slots that initially serve as placeholders, and tightly integrates with a deep learning model to fill in these slots with meaningful contents based on the database schema. Compared with the widely used sequence-to-sequence-based approaches, our sketch-based method does not need to generate keywords which are boilerplates in the template, and can achieve better accuracy and run much faster. Compared with the existing sketch-based approaches, our\u0000 CatSQL\u0000 sketch is more general and versatile, and can leverage the values already filled in on certain slots to derive the rest ones for improved performance. In addition, we propose the\u0000 Semantics Correction\u0000 technique, which is the first that leverages database domain knowledge in a deep learning based NL2SQL solution.\u0000 Semantics Correction\u0000 is a post-processing routine, which checks the initially generated SQL queries by applying rules to identify and correct semantic errors. This technique significantly improves the NL2SQL accuracy. We conduct extensive evaluations on both single-domain and cross-domain benchmarks and demonstrate that our approach significantly outperforms the previous ones in terms of both accuracy and throughput. In particular, on the state-of-the-art NL2SQL benchmark Spider, our\u0000 CatSQL\u0000 prototype outperforms the best of the previous solutions by 4 points on accuracy, while still achieving a throughput up to 63 times higher.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"82 1","pages":"1534-1547"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83479088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583159
Mijin An, Jonghyeok Park, Tianzheng Wang, Beomseok Nam, Sang-Won Lee
When running OLTP workloads, relational DBMSs with flash SSDs still suffer from the durability overhead. Heavy writes to the SSD not only limit performance but also shorten the storage lifespan. To mitigate the durability overhead, this paper proposes a new database architecture, NV-SQL. NV-SQL aims at absorbing a large fraction of the writes from DRAM to SSD by introducing NVDIMM into the memory hierarchy as a durable write cache. On the new architecture, NV-SQL makes two technical contributions. First, it proposes the re-update interval-based admission policy, which determines which write-hot pages qualify for being cached in NVDIMM. It is novel in that page hotness is based solely on pages' LSNs. Second, this study finds that NVDIMM-resident pages can violate page action consistency upon a crash, and proposes how to detect inconsistent pages using a per-page in-update flag and how to rectify them using the redo log. NV-SQL demonstrates how ARIES-like logging and recovery techniques can be elegantly extended to support caching and recovery for NVDIMM data. Additionally, by placing the write-intensive redo buffer and DWB in NVDIMM, NV-SQL eliminates the log-force-at-commit and WAL protocols and further halves the writes to storage. Our NV-SQL prototype running with a real NVDIMM device outperforms same-priced vanilla MySQL with larger DRAM severalfold in terms of transaction throughput for write-intensive OLTP benchmarks. This confirms that NV-SQL is a cost-performance-efficient solution to the durability problem.
{"title":"NV-SQL: Boosting OLTP Performance with Non-Volatile DIMMs","authors":"Mijin An, Jonghyeok Park, Tianzheng Wang, Beomseok Nam, Sang-Won Lee","doi":"10.14778/3583140.3583159","DOIUrl":"https://doi.org/10.14778/3583140.3583159","url":null,"abstract":"\u0000 When running OLTP workloads, relational DBMSs with flash SSDs still suffer from the\u0000 durability overhead.\u0000 Heavy writes to SSD not only limit the performance but also shorten the storage lifespan. To mitigate the durability overhead, this paper proposes a new database architecture, NV-SQL. NV-SQL aims at absorbing a large fraction of writes written from DRAM to SSD by introducing NVDIMM into the memory hierarchy as a durable write cache. On the new architecture, NV-SQL makes two technical contributions. First, it proposes the\u0000 re-update interval-based admission policy\u0000 that determines which write-hot pages qualify for being cached in NVDIMM. It is novel in that the page hotness is based solely on pages' LSN. Second, this study finds that NVDIMM-resident pages can violate the\u0000 page action consistency\u0000 upon crash and proposes how to detect inconsistent pages using per-page in-update flag and how to rectify them using the redo log. NV-SQL demonstrates how the ARIES-like logging and recovery techniques can be elegantly extended to support the caching and recovery for NVDIMM data. Additionally, by placing write-intensive redo buffer and DWB in NVDIMM, NV-SQL eliminates the\u0000 log-force-at-commit\u0000 and\u0000 WAL\u0000 protocols and further halves the writes to the storage. Our NV-SQL prototype running with a real NVDIMM device outperforms the\u0000 same-priced\u0000 vanilla MySQL with larger DRAM by several folds in terms of transaction throughput for write-intensive OLTP benchmarks. This confirms that NV-SQL is a cost-performance efficient solution to the durability problem.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"37 1","pages":"1453-1465"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79385368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583157
Adnan Alhomssi, Viktor Leis
MVCC-based snapshot isolation promises that read queries can proceed without interfering with concurrent writes. However, as we show experimentally, in existing implementations a single long-running query can easily cause transactional throughput to collapse. Moreover, existing out-of-memory commit protocols fail to meet the scalability needs of modern multi-core systems. In this paper, we present three complementary techniques for robust and scalable snapshot isolation in out-of-memory systems. First, we propose a commit protocol that minimizes cross-thread communication for better scalability, avoids touching the write set on commit, and enables efficient fine-granular garbage collection. Second, we introduce the Graveyard Index, an auxiliary data structure that moves logically-deleted tuples out of the way of operational transactions. Third, we present an adaptive version storage scheme that enables fast garbage collection and improves scan performance of frequently-modified tuples. All techniques are engineered to scale well on multi-core processors, and together enable robust performance for complex hybrid workloads.
{"title":"Scalable and Robust Snapshot Isolation for High-Performance Storage Engines","authors":"Adnan Alhomssi, Viktor Leis","doi":"10.14778/3583140.3583157","DOIUrl":"https://doi.org/10.14778/3583140.3583157","url":null,"abstract":"MVCC-based snapshot isolation promises that read queries can proceed without interfering with concurrent writes. However, as we show experimentally, in existing implementations a single long-running query can easily cause transactional throughput to collapse. Moreover, existing out-of-memory commit protocols fail to meet the scalability needs of modern multi-core systems. In this paper, we present three complementary techniques for robust and scalable snapshot isolation in out-of-memory systems. First, we propose a commit protocol that minimizes cross-thread communication for better scalability, avoids touching the write set on commit, and enables efficient fine-granular garbage collection. Second, we introduce the Graveyard Index, an auxiliary data structure that moves logically-deleted tuples out of the way of operational transactions. Third, we present an adaptive version storage scheme that enables fast garbage collection and improves scan performance of frequently-modified tuples. All techniques are engineered to scale well on multi-core processors, and together enable robust performance for complex hybrid workloads.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"69 1","pages":"1426-1438"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79935453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583150
Yiming Li, Yanyan Shen, Lei Chen, Mingxuan Yuan
Temporal graph neural networks (T-GNNs) are state-of-the-art methods for learning representations over dynamic graphs. Despite the superior performance, T-GNNs still suffer from high computational complexity caused by the tedious recursive temporal message passing scheme, which hinders their applicability to large dynamic graphs. To address the problem, we build the theoretical connection between the temporal message passing scheme adopted by T-GNNs and the temporal random walk process on dynamic graphs. Our theoretical analysis indicates that it would be possible to select a few influential temporal neighbors to compute a target node's representation without compromising the predictive performance. Based on this finding, we propose to utilize T-PPR, a parameterized metric for estimating the influence score of nodes on evolving graphs. We further develop an efficient single-scan algorithm to answer the top-k T-PPR query with rigorous approximation guarantees. Finally, we present Zebra, a scalable framework that accelerates the computation of T-GNN by directly aggregating the features of the most prominent temporal neighbors returned by the top-k T-PPR query. Extensive experiments have validated that Zebra can be up to two orders of magnitude faster than the state-of-the-art T-GNNs while attaining better performance.
{"title":"Zebra: When Temporal Graph Neural Networks Meet Temporal Personalized PageRank","authors":"Yiming Li, Yanyan Shen, Lei Chen, Mingxuan Yuan","doi":"10.14778/3583140.3583150","DOIUrl":"https://doi.org/10.14778/3583140.3583150","url":null,"abstract":"\u0000 Temporal graph neural networks (T-GNNs) are state-of-the-art methods for learning representations over dynamic graphs. Despite the superior performance, T-GNNs still suffer from high computational complexity caused by the tedious recursive temporal message passing scheme, which hinders their applicability to large dynamic graphs. To address the problem, we build the theoretical connection between the temporal message passing scheme adopted by T-GNNs and the temporal random walk process on dynamic graphs. Our theoretical analysis indicates that it would be possible to select a few influential temporal neighbors to compute a target node's representation without compromising the predictive performance. Based on this finding, we propose to utilize T-PPR, a parameterized metric for estimating the influence score of nodes on evolving graphs. We further develop an efficient single-scan algorithm to answer the top-\u0000 k\u0000 T-PPR query with rigorous approximation guarantees. Finally, we present Zebra, a scalable framework that accelerates the computation of T-GNN by directly aggregating the features of the most prominent temporal neighbors returned by the top-\u0000 k\u0000 T-PPR query. Extensive experiments have validated that Zebra can be up to two orders of magnitude faster than the state-of-the-art T-GNNs while attaining better performance.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"15 1","pages":"1332-1345"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79501587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}