Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583154
Zhiguo Jiang, Hanhua Chen, Hai Jin
A graph stream refers to a continuous stream of edges, forming a huge and fast-evolving graph. The vast volume and high update speed of a graph stream place stringent requirements on the data management structure, including sublinear space cost, computation-efficient operation support, and scalability of the structure. Existing designs summarize a graph stream by leveraging a hash-based compressed matrix and representing an edge by its fingerprint, which achieves practical storage for a graph stream with a known upper bound on data volume. However, they fail to support dynamically extending graph streams. In this paper, we propose Auxo, a scalable structure that supports space/time-efficient summarization of dynamic graph streams. Auxo is built on a novel prefix embedded tree (PET), which leverages binary logarithmic search and common binary prefix embedding to provide an efficient and scalable tree structure. PET reduces the item insert/query time from O(|E|) to O(log |E|) and reduces the total storage cost by a factor of log |E|, where |E| is the size of the edge set in the graph stream. To further improve the memory utilization of PET during scaling, we propose a proportional PET structure that extends a higher level in a proportionally incremental style. We conduct comprehensive experiments on large-scale real-world datasets to evaluate the performance of this design. Results show that Auxo significantly reduces the insert and query time by one to two orders of magnitude compared to state-of-the-art designs. Meanwhile, Auxo achieves efficient and economical structure scaling, with an average memory utilization of over 80%.
{"title":"Auxo: A Scalable and Efficient Graph Stream Summarization Structure","authors":"Zhiguo Jiang, Hanhua Chen, Hai Jin","doi":"10.14778/3583140.3583154","DOIUrl":"https://doi.org/10.14778/3583140.3583154","url":null,"abstract":"A graph stream refers to a continuous stream of edges, forming a huge and fast-evolving graph. The vast volume and high update speed of a graph stream bring stringent requirements for the data management structure, including sublinear space cost, computation-efficient operation support, and scalability of the structure. Existing designs summarize a graph stream by leveraging a hash-based compressed matrix and representing an edge using its fingerprint to achieve practical storage for a graph stream with a known upper bound of data volume. However, they fail to support the dynamically extending of graph streams.\u0000 \u0000 In this paper, we propose Auxo, a scalable structure to support space/time efficient summarization of dynamic graph streams. Auxo is built on a proposed novel\u0000 prefix embedded tree\u0000 (PET) which leverages binary logarithmic search and common binary prefixes embedding to provide an efficient and scalable tree structure. PET reduces the item insert/query time from\u0000 O\u0000 (|\u0000 E\u0000 |) to\u0000 O\u0000 (\u0000 log\u0000 |\u0000 E\u0000 |) as well as reducing the total storage cost by a\u0000 log\u0000 |\u0000 E\u0000 | scale, where |\u0000 E\u0000 | is the size of the edge set in a graph stream. To further improve the memory utilization of PET during scaling, we propose a proportional PET structure that extends a higher level in a proportionally incremental style. We conduct comprehensive experiments on large-scale real-world datasets to evaluate the performance of this design. Results show that Auxo significantly reduces the insert and query time by one to two orders of magnitude compared to the state of the arts. Meanwhile, Auxo achieves efficiently and economically structure scaling with an average memory utilization of over 80%.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"68 1","pages":"1386-1398"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78220165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583161
Kitaek Lee, Insoon Jo, Jaechan Ahn, Hyuk Lee, Hwang Lee, Woong Sul, Hyungsoo Jung
Hybrid transactional/analytical processing (HTAP) can overload database systems. To alleviate performance interference between transactions and analytics, recent research pursues the potential of in-storage processing (ISP) using commodity computational storage devices (CSDs). However, in-storage query processing faces technical challenges in HTAP environments. Continuously updated data versions pose two hurdles: (1) data items keep changing, and (2) finding visible data versions incurs excessive data access in CSDs. Such access patterns dominate the cost of query processing, which may hinder the active deployment of CSDs. This paper addresses the core issues by proposing an analytic offload engine (AIDE) that transforms engine-specific query execution logic into vendor-neutral computation through a canonical interface. At the core of AIDE are the canonical representation of vendor-specific data and the separate management of data locators. It enables any CSD to execute vendor-neutral operations on canonical tuples with separate indexes, regardless of host databases. To eliminate excessive data access, we prescreen the indexes before offloading; thus, host-side prescreening can obviate the need for running costly version searching in CSDs and boost analytics. We implemented our prototype for PostgreSQL and MyRocks, demonstrating that AIDE supports efficient ISP for two databases using the same FPGA logic. Evaluation results show that AIDE improves query latency by up to 42× on PostgreSQL and 34× on MyRocks.
Deploying Computational Storage for HTAP DBMSs Takes More Than Just Computation Offloading. Proc. VLDB Endow., pages 1480-1493.
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583158
Xiang Li, Fabing Li, Mingyu Gao
As big data processing in the cloud becomes prevalent today, data privacy on such public platforms raises critical concerns. Hardware-based trusted execution environments (TEEs) provide promising and practical platforms for low-cost privacy-preserving data processing. However, using TEEs to enhance the security of data analytics frameworks like Apache Spark involves challenging issues when separating various framework components into trusted and untrusted domains, demanding meticulous considerations for programmability, performance, and security. Based on Intel SGX, we build Flare, a fast, secure, and memory-efficient data analytics framework with a familiar user programming interface and useful functionalities similar to Apache Spark. Flare ensures confidentiality and integrity by keeping sensitive data and computations encrypted and authenticated. It also supports oblivious processing to protect against access pattern side channels. The main innovations of Flare include a novel abstraction paradigm of shadow operators and shadow tasks to minimize trusted components and reduce domain switch overheads, memory-efficient data processing with proper granularities for different operators, and adaptive parallelization based on memory allocation intensity for better scalability. Flare outperforms the state-of-the-art secure framework by 3.0× to 176.1×, and is also 2.8× to 28.3× faster than a monolithic libOS-based integration approach.
{"title":"FLARE: A Fast, Secure, and Memory-Efficient Distributed Analytics Framework (Flavor: Systems)","authors":"Xiang Li, Fabing Li, Mingyu Gao","doi":"10.14778/3583140.3583158","DOIUrl":"https://doi.org/10.14778/3583140.3583158","url":null,"abstract":"As big data processing in the cloud becomes prevalent today, data privacy on such public platforms raises critical concerns. Hardware-based trusted execution environments (TEEs) provide promising and practical platforms for low-cost privacy-preserving data processing. However, using TEEs to enhance the security of data analytics frameworks like Apache Spark involves challenging issues when separating various framework components into trusted and untrusted domains, demanding meticulous considerations for programmability, performance, and security.\u0000 Based on Intel SGX, we build Flare, a fast, secure, and memory-efficient data analytics framework with a familiar user programming interface and useful functionalities similar to Apache Spark. Flare ensures confidentiality and integrity by keeping sensitive data and computations encrypted and authenticated. It also supports oblivious processing to protect against access pattern side channels. The main innovations of Flare include a novel abstraction paradigm of shadow operators and shadow tasks to minimize trusted components and reduce domain switch overheads, memory-efficient data processing with proper granularities for different operators, and adaptive parallelization based on memory allocation intensity for better scalability. Flare outperforms the state-of-the-art secure framework by 3.0× to 176.1×, and is also 2.8× to 28.3× faster than a monolithic libOS-based integration approach.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"65 1","pages":"1439-1452"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75833064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583149
Yushi Sun, Hao Xin, Lei Chen
Understanding the semantics of tabular data is of great importance in various downstream applications, such as schema matching, data cleaning, and data integration. Column semantic type annotation is a critical task in the semantic understanding of tabular data. Although various approaches have been proposed, they struggle to handle wide tables and to incorporate complex inter-table context information. Failure to handle wide tables limits the usage of column type annotation approaches, while failure to incorporate inter-table context harms the annotation quality. Existing methods either completely ignore these problems or propose ad-hoc solutions. In this paper, we propose the Related tables Enhanced Column semantic type Annotation framework (RECA), which incorporates inter-table context information by finding and aligning schema-similar and topic-relevant tables based on a novel named entity schema. The design of RECA can naturally handle wide tables and incorporate useful inter-table context information to enhance the annotation quality. We conduct extensive experiments on two web table datasets to comprehensively evaluate the performance of RECA. Our results show that RECA achieves support-weighted F1 scores of 0.853 and 0.937 with macro average F1 scores of 0.674 and 0.783 on the two datasets respectively, outperforming the state-of-the-art methods.
{"title":"RECA: Related Tables Enhanced Column Semantic Type Annotation Framework","authors":"Yushi Sun, Hao Xin, Lei Chen","doi":"10.14778/3583140.3583149","DOIUrl":"https://doi.org/10.14778/3583140.3583149","url":null,"abstract":"Understanding the semantics of tabular data is of great importance in various downstream applications, such as schema matching, data cleaning, and data integration. Column semantic type annotation is a critical task in the semantic understanding of tabular data. Despite the fact that various approaches have been proposed, they are challenged by the difficulties of handling wide tables and incorporating complex inter-table context information. Failure to handle wide tables limits the usage of column type annotation approaches, while failure to incorporate inter-table context harms the annotation quality. Existing methods either completely ignore these problems or propose ad-hoc solutions. In this paper, we propose Related tables Enhanced Column semantic type Annotation framework (RECA), which incorporates inter-table context information by finding and aligning schema-similar and topic-relevant tables based on a novel named entity schema. The design of RECA can naturally handle wide tables and incorporate useful inter-table context information to enhance the annotation quality. We conduct extensive experiments on two web table datasets to comprehensively evaluate the performance of RECA. Our results show that RECA achieves support-weighted F1 scores of 0.853 and 0.937 with macro average F1 scores of 0.674 and 0.783 on the two datasets respectively, which outperform the state-of-the-art methods.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"54 1","pages":"1319-1331"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77607388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583147
Fuheng Zhao, Punnal Ismail Khan, D. Agrawal, A. E. Abbadi, Arpit Gupta, Zaoxing Liu
System operators are often interested in extracting different feature streams from multi-dimensional data streams and reporting their distributions at regular intervals, including the heavy hitters that contribute to the tail portion of the feature distribution. Satisfying these requirements at increasing data rates with limited resources is challenging. This paper presents the design and implementation of Panakos, which makes the best use of available resources to accurately report a given feature's distribution, its tail contributors, and other stream statistics (e.g., cardinality, entropy, etc.). Our key idea is to leverage the skewness inherent to most feature streams in the real world. We leverage this skewness by disentangling the feature stream into hot, warm, and cold items based on their feature values. We then use different data structures for tracking objects in each category. Panakos provides solid theoretical guarantees and achieves high performance for various tasks. We have implemented Panakos in both software and hardware and compared it to other state-of-the-art sketches using synthetic and real-world datasets. The experimental results demonstrate that Panakos often achieves one order of magnitude better accuracy than state-of-the-art solutions for a given memory budget.
{"title":"Panakos: Chasing the Tails for Multidimensional Data Streams","authors":"Fuheng Zhao, Punnal Ismail Khan, D. Agrawal, A. E. Abbadi, Arpit Gupta, Zaoxing Liu","doi":"10.14778/3583140.3583147","DOIUrl":"https://doi.org/10.14778/3583140.3583147","url":null,"abstract":"System operators are often interested in extracting different feature streams from multi-dimensional data streams; and reporting their distributions at regular intervals, including the heavy hitters that contribute to the tail portion of the feature distribution. Satisfying these requirements to increase data rates with limited resources is challenging. This paper presents the design and implementation of Panakos that makes the best use of available resources to report a given feature's distribution accurately, its tail contributors, and other stream statistics (e.g., cardinality, entropy, etc.). Our key idea is to leverage the skewness inherent to most feature streams in the real world. We leverage this skewness by disentangling the feature stream into hot, warm, and cold items based on their feature values. We then use different data structures for tracking objects in each category. Panakos provides solid theoretical guarantees and achieves high performance for various tasks. We have implemented Panakos on both software and hardware and compared Panakos to other state-of-the-art sketches using synthetic and real-world datasets. The experimental results demonstrate that Panakos often achieves one order of magnitude better accuracy than the state-of-the-art solutions for a given memory budget.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"51 1","pages":"1291-1304"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86150305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583156
Alexander van Renen, Viktor Leis
The cloud facilitates the transition to a service-oriented perspective. This affects cloud-native data management in general, and data analytics in particular. Instead of managing a multi-node database cluster on-premise, end users simply send queries to a managed cloud data warehouse and receive results. While this is obviously very attractive for end users, database system architects still have to engineer systems for this new service model. There are currently many competing architectures, ranging from self-hosted (Presto, PostgreSQL) through managed (Snowflake, Amazon Redshift) to query-as-a-service (Amazon Athena, Google BigQuery) offerings. Benchmarking these architectural approaches is currently difficult, and it is not even clear what the metrics for a comparison should be. To overcome these challenges, we first analyze a real-world query trace from Snowflake and compare its properties to those of TPC-H and TPC-DS. In doing so, we identify important differences that distinguish traditional benchmarks from real-world cloud data warehouse workloads. Based on this analysis, we propose the Cloud Analytics Benchmark (CAB). By incorporating workload fluctuations and multi-tenancy, CAB allows evaluating different designs in terms of user-centered metrics such as cost and performance.
{"title":"Cloud Analytics Benchmark","authors":"Alexander van Renen, Viktor Leis","doi":"10.14778/3583140.3583156","DOIUrl":"https://doi.org/10.14778/3583140.3583156","url":null,"abstract":"The cloud facilitates the transition to a service-oriented perspective. This affects cloud-native data management in general, and data analytics in particular. Instead of managing a multi-node database cluster on-premise, end users simply send queries to a managed cloud data warehouse and receive results. While this is obviously very attractive for end users, database system architects still have to engineer systems for this new service model. There are currently many competing architectures ranging from self-hosted (Presto, PostgreSQL), over managed (Snowflake, Amazon Redshift) to query-as-a-service (Amazon Athena, Google BigQuery) offerings. Benchmarking these architectural approaches is currently difficult, and it is not even clear what the metrics for a comparison should be.\u0000 To overcome these challenges, we first analyze a real-world query trace from Snowflake and compare its properties to that of TPC-H and TPC-DS. Doing so, we identify important differences that distinguish traditional benchmarks from real-world cloud data warehouse workloads. Based on this analysis, we propose the Cloud Analytics Benchmark (CAB). By incorporating workload fluctuations and multi-tenancy, CAB allows evaluating different designs in terms of user-centered metrics such as cost and performance.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"14 1","pages":"1413-1425"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85801530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583165
Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, Jianling Sun
Natural language to SQL (NL2SQL) techniques provide a convenient interface to access databases, especially for non-expert users, to conduct various data analytics. Existing methods often employ either a rule-based approach or a deep learning based solution. The former is hard to generalize across different domains. Though the latter generalizes well, it often results in queries with syntactic or semantic errors and thus may not even be executable. In this work, we bridge the gap between the two and design a new framework to significantly improve both accuracy and runtime. In particular, we develop a novel CatSQL sketch, which constructs a template with slots that initially serve as placeholders, and tightly integrates with a deep learning model to fill in these slots with meaningful contents based on the database schema. Compared with the widely used sequence-to-sequence-based approaches, our sketch-based method does not need to generate keywords that are boilerplate in the template, and can achieve better accuracy and run much faster. Compared with the existing sketch-based approaches, our CatSQL sketch is more general and versatile, and can leverage the values already filled in on certain slots to derive the remaining ones for improved performance. In addition, we propose the Semantics Correction technique, which is the first to leverage database domain knowledge in a deep learning based NL2SQL solution. Semantics Correction is a post-processing routine that checks the initially generated SQL queries by applying rules to identify and correct semantic errors. This technique significantly improves NL2SQL accuracy. We conduct extensive evaluations on both single-domain and cross-domain benchmarks and demonstrate that our approach significantly outperforms previous ones in terms of both accuracy and throughput. In particular, on the state-of-the-art NL2SQL benchmark Spider, our CatSQL prototype outperforms the best of the previous solutions by 4 points in accuracy, while still achieving a throughput up to 63 times higher.
{"title":"CatSQL: Towards Real World Natural Language to SQL Applications","authors":"Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, Jianling Sun","doi":"10.14778/3583140.3583165","DOIUrl":"https://doi.org/10.14778/3583140.3583165","url":null,"abstract":"\u0000 Natural language to SQL (NL2SQL) techniques provide a convenient interface to access databases, especially for non-expert users, to conduct various data analytics. Existing methods often employ either a rule-base approach or a deep learning based solution. The former is hard to generalize across different domains. Though the latter generalizes well, it often results in queries with syntactic or semantic errors, thus may be even not executable. In this work, we bridge the gap between the two and design a new framework to significantly improve both accuracy and runtime. In particular, we develop a novel\u0000 CatSQL\u0000 sketch, which constructs a template with slots that initially serve as placeholders, and tightly integrates with a deep learning model to fill in these slots with meaningful contents based on the database schema. Compared with the widely used sequence-to-sequence-based approaches, our sketch-based method does not need to generate keywords which are boilerplates in the template, and can achieve better accuracy and run much faster. Compared with the existing sketch-based approaches, our\u0000 CatSQL\u0000 sketch is more general and versatile, and can leverage the values already filled in on certain slots to derive the rest ones for improved performance. In addition, we propose the\u0000 Semantics Correction\u0000 technique, which is the first that leverages database domain knowledge in a deep learning based NL2SQL solution.\u0000 Semantics Correction\u0000 is a post-processing routine, which checks the initially generated SQL queries by applying rules to identify and correct semantic errors. This technique significantly improves the NL2SQL accuracy. We conduct extensive evaluations on both single-domain and cross-domain benchmarks and demonstrate that our approach significantly outperforms the previous ones in terms of both accuracy and throughput. In particular, on the state-of-the-art NL2SQL benchmark Spider, our\u0000 CatSQL\u0000 prototype outperforms the best of the previous solutions by 4 points on accuracy, while still achieving a throughput up to 63 times higher.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"82 1","pages":"1534-1547"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83479088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583159
Mijin An, Jonghyeok Park, Tianzheng Wang, Beomseok Nam, Sang-Won Lee
When running OLTP workloads, relational DBMSs with flash SSDs still suffer from the durability overhead. Heavy writes to the SSD not only limit performance but also shorten the storage lifespan. To mitigate the durability overhead, this paper proposes a new database architecture, NV-SQL. NV-SQL aims at absorbing a large fraction of the writes from DRAM to SSD by introducing NVDIMM into the memory hierarchy as a durable write cache. On the new architecture, NV-SQL makes two technical contributions. First, it proposes the re-update interval-based admission policy, which determines which write-hot pages qualify for being cached in NVDIMM. It is novel in that page hotness is based solely on pages' LSNs. Second, this study finds that NVDIMM-resident pages can violate page action consistency upon a crash, and proposes how to detect inconsistent pages using a per-page in-update flag and how to rectify them using the redo log. NV-SQL demonstrates how ARIES-like logging and recovery techniques can be elegantly extended to support caching and recovery for NVDIMM data. Additionally, by placing the write-intensive redo buffer and DWB in NVDIMM, NV-SQL eliminates the log-force-at-commit and WAL protocols and further halves the writes to storage. Our NV-SQL prototype running with a real NVDIMM device outperforms same-priced vanilla MySQL with larger DRAM severalfold in terms of transaction throughput for write-intensive OLTP benchmarks. This confirms that NV-SQL is a cost-performance-efficient solution to the durability problem.
{"title":"NV-SQL: Boosting OLTP Performance with Non-Volatile DIMMs","authors":"Mijin An, Jonghyeok Park, Tianzheng Wang, Beomseok Nam, Sang-Won Lee","doi":"10.14778/3583140.3583159","DOIUrl":"https://doi.org/10.14778/3583140.3583159","url":null,"abstract":"\u0000 When running OLTP workloads, relational DBMSs with flash SSDs still suffer from the\u0000 durability overhead.\u0000 Heavy writes to SSD not only limit the performance but also shorten the storage lifespan. To mitigate the durability overhead, this paper proposes a new database architecture, NV-SQL. NV-SQL aims at absorbing a large fraction of writes written from DRAM to SSD by introducing NVDIMM into the memory hierarchy as a durable write cache. On the new architecture, NV-SQL makes two technical contributions. First, it proposes the\u0000 re-update interval-based admission policy\u0000 that determines which write-hot pages qualify for being cached in NVDIMM. It is novel in that the page hotness is based solely on pages' LSN. Second, this study finds that NVDIMM-resident pages can violate the\u0000 page action consistency\u0000 upon crash and proposes how to detect inconsistent pages using per-page in-update flag and how to rectify them using the redo log. NV-SQL demonstrates how the ARIES-like logging and recovery techniques can be elegantly extended to support the caching and recovery for NVDIMM data. Additionally, by placing write-intensive redo buffer and DWB in NVDIMM, NV-SQL eliminates the\u0000 log-force-at-commit\u0000 and\u0000 WAL\u0000 protocols and further halves the writes to the storage. Our NV-SQL prototype running with a real NVDIMM device outperforms the\u0000 same-priced\u0000 vanilla MySQL with larger DRAM by several folds in terms of transaction throughput for write-intensive OLTP benchmarks. This confirms that NV-SQL is a cost-performance efficient solution to the durability problem.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"37 1","pages":"1453-1465"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79385368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583157
Adnan Alhomssi, Viktor Leis
MVCC-based snapshot isolation promises that read queries can proceed without interfering with concurrent writes. However, as we show experimentally, in existing implementations a single long-running query can easily cause transactional throughput to collapse. Moreover, existing out-of-memory commit protocols fail to meet the scalability needs of modern multi-core systems. In this paper, we present three complementary techniques for robust and scalable snapshot isolation in out-of-memory systems. First, we propose a commit protocol that minimizes cross-thread communication for better scalability, avoids touching the write set on commit, and enables efficient fine-granular garbage collection. Second, we introduce the Graveyard Index, an auxiliary data structure that moves logically-deleted tuples out of the way of operational transactions. Third, we present an adaptive version storage scheme that enables fast garbage collection and improves scan performance of frequently-modified tuples. All techniques are engineered to scale well on multi-core processors, and together enable robust performance for complex hybrid workloads.
{"title":"Scalable and Robust Snapshot Isolation for High-Performance Storage Engines","authors":"Adnan Alhomssi, Viktor Leis","doi":"10.14778/3583140.3583157","DOIUrl":"https://doi.org/10.14778/3583140.3583157","url":null,"abstract":"MVCC-based snapshot isolation promises that read queries can proceed without interfering with concurrent writes. However, as we show experimentally, in existing implementations a single long-running query can easily cause transactional throughput to collapse. Moreover, existing out-of-memory commit protocols fail to meet the scalability needs of modern multi-core systems. In this paper, we present three complementary techniques for robust and scalable snapshot isolation in out-of-memory systems. First, we propose a commit protocol that minimizes cross-thread communication for better scalability, avoids touching the write set on commit, and enables efficient fine-granular garbage collection. Second, we introduce the Graveyard Index, an auxiliary data structure that moves logically-deleted tuples out of the way of operational transactions. Third, we present an adaptive version storage scheme that enables fast garbage collection and improves scan performance of frequently-modified tuples. All techniques are engineered to scale well on multi-core processors, and together enable robust performance for complex hybrid workloads.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"69 1","pages":"1426-1438"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79935453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-01 | DOI: 10.14778/3583140.3583150
Yiming Li, Yanyan Shen, Lei Chen, Mingxuan Yuan
Temporal graph neural networks (T-GNNs) are state-of-the-art methods for learning representations over dynamic graphs. Despite the superior performance, T-GNNs still suffer from high computational complexity caused by the tedious recursive temporal message passing scheme, which hinders their applicability to large dynamic graphs. To address the problem, we build the theoretical connection between the temporal message passing scheme adopted by T-GNNs and the temporal random walk process on dynamic graphs. Our theoretical analysis indicates that it would be possible to select a few influential temporal neighbors to compute a target node's representation without compromising the predictive performance. Based on this finding, we propose to utilize T-PPR, a parameterized metric for estimating the influence score of nodes on evolving graphs. We further develop an efficient single-scan algorithm to answer the top-k T-PPR query with rigorous approximation guarantees. Finally, we present Zebra, a scalable framework that accelerates the computation of T-GNN by directly aggregating the features of the most prominent temporal neighbors returned by the top-k T-PPR query. Extensive experiments have validated that Zebra can be up to two orders of magnitude faster than the state-of-the-art T-GNNs while attaining better performance.
{"title":"Zebra: When Temporal Graph Neural Networks Meet Temporal Personalized PageRank","authors":"Yiming Li, Yanyan Shen, Lei Chen, Mingxuan Yuan","doi":"10.14778/3583140.3583150","DOIUrl":"https://doi.org/10.14778/3583140.3583150","url":null,"abstract":"\u0000 Temporal graph neural networks (T-GNNs) are state-of-the-art methods for learning representations over dynamic graphs. Despite the superior performance, T-GNNs still suffer from high computational complexity caused by the tedious recursive temporal message passing scheme, which hinders their applicability to large dynamic graphs. To address the problem, we build the theoretical connection between the temporal message passing scheme adopted by T-GNNs and the temporal random walk process on dynamic graphs. Our theoretical analysis indicates that it would be possible to select a few influential temporal neighbors to compute a target node's representation without compromising the predictive performance. Based on this finding, we propose to utilize T-PPR, a parameterized metric for estimating the influence score of nodes on evolving graphs. We further develop an efficient single-scan algorithm to answer the top-\u0000 k\u0000 T-PPR query with rigorous approximation guarantees. Finally, we present Zebra, a scalable framework that accelerates the computation of T-GNN by directly aggregating the features of the most prominent temporal neighbors returned by the top-\u0000 k\u0000 T-PPR query. Extensive experiments have validated that Zebra can be up to two orders of magnitude faster than the state-of-the-art T-GNNs while attaining better performance.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"15 1","pages":"1332-1345"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79501587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}