ACM Transactions on Database Systems (TODS): Latest Articles, Page 4

On the Language of Nested Tuple Generating Dependencies

ACM Transactions on Database Systems (TODS)

Pub Date : 2020-07-13 DOI: 10.1145/3369554

Phokion G. Kolaitis, R. Pichler, Emanuel Sallinger, V. Savenkov

During the past 15 years, schema mappings have been extensively used in formalizing and studying such critical data interoperability tasks as data exchange and data integration. Much of the work has focused on GLAV mappings, i.e., schema mappings specified by source-to-target tuple-generating dependencies (s-t tgds), and on schema mappings specified by second-order tgds (SO tgds), which constitute the closure of GLAV mappings under composition. In addition, nested GLAV mappings have been considered, i.e., schema mappings specified by nested tgds, which have expressive power intermediate between s-t tgds and SO tgds. Even though nested GLAV mappings have been used in data exchange systems, such as IBM’s Clio, no systematic investigation of this class of schema mappings has been carried out so far. In this article, we embark on such an investigation by focusing on the basic reasoning tasks, algorithmic problems, and structural properties of nested GLAV mappings. One of our main results is the decidability of the implication problem for nested tgds. We also analyze the structure of the core of universal solutions with respect to nested GLAV mappings and develop useful tools for telling apart SO tgds from nested tgds. By discovering deeper structural properties of nested GLAV mappings, we also show that the following problem is decidable: given a nested GLAV mapping, is it logically equivalent to a GLAV mapping?
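For orientation, the three classes of dependencies named in the abstract can be illustrated by their general shape. The relation names S, R, T, U and the function symbol f below are hypothetical, chosen only for illustration; these formulas are a sketch of the standard forms, not examples taken from the article.

```latex
% s-t tgd: a source premise implies a target conclusion with an existential variable.
\forall x \,\forall z \,\bigl( R(x,z) \rightarrow \exists y \, U(y,z) \bigr)

% Nested tgd: the conclusion may itself contain a universally quantified
% implication that shares the existential variable y with the outer level.
\forall x \,\Bigl( S(x) \rightarrow \exists y \,\bigl( T(x,y) \wedge
    \forall z \,( R(x,z) \rightarrow U(y,z) ) \bigr) \Bigr)

% An SO tgd expressing the same constraint, with y replaced by a function term f(x).
\exists f \,\forall x \,\forall z \,\Bigl( \bigl( S(x) \rightarrow T(x,f(x)) \bigr) \wedge
    \bigl( S(x) \wedge R(x,z) \rightarrow U(f(x),z) \bigr) \Bigr)
```

In the nested form, every z related to the same x is attached to the same witness y, which is the kind of correlation a single flat s-t tgd of the first form does not express.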

Citations: 2

Catching Numeric Inconsistencies in Graphs

ACM Transactions on Database Systems (TODS)

Pub Date : 2020-06-27 DOI: 10.1145/3385031

W. Fan, Xueli Liu, Ping Lu, Chao Tian

Numeric inconsistencies are common in real-life knowledge bases and social networks. To catch such errors, we extend graph functional dependencies with linear arithmetic expressions and built-in comparison predicates, referred to as numeric graph dependencies (NGDs). We study fundamental problems for NGDs. We show that their satisfiability, implication, and validation problems are Σ^p_2-complete, Π^p_2-complete, and coNP-complete, respectively. However, if we allow non-linear arithmetic expressions, even of degree at most 2, the satisfiability and implication problems become undecidable. In other words, NGDs strike a balance between expressivity and complexity. To make practical use of NGDs, we develop an incremental algorithm IncDect to detect errors in a graph G using NGDs in response to updates ΔG to G. We show that the incremental validation problem is coNP-complete. Nonetheless, algorithm IncDect is localizable, i.e., its cost is determined by small neighbors of nodes in ΔG instead of the entire G. Moreover, we parallelize IncDect such that it is guaranteed to reduce running time as the number of processors increases. In addition, to strike a balance between efficiency and accuracy, we also develop polynomial-time parallel algorithms for detection and incremental detection of top-ranked inconsistencies. Using real-life and synthetic graphs, we experimentally verify the scalability and efficiency of the algorithms.

Citations: 1

Succinct Range Filters

ACM Transactions on Database Systems (TODS)

Pub Date : 2020-06-21 DOI: 10.1145/3375660

Huanchen Zhang, Hyeontaek Lim, Viktor Leis, D. Andersen, M. Kaminsky, K. Keeton, Andrew Pavlo

We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries: open-range queries, closed-range queries, and range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the point and range query performance of state-of-the-art order-preserving indexes, while consuming only 10 bits per trie node. The false-positive rates in SuRF for both point and range queries are tunable to satisfy different application needs. We evaluate SuRF in RocksDB as a replacement for its Bloom filters to reduce I/O by filtering requests before they access on-disk data structures. Our experiments on a 100-GB dataset show that replacing RocksDB’s Bloom filters with SuRFs speeds up open-seek (without upper-bound) and closed-seek (with upper-bound) queries by up to 1.5× and 5×, respectively, at a modest cost to the worst-case (all-missing) point query throughput due to a slightly higher false-positive rate.
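To make the idea of an approximate range filter concrete, here is a minimal Python sketch. It is not SuRF and does not implement the Fast Succinct Trie: it only keeps a sorted list of truncated key prefixes, which is enough to show the one-sided error the abstract refers to (a "no" is always correct, a "maybe" can be a false positive). The class name `PrefixRangeFilter` and its methods are hypothetical.

```python
from bisect import bisect_left


class PrefixRangeFilter:
    """Toy range filter that stores only the first `prefix_len` characters of each key."""

    def __init__(self, keys, prefix_len):
        self.prefix_len = prefix_len
        # Sorted, de-duplicated prefixes; the full keys are discarded.
        self.prefixes = sorted({k[:prefix_len] for k in keys})

    def may_contain(self, key):
        """Point lookup: False means definitely absent, True means possibly present."""
        p = key[: self.prefix_len]
        i = bisect_left(self.prefixes, p)
        return i < len(self.prefixes) and self.prefixes[i] == p

    def may_contain_range(self, lo, hi):
        """Closed-range lookup over [lo, hi] with the same one-sided error."""
        lo_p = lo[: self.prefix_len]
        hi_p = hi[: self.prefix_len]
        # Truncation preserves lexicographic order (non-strictly), so any stored
        # key inside [lo, hi] has a prefix inside [lo_p, hi_p].
        i = bisect_left(self.prefixes, lo_p)
        return i < len(self.prefixes) and self.prefixes[i] <= hi_p


filt = PrefixRangeFilter(["apple", "apricot", "banana"], prefix_len=3)
```

With these three keys, `filt.may_contain_range("aq", "az")` is a definite miss, while `filt.may_contain_range("appz", "apq")` reports a possible hit even though no stored key lies in that range. A longer `prefix_len` lowers the false-positive rate at the price of more space, which mirrors, in a very simplified way, the tunable trade-off described above.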

Citations: 1

Balancing Expressiveness and Inexpressiveness in View Design

ACM Transactions on Database Systems (TODS)

Pub Date : 2020-06-01 DOI: 10.1145/3488370

Michael Benedikt, P. Bourhis, Louis Jachiet, Efthymia Tsamoura

We study the design of data publishing mechanisms that allow a collection of autonomous distributed data sources to collaborate to support queries. A common mechanism for data publishing is via views: functions that expose derived data to users, usually specified as declarative queries. Our autonomy assumption is that the views must be on individual sources, but with the intention of supporting integrated queries. In deciding what data to expose to users, two considerations must be balanced. The views must be sufficiently expressive to support the queries that users want to ask, which is the utility of the publishing mechanism. But there may also be some expressiveness restrictions. Here, we consider two restrictions: a minimal information requirement, saying that the views should reveal as little as possible while supporting the utility query, and a non-disclosure requirement, formalizing the need to prevent external users from computing information that data owners do not want revealed. We investigate the problem of designing views that satisfy both expressiveness and inexpressiveness requirements, for views in a restricted query language (conjunctive queries) and for arbitrary views.

Citations: 2

εKTELO

ACM Transactions on Database Systems (TODS)

Pub Date : 2020-02-08 DOI: 10.1145/3362032

Dan Zhang, Ryan McKenna, Ios Kotsogiannis, G. Bissias, Michael Hay, Ashwin Machanavajjhala, G. Miklau

The adoption of differential privacy is growing, but the complexity of designing private, efficient, and accurate algorithms is still high. We propose a novel programming framework and system, εKTELO, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming to one of a small number of operator classes. While past programming frameworks have helped to ensure the privacy of programs, the novelty of our framework is its significant support for authoring accurate and efficient (as well as private) programs. After describing the design and architecture of the εKTELO system, we show that εKTELO is expressive, allows for safer implementations through code reuse, and allows both privacy novices and experts to easily design algorithms. We provide a number of novel implementation techniques to support the generality and scalability of εKTELO operators. These include methods to automatically compute lossless reductions of the data representation, implicit matrices that avoid materialized state but still support computations, and iterative inference implementations that generalize techniques from the privacy literature. We demonstrate the utility of εKTELO by designing several new state-of-the-art algorithms, most of which result from simple re-combinations of operators defined in the framework. We study the accuracy and scalability of εKTELO plans in a thorough empirical evaluation.
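As background for the task the abstract mentions, answering linear counting queries under differential privacy, the sketch below applies the textbook Laplace mechanism to a small workload. It does not use or reproduce the εKTELO API; all function names are hypothetical, and it assumes that neighboring datasets differ in one record, so that exactly one histogram cell changes by 1.

```python
import random


def l1_sensitivity(workload):
    """Largest column L1 norm: the most one record can change the answer vector."""
    n_cols = len(workload[0])
    return max(sum(abs(row[j]) for row in workload) for j in range(n_cols))


def answer_exactly(workload, histogram):
    """Each query is a row of weights applied to the histogram of counts."""
    return [sum(w * x for w, x in zip(row, histogram)) for row in workload]


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def answer_privately(workload, histogram, epsilon, seed=None):
    """Laplace mechanism: add noise scaled to sensitivity / epsilon to every answer."""
    rng = random.Random(seed)
    scale = l1_sensitivity(workload) / epsilon
    return [a + laplace_noise(scale, rng) for a in answer_exactly(workload, histogram)]


histogram = [10, 20, 30]   # counts per cell (illustrative data)
workload = [
    [1, 1, 0],             # count of cells 0 and 1
    [0, 1, 1],             # count of cells 1 and 2
    [1, 1, 1],             # total count
]
noisy = answer_privately(workload, histogram, epsilon=1.0, seed=0)
```

Here the middle cell participates in all three queries, so the sensitivity is 3 and each answer receives Laplace noise of scale 3/ε. Choosing which queries to measure and how to post-process the noisy answers is where more elaborate algorithms differ from this baseline.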

Citations: 1

A Game-theoretic Approach to Data Interaction

ACM Transactions on Database Systems (TODS)

Pub Date : 2020-02-08 DOI: 10.1145/3351450

Ben McCamish, Vahid Ghadakchi, Arash Termehchy, B. Touri, E. Cotilla-Sánchez, Liang Huang, Soravit Changpinyo

As most users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. The database management system (DBMS) may interact with users and use their feedback on the returned results to learn the information needs behind their queries. Current query interfaces assume that users do not learn and modify the way they express their information needs in the form of queries during their interaction with the DBMS. Using a real-world interaction workload, we show that users learn and modify how to express their information needs during their interactions with the DBMS and their learning is accurately modeled by a well-known reinforcement learning mechanism. As current data interaction systems assume that users do not modify their strategies, they cannot discover the information needs behind users’ queries effectively. We model the interaction between the user and the DBMS as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in the form of queries. We propose a reinforcement learning method that learns and answers the information needs behind queries and adapts to the changes in users’ strategies, and we prove that it improves the effectiveness of answering queries, stochastically speaking. We propose two efficient implementations of this method over large relational databases. Our extensive empirical studies over real-world query workloads indicate that our algorithms are efficient and effective.

Citations: 0

Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems

ACM Transactions on Database Systems (TODS)

Pub Date : 2019-12-12 DOI: 10.1145/3360900

Feilong Liu, Lingyan Yin, Spyros Blanas

The commoditization of high-performance networking has sparked research interest in the RDMA capability of this hardware. One-sided RDMA primitives, in particular, have generated substantial excitement due to the ability to directly access remote memory from within an application without involving the TCP/IP stack or the remote CPU. This article considers how to leverage RDMA to improve the analytical performance of parallel database systems. To shuffle data efficiently using RDMA, one needs to consider a complex design space that includes (1) the number of open connections, (2) the contention for the shared network interface, (3) the RDMA transport function, and (4) how much memory should be reserved to exchange data between nodes during query processing. We contribute eight designs that capture salient tradeoffs in this design space as well as an adaptive algorithm to dynamically manage RDMA-registered memory. We comprehensively evaluate how transport-layer decisions impact the query performance of a database system for different generations of InfiniBand. We find that a shuffling operator that uses the RDMA Send/Receive transport function over the Unreliable Datagram transport service can transmit data up to 4× faster than an RDMA-capable MPI implementation in a 16-node cluster. The response time of TPC-H queries improves by as much as 2×.

Citations: 6

Dichotomies for Evaluating Simple Regular Path Queries

ACM Transactions on Database Systems (TODS)

Pub Date : 2019-10-15 DOI: 10.1145/3331446

W. Martens, T. Trautner

Regular path queries (RPQs) are a central component of graph databases. We investigate decision and enumeration problems concerning the evaluation of RPQs under several semantics that have recently been considered: arbitrary paths, shortest paths, paths without node repetitions (simple paths), and paths without edge repetitions (trails). Whereas arbitrary and shortest paths can be dealt with efficiently, simple paths and trails become computationally difficult already for very small RPQs. We study RPQ evaluation for simple paths and trails from a parameterized complexity perspective and define a class of simple transitive expressions that is prominent in practice and for which we can prove dichotomies for the evaluation problem. We observe that, even though simple path and trail semantics are intractable for RPQs in general, they are feasible for the vast majority of RPQs that are used in practice. At the heart of this study is a result of independent interest: the two disjoint paths problem in directed graphs is W[1]-hard if parameterized by the length of one of the two paths.
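The difference between the path semantics listed above can be seen on a three-node example. The following Python sketch is an illustration only, not the algorithms analyzed in the article: it takes the regular expression as a hand-written DFA, checks arbitrary-path semantics by a breadth-first search over (node, state) pairs, and checks simple-path semantics by backtracking. The function names, the graph, and the choice of the expression (aa)* are assumptions made for this example.

```python
from collections import deque


def rpq_arbitrary(edges, dfa, start, accepting, s, t):
    """Arbitrary-path semantics: BFS over the product of the graph and the DFA."""
    seen = {(s, start)}
    queue = deque(seen)
    while queue:
        node, state = queue.popleft()
        if node == t and state in accepting:
            return True
        for u, label, v in edges:
            if u == node and (state, label) in dfa:
                nxt = (v, dfa[(state, label)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False


def rpq_simple(edges, dfa, start, accepting, s, t):
    """Simple-path semantics: backtracking search that never revisits a node."""

    def dfs(node, state, visited):
        if node == t and state in accepting:
            return True
        for u, label, v in edges:
            if u == node and v not in visited and (state, label) in dfa:
                if dfs(v, dfa[(state, label)], visited | {v}):
                    return True
        return False

    return dfs(s, start, {s})


# A directed triangle 1 -> 2 -> 3 -> 1 with every edge labeled "a".
edges = [(1, "a", 2), (2, "a", 3), (3, "a", 1)]
# DFA for (aa)*: state 0 means an even number of a's so far and is accepting.
dfa = {(0, "a"): 1, (1, "a"): 0}

arbitrary_answer = rpq_arbitrary(edges, dfa, 0, {0}, 1, 2)
simple_answer = rpq_simple(edges, dfa, 0, {0}, 1, 2)
```

From node 1 to node 2, an even-length path exists only by going around the triangle once (1, 2, 3, 1, 2), which repeats nodes, so the query holds under arbitrary-path semantics but not under simple-path semantics. The first search visits each (node, state) pair at most once, whereas the backtracking search can explore exponentially many paths in the worst case. Trail semantics could be sketched the same way by tracking used edges instead of visited nodes.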

Citations: 17

ChronicleDB

ACM Transactions on Database Systems (TODS)

Pub Date : 2019-10-15 DOI: 10.1145/3342357

M. Seidemann, Nikolaus Glombiewski, Michael Körber, B. Seeger

Reactive security monitoring, self-driving cars, the Internet of Things (IoT), and many other novel applications require systems both for writing events that arrive at very high and fluctuating rates to persistent storage and for supporting analytical ad hoc queries. As standard database systems are not capable of delivering the required write performance, log-based systems, key-value stores, and other write-optimized data stores have emerged recently. However, the drawbacks of these systems are only fair query performance and the lack of suitable instant recovery mechanisms in case of system failures. In this article, we present ChronicleDB, a novel database system with a storage layout tailored for high write performance under fluctuating data rates and powerful indexing capabilities to support a variety of queries. In addition, ChronicleDB offers low-cost fault tolerance and instant recovery within milliseconds. Unlike previous work, ChronicleDB is designed either as a serverless library to be tightly integrated in an application or as a standalone database server. Our results of an experimental evaluation with real and synthetic data reveal that ChronicleDB clearly outperforms competing systems with respect to both write and query performance.

Citations: 5

Efficient Algorithms for Approximate Single-Source Personalized PageRank Queries

ACM Transactions on Database Systems (TODS)

Pub Date : 2019-08-28 DOI: 10.1145/3360902

Sibo Wang, Renchi Yang, Runhui Wang, Xiaokui Xiao, Zhewei Wei, Wenqing Lin, Y. Yang, N. Tang

Given a graph G, a source node s, and a target node t, the personalized PageRank (PPR) of t with respect to s is the probability that a random walk starting from s terminates at t. An important variant of the PPR query is single-source PPR (SSPPR), which enumerates all nodes in G and returns the top-k nodes with the highest PPR values with respect to a given source s. PPR in general and SSPPR in particular have important applications in web search and social networks, e.g., in Twitter’s Who-To-Follow recommendation service. However, PPR computation is known to be expensive on large graphs and resistant to indexing. Consequently, previous solutions either use heuristics, which do not guarantee result quality, or rely on the strong computing power of modern data centers, which is costly. Motivated by this, we propose effective index-free and index-based algorithms for approximate PPR processing, with rigorous guarantees on result quality. We first present FORA, an approximate SSPPR solution that combines two existing methods—Forward Push (which is fast but does not guarantee quality) and Monte Carlo Random Walk (accurate but slow)—in a simple and yet non-trivial way, leading to both high accuracy and efficiency. Further, FORA includes a simple and effective indexing scheme, as well as a module for top-k selection with high pruning power. Extensive experiments demonstrate that the proposed solutions are orders of magnitude more efficient than their respective competitors. Notably, on a billion-edge Twitter dataset, FORA answers a top-500 approximate SSPPR query within 1s, using a single commodity server.
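The combination described in the abstract can be illustrated with a small Python sketch: a forward-push phase spreads probability mass deterministically, and random walks then estimate what the leftover residues contribute. This is not the authors' implementation. The function names, the parameters `alpha`, `r_max`, and `walks`, the rule of `ceil(residue * walks)` walks per node, and the convention that a walk stops at a node without out-neighbors are all assumptions made for this example.

```python
import math
import random
from collections import defaultdict, deque


def forward_push(graph, source, alpha, r_max):
    """Deterministic phase: returns (reserve, residue) dictionaries."""
    reserve = defaultdict(float)   # mass already settled at each node
    residue = defaultdict(float)   # mass still waiting to be pushed
    residue[source] = 1.0
    queue, queued = deque([source]), {source}
    while queue:
        v = queue.popleft()
        queued.discard(v)
        out = graph.get(v, [])
        mass = residue[v]
        if not out:
            # Assumed convention: a walk at a dead end stops there.
            reserve[v] += mass
            residue[v] = 0.0
            continue
        if mass / len(out) <= r_max:
            continue  # residue too small to be worth pushing
        reserve[v] += alpha * mass
        residue[v] = 0.0
        share = (1.0 - alpha) * mass / len(out)
        for u in out:
            residue[u] += share
            if u not in queued:
                queue.append(u)
                queued.add(u)
    return reserve, residue


def random_walk(graph, start, alpha, rng):
    """Walk until termination (probability alpha per step) or a dead end."""
    v = start
    while True:
        out = graph.get(v, [])
        if not out or rng.random() < alpha:
            return v
        v = rng.choice(out)


def approx_ppr(graph, source, alpha=0.2, r_max=1e-3, walks=10000, seed=0):
    """Forward push followed by random walks from nodes with residue."""
    rng = random.Random(seed)
    reserve, residue = forward_push(graph, source, alpha, r_max)
    estimate = dict(reserve)
    for v, r in residue.items():
        if r <= 0.0:
            continue
        n_walks = math.ceil(r * walks)  # hypothetical walk budget per node
        for _ in range(n_walks):
            t = random_walk(graph, v, alpha, rng)
            estimate[t] = estimate.get(t, 0.0) + r / n_walks
    return estimate


def top_k(estimate, k):
    """Return the k nodes with the largest estimated PPR values."""
    return sorted(estimate.items(), key=lambda kv: kv[1], reverse=True)[:k]


if __name__ == "__main__":
    toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
    print(top_k(approx_ppr(toy, "a"), 2))
```

In this sketch a smaller `r_max` shifts work toward the deterministic push phase, while a larger one leaves more mass to be estimated by sampling; the indexing scheme and the top-k pruning module mentioned in the abstract are not modeled here.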

Citations: 33