Dependencies for Graphs
W. Fan, Ping Lu. ACM Transactions on Database Systems (TODS), pp. 1-40, 2019-02-13. DOI: 10.1145/3287285.
This article proposes a class of dependencies for graphs, referred to as graph entity dependencies (GEDs). A GED is defined as a combination of a graph pattern and an attribute dependency. In a uniform format, GEDs can express graph functional dependencies with constant literals to catch inconsistencies, and keys carrying id literals to identify entities (vertices) in a graph. We revise the chase for GEDs and prove its Church-Rosser property. We characterize GED satisfiability and implication, and establish the complexity of these problems and of the validation problem for GEDs, in the presence and absence of constant literals and id literals. We also develop a sound, complete, and independent axiom system for finite implication of GEDs. In addition, we extend GEDs with built-in predicates or disjunctions, to strike a balance between expressive power and complexity. We settle the complexity of the satisfiability, implication, and validation problems for these extensions.
Wander Join and XDB
Feifei Li, Bin Wu, K. Yi, Zhuoyue Zhao. ACM Transactions on Database Systems (TODS), pp. 1-41, 2019-01-29. DOI: 10.1145/3284551.
Joins are expensive, and online aggregation over joins was proposed to mitigate this cost: it offers users a flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and relies on unrealistic assumptions (e.g., that tuples in a table are stored in random order). This article proposes a new approach to online aggregation, the wander join algorithm, which performs random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Compared with ripple join, wander join is particularly efficient for equality joins involving multiple tables, but it also supports θ-joins. Selection predicates and group-by clauses can be handled as well. To demonstrate the usefulness of wander join, we have designed and implemented XDB (approXimate DB) by integrating wander join into various systems, including PostgreSQL, Spark, and a stand-alone plug-in version using PL/SQL. The design and implementation of XDB demonstrate wander join’s practicality in a full-fledged database system. Extensive experiments using the TPC-H benchmark demonstrate the superior performance of wander join over ripple join.
Historic Moments Discovery in Sequence Data
Ran Bai, W. Hon, Eric Lo, Zhian He, Kenny Q. Zhu. ACM Transactions on Database Systems (TODS), pp. 1-33, 2019-01-29. DOI: 10.1145/3276975.
Many emerging applications are based on finding interesting subsequences in sequence data. Finding “prominent streaks,” a set of the longest contiguous subsequences with values all above (or below) a certain threshold, is one such problem that has received much attention. Motivated by real applications, we observe that prominent streaks alone are not insightful enough and call for the discovery of what we coin “historic moments” as companions. In this article, we present an algorithm to efficiently compute historic moments from sequence data. The algorithm is incremental and space optimal: when new data arrive, it can efficiently refresh the results while keeping only minimal information. Case studies show that historic moments can significantly improve the insights offered by prominent streaks alone. Furthermore, experiments show that our algorithm outperforms the baseline in both time and space.
Representations and Optimizations for Embedded Parallel Dataflow Languages
Alexander B. Alexandrov, Georgi Krastev, V. Markl. ACM Transactions on Database Systems (TODS), pp. 1-44, 2019-01-29. DOI: 10.1145/3281629.
Parallel dataflow engines such as Apache Hadoop, Apache Spark, and Apache Flink are an established alternative to relational databases for modern data analysis applications. A characteristic of these systems is a scalable programming model based on distributed collections and parallel transformations expressed by means of second-order functions such as map and reduce. Notable examples are Flink’s DataSet and Spark’s RDD programming abstractions. These programming models are realized as EDSLs, that is, domain-specific languages embedded in a general-purpose host language such as Java, Scala, or Python. This approach has several advantages over traditional external DSLs such as SQL or XQuery. First, syntactic constructs from the host language (e.g., anonymous function syntax, value definitions, and fluent syntax via method chaining) can be reused in the EDSL. This eases the learning curve for developers already familiar with the host language. Second, it allows for seamless integration of library methods written in the host language via the function parameters passed to the parallel dataflow operators. This reduces the effort of developing analytics dataflows that go beyond pure SQL and require domain-specific logic. At the same time, however, state-of-the-art parallel dataflow EDSLs exhibit a number of shortcomings. First, one of the main advantages of an external DSL such as SQL, namely the high-level, declarative Select-From-Where syntax, is either lost completely or mimicked in a non-standard way. Second, execution aspects such as caching, join order, and partial aggregation have to be decided by the programmer, and optimizing them automatically is very difficult due to the limited program context available in the intermediate representation of the DSL. In this article, we argue that the limitations listed above are a side effect of the adopted type-based embedding approach. As a solution, we propose an alternative EDSL design based on quotations. We present a DSL embedded in Scala and discuss its compiler pipeline, intermediate representation, and some of the enabled optimizations. We promote the algebraic type of bags in union representation as a model for distributed collections, and its associated structural recursion scheme and monad as a model for parallel collection processing. At the source code level, Scala’s comprehension syntax over a bag monad can be used to encode Select-From-Where expressions in a standard way. At the intermediate representation level, maintaining comprehensions as a first-class citizen simplifies the design and implementation of holistic dataflow optimizations that accommodate nesting and control flow. The proposed DSL design therefore reconciles the benefits of embedded parallel dataflow DSLs with the declarativity and optimization potential of external DSLs like SQL.
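The article's DSL is embedded in Scala; purely to illustrate the comprehension idea in a self-contained way, the Python sketch below writes a Select-From-Where query (join, filter, projection, then grouping as a fold) as a comprehension over in-memory bags, which is the kind of declarative expression a quotation-based compiler could lift into a dataflow plan. The schemas and data are invented for the example.

```python
from collections import defaultdict

# Two "bags" (multisets), here simply lists of dicts.
orders = [{"id": 1, "cust": "ann", "total": 30.0},
          {"id": 2, "cust": "bob", "total": 75.0},
          {"id": 3, "cust": "ann", "total": 120.0}]
customers = [{"name": "ann", "country": "DE"},
             {"name": "bob", "country": "US"}]

# SELECT c.country, o.total FROM orders o, customers c
# WHERE o.cust = c.name AND o.total > 50
joined = [(c["country"], o["total"])
          for o in orders
          for c in customers
          if o["cust"] == c["name"] and o["total"] > 50]

# ... GROUP BY c.country with SUM(o.total), expressed as a fold over the bag.
totals = defaultdict(float)
for country, total in joined:
    totals[country] += total

print(dict(totals))   # -> {'US': 75.0, 'DE': 120.0}
```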
Scalable Analytics on Fast Data
Andreas Kipf, Varun Pandey, Jan Böttcher, Lucas Braun, Thomas Neumann, A. Kemper. ACM Transactions on Database Systems (TODS), pp. 1-35, 2019-01-22. DOI: 10.1145/3283811.
Today’s streaming applications demand increasingly high event throughput rates and are often subject to strict latency constraints. To allow for more complex workloads, such as window-based aggregations, streaming systems need to support stateful event processing. This introduces new challenges for streaming engines as the state needs to be maintained in a consistent and durable manner and simultaneously accessed by complex queries for real-time analytics. Modern streaming systems, such as Apache Flink, do not allow for efficiently exposing the state to analytical queries. Thus, data engineers are forced to keep the state in external data stores, which significantly increases the latencies until events become visible to analytical queries. Proprietary solutions have been created to meet data freshness constraints. These solutions are expensive, error-prone, and difficult to maintain. Main-memory database systems, such as HyPer, achieve extremely low query response times while maintaining high update rates, which makes them well-suited for analytical streaming workloads. In this article, we explore extensions to database systems to match the performance and usability of streaming systems.
Parallelizing Sequential Graph Computations
W. Fan, Jingbo Xu, Yinghui Wu, Wenyuan Yu, Jiaxin Jiang, Zeyu Zheng, Bohan Zhang, Yang Cao, Chao Tian. ACM Transactions on Database Systems (TODS), pp. 1-39, 2018-12-16. DOI: 10.1145/3282488.
This article presents GRAPE, a parallel GRAPh Engine for graph computations. GRAPE differs from prior systems in its ability to parallelize existing sequential graph algorithms as a whole, without the need for recasting the entire algorithm into a new model. Underlying GRAPE are a simple programming model and a principled approach based on fixpoint computation that starts with partial evaluation and uses an incremental function as the intermediate consequence operator. We show that users can plug in existing sequential graph algorithms with minor additions, and GRAPE parallelizes the computation. Under a monotonic condition, the GRAPE parallelization is guaranteed to converge to correct answers as long as the sequential algorithms are correct. Moreover, we show that algorithms in MapReduce, BSP, and PRAM can be optimally simulated on GRAPE. In addition to the ease of programming, we experimentally verify that GRAPE achieves performance comparable to state-of-the-art graph systems on real-life and synthetic graphs.
On the Enumeration Complexity of Unions of Conjunctive Queries
Nofar Carmeli, Markus Kröll. ACM Transactions on Database Systems (TODS), pp. 1-41, 2018-12-10. DOI: 10.1145/3450263.
We study the enumeration complexity of Unions of Conjunctive Queries (UCQs). We aim to identify the UCQs that are tractable in the sense that their answer tuples can be enumerated with a linear preprocessing phase and a constant delay between successive tuples. It has been established that, in the absence of self-joins and under conventional complexity assumptions, the CQs that admit such an evaluation are precisely the free-connex ones. A union of tractable CQs is always tractable. We generalize the notion of free-connexity from CQs to UCQs, thus showing that some unions containing intractable CQs are, in fact, tractable. Interestingly, some unions consisting only of intractable CQs are tractable too. We also show how to use the techniques presented in this article in settings where the database contains cardinality dependencies (including functional dependencies and key constraints) or where the UCQs contain disequalities. The question of finding a full characterization of the tractability of UCQs remains open. Nevertheless, we prove that, for several classes of queries, free-connexity fully captures the tractable UCQs.
Learning From Query-Answers
Niccolò Meneghetti, Oliver Kennedy, Wolfgang Gatterbauer. ACM Transactions on Database Systems (TODS), pp. 1-41, 2018-12-08. DOI: 10.1145/3277503.
Tuple-independent and disjoint-independent probabilistic databases (TI- and DI-PDBs) represent uncertain data in a factorized form as a product of independent random variables that represent either tuples (TI-PDBs) or sets of tuples (DI-PDBs). When the user submits a query, the database derives the marginal probabilities of each output-tuple, exploiting the underlying assumptions of statistical independence. While query processing in TI- and DI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this article. We first introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. Building on B-PDBs, we then introduce Dirichlet Probabilistic Databases (D-PDBs), a generalization of DI-PDBs with similar properties. We provide the following key contributions for both B- and D-PDBs: (i) We study the complexity of performing Bayesian belief updates and devise efficient algorithms for certain tractable classes of queries; (ii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iii) we present an algorithm for efficiently computing conditional probabilities, allowing us to efficiently implement B- and D-PDBs via a standard relational engine; and (iv) we support our conclusions with extensive experimental results.
Optimal Bloom Filters and Adaptive Merging for LSM-Trees
Niv Dayan, Manos Athanassoulis, Stratos Idreos. ACM Transactions on Database Systems (TODS), pp. 1-48, 2018-12-08. DOI: 10.1145/3276980.
In this article, we show that key-value stores backed by a log-structured merge-tree (LSM-tree) exhibit an intrinsic tradeoff between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult-to-tune tradeoff among these metrics. We pinpoint the problem to the fact that modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters’ false-positive rates across the LSM-tree’s different levels. We present Monkey, an LSM-tree based key-value store that strikes the optimal balance between the costs of updates and lookups for any given main memory budget. The core insight is that worst-case lookup cost is proportional to the sum of the false-positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state-of-the-art key-value stores that assign a fixed number of bits per element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize the sum of their false-positive rates. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically, using an implementation on top of RocksDB, that Monkey reduces lookup latency by an increasing margin as the data volume grows (50–80% for the data sizes we experimented with). Furthermore, we map the design space onto a closed-form model that enables adapting the merging frequency and memory allocation to strike the best tradeoff among lookup cost, update cost, and main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the design of the key-value store for optimal performance.
MacroBase
Firas Abuzaid, Peter D. Bailis, Jialin Ding, Edward Gan, S. Madden, D. Narayanan, Kexin Rong, S. Suri. ACM Transactions on Database Systems (TODS), pp. 1-45, 2018-12-06. DOI: 10.1145/3276463.
As data volumes continue to rise, manual inspection is becoming increasingly untenable. In response, we present MacroBase, a data analytics engine that prioritizes end-user attention in high-volume fast data streams. MacroBase enables efficient, accurate, and modular analyses that highlight and aggregate important and unusual behavior, acting as a search engine for fast data. MacroBase is able to deliver order-of-magnitude speedups over alternatives by optimizing the combination of explanation (i.e., feature selection) and classification tasks and by leveraging a new reservoir sampler and heavy-hitters sketch specialized for fast data streams. As a result, MacroBase delivers accurate results at speeds of up to 2M events per second per query on a single core. The system has delivered meaningful results in production, including at a telematics company monitoring hundreds of thousands of vehicles.