In monotone classification, the input is a set P of n points in d-dimensional space, where each point carries a label 0 or 1. A point p dominates another point q if the coordinate of p is at least that of q on every dimension. A monotone classifier is a function h mapping each d-dimensional point to $\{0, 1\}$, subject to the condition that $h(p) \ge h(q)$ holds whenever p dominates q. The classifier h mis-classifies a point $p \in P$ if $h(p)$ is different from the label of p. The error of h is the number of points in P mis-classified by h. The objective is to find a monotone classifier with a small error. The problem is fundamental to numerous database applications in entity matching, record linkage, and duplicate detection. This paper studies two variants of the problem. In the first, active version, all the labels are hidden in the beginning; an algorithm must pay a unit cost to probe (i.e., reveal) the label of a point in P. We prove that $\Omega(n)$ probes are necessary to find an optimal classifier even in one-dimensional space ($d=1$). On the other hand, given an arbitrary $\epsilon > 0$, we show how to obtain (with high probability) a monotone classifier whose error is worse than the optimum by at most a $1 + \epsilon$ factor, while probing $\tilde{O}(w/\epsilon^2)$ labels, where w is the dominance width of P and $\tilde{O}(\cdot)$ hides a polylogarithmic factor. For constant $\epsilon$, the probing cost matches an existing lower bound up to an $\tilde{O}(1)$ factor. In the second, passive version, the point labels in P are explicitly given; the goal is to minimize CPU computation in finding an optimal classifier. We show that the problem can be settled in time polynomial in both d and n.
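The definitions above translate directly into code. The following is a minimal Python sketch (not from the paper): it checks dominance, verifies by brute force that a classifier is monotone over a point set, and counts the classifier's error; the toy 2-d instance and the threshold classifier are illustrative assumptions.

```python
from itertools import product

def dominates(p, q):
    """True if p's coordinate is at least q's on every dimension."""
    return all(pi >= qi for pi, qi in zip(p, q))

def is_monotone(h, points):
    """Check that h(p) >= h(q) whenever p dominates q (brute force over all pairs)."""
    return all(h(p) >= h(q) for p, q in product(points, repeat=2) if dominates(p, q))

def error(h, labeled_points):
    """Number of points whose label disagrees with the classifier h."""
    return sum(h(p) != label for p, label in labeled_points)

# Illustrative 2-d instance: label 1 iff both coordinates are "large".
data = [((1, 1), 0), ((2, 5), 0), ((4, 3), 1), ((5, 5), 1), ((3, 2), 0)]
h = lambda p: int(p[0] >= 3 and p[1] >= 3)   # a monotone threshold classifier
assert is_monotone(h, [p for p, _ in data])
print(error(h, data))                        # -> 0 on this toy input
```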
{"title":"New Algorithms for Monotone Classification","authors":"Yufei Tao, Yu Wang","doi":"10.1145/3452021.3458324","DOIUrl":"https://doi.org/10.1145/3452021.3458324","url":null,"abstract":"In em monotone classification, the input is a set P of n points in d-dimensional space, where each point carries a label 0 or 1. A point p em dominates another point q if the coordinate of p is at least that of q on every dimension. A em monotone classifier is a function h mapping each d-dimensional point to $0, 1 $, subject to the condition that $h(p) ge h(q)$ holds whenever p dominates q. The classifier h em mis-classifies a point $p in P$ if $h(p)$ is different from the label of p. The em error of h is the number of points in P mis-classified by h. The objective is to find a monotone classifier with a small error. The problem is fundamental to numerous database applications in entity matching, record linkage, and duplicate detection. This paper studies two variants of the problem. In the first em active version, all the labels are hidden in the beginning; an algorithm must pay a unit cost to em probe (i.e., reveal) the label of a point in P. We prove that $Ømega(n)$ probes are necessary to find an optimal classifier even in one-dimensional space ($d=1$). On the other hand, given an arbitrary $eps > 0$, we show how to obtain (with high probability) a monotone classifier whose error is worse than the optimum by at most a $1 + eps$ factor, while probing $tO(w/eps^2)$ labels, where w is the dominance width of P and $tO(.)$ hides a polylogarithmic factor. For constant $eps$, the probing cost matches an existing lower bound up to an $tO(1)$ factor. In the second em passive version, the point labels in P are explicitly given; the goal is to minimize CPU computation in finding an optimal classifier. We show that the problem can be settled in time polynomial to both d and n.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129858584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the greatest successes of computational complexity theory is the classification of countless fundamental computational problems into polynomial-time and NP-hard ones, two classes that are often referred to as tractable and intractable, respectively. However, this crude distinction of algorithmic efficiency is clearly insufficient when handling today's large-scale data. We need a finer-grained design and analysis of algorithms that pinpoints the exact exponent of the polynomial running time, and a better understanding of when a speed-up is not possible. Based on complexity assumptions stronger than P vs. NP, such as the Strong Exponential Time Hypothesis, conditional lower bounds for a variety of fundamental problems in P have recently been proposed. Unfortunately, these conditional lower bounds often break down when one may settle for a near-optimal solution. Indeed, approximation algorithms can play a significant role when designing fast algorithms not just for traditional NP-hard problems, but also for polynomial-time problems. For some applications arising in machine learning, the time complexity of the underlying algorithms is not sufficient to ensure a fast solution; it is often necessary to collect side information about the data to ensure high accuracy, which requires low query complexity. In this presentation, we will cover new facets of fast algorithm design for large-scale data analysis, emphasizing the role of approximation algorithms in achieving better polynomial time and query complexity.
{"title":"Approximation Algorithms for Large Scale Data Analysis","authors":"B. Saha","doi":"10.1145/3452021.3458813","DOIUrl":"https://doi.org/10.1145/3452021.3458813","url":null,"abstract":"One of the greatest successes of computational complexity theory is the classification of countless fundamental computational problems into polynomial-time and NP-hard ones, two classes that are often referred to as tractable and intractable, respectively. However, this crude distinction of algorithmic efficiency is clearly insufficient when handling today's large scale of data. We need a finer-grained design and analysis of algorithms that pinpoints the exact exponent of polynomial running time, and a better understanding of when a speed-up is not possible. Based on stronger complexity assumptions than P vs NP, like the Strong Exponential Time Hypothesis, recently conditional lower bounds for a variety of fundamental problems in P have been proposed. Unfortunately, these conditional lower bounds often break down when one may settle for a near-optimal solution. Indeed, approximation algorithms can play a significant role when designing fast algorithms not just for traditional NP Hard problems, but also for polynomial time problems. For some applications arising in machine learning, the time complexity of the underlying algorithms is not sufficient to ensure a fast solution. It is often needed to collect side information about the data to ensure high accuracy. This requires low query complexity. In this presentation, we will cover new facets of fast algorithm design for large scale data analysis that emphasizes on the role of developing approximation algorithms for better polynomial time/query complexity.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126640121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents an algorithm to process a multi-way join with load $\tilde{O}(n/p^{2/(\alpha\phi)})$ under the MPC model, where n is the number of tuples in the input relations, α the maximum arity of those relations, p the number of machines, and φ a newly introduced parameter called the generalized vertex packing number. The algorithm owes its performance to two new findings. The first is a two-attribute skew free technique to partition the join result for parallel computation. The second is an isolated Cartesian product theorem, which provides fresh graph-theoretic insights on joins with $\alpha \ge 3$ and generalizes an existing theorem on $\alpha = 2$.
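For background, the snippet below illustrates the kind of tuple partitioning that underlies MPC join algorithms: the classic hypercube (share) scheme for the triangle join, which is a textbook baseline and not the paper's two-attribute skew free technique. The grid size and hash functions are illustrative assumptions; every output triangle is assembled on exactly one machine.

```python
import hashlib

def h(value, buckets, salt):
    """Deterministic hash of a value into [0, buckets)."""
    digest = hashlib.sha256(f"{salt}:{value}".encode()).hexdigest()
    return int(digest, 16) % buckets

def hypercube_partition(R, S, T, side):
    """Distribute R(A,B), S(B,C), T(A,C) over a side x side x side grid of machines
    so that each triangle (a,b,c) is assembled on machine (h(a), h(b), h(c))."""
    machines = {(i, j, k): {"R": [], "S": [], "T": []}
                for i in range(side) for j in range(side) for k in range(side)}
    for a, b in R:                       # R is replicated along the C dimension
        for k in range(side):
            machines[(h(a, side, "A"), h(b, side, "B"), k)]["R"].append((a, b))
    for b, c in S:                       # S is replicated along the A dimension
        for i in range(side):
            machines[(i, h(b, side, "B"), h(c, side, "C"))]["S"].append((b, c))
    for a, c in T:                       # T is replicated along the B dimension
        for j in range(side):
            machines[(h(a, side, "A"), j, h(c, side, "C"))]["T"].append((a, c))
    return machines

def local_triangles(m):
    """Each machine joins its local fragments independently."""
    return {(a, b, c)
            for a, b in m["R"]
            for b2, c in m["S"] if b2 == b
            for a2, c2 in m["T"] if a2 == a and c2 == c}

# Tiny usage example on a 2 x 2 x 2 grid.
R = [(1, 2), (1, 3)]; S = [(2, 4), (3, 4)]; T = [(1, 4)]
parts = hypercube_partition(R, S, T, side=2)
print(sorted(set().union(*map(local_triangles, parts.values()))))  # [(1, 2, 4), (1, 3, 4)]
```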
{"title":"Two-Attribute Skew Free, Isolated CP Theorem, and Massively Parallel Joins","authors":"Miao Qiao, Yufei Tao","doi":"10.1145/3452021.3458321","DOIUrl":"https://doi.org/10.1145/3452021.3458321","url":null,"abstract":"This paper presents an algorithm to process a multi-way join with load $tO(n/p^2/(α φ) )$ under the MPC model, where n is the number of tuples in the input relations, α the maximum arity of those relations, p the number of machines, and φ a newly introduced parameter called the em generalized vertex packing number. The algorithm owes to two new findings. The first is a em two-attribute skew free technique to partition the join result for parallel computation. The second is an em isolated cartesian product theorem, which provides fresh graph-theoretic insights on joins with α ge 3$ and generalizes an existing theorem on α = 2$.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133465356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper considers the worst-case complexity of multi-round join evaluation in the Massively Parallel Computation (MPC) model. Unlike the sequential RAM model, in which there is a unified optimal algorithm based on the AGM bound for all join queries, worst-case optimal algorithms have been achieved on a very restrictive class of joins in the MPC model. The only known lower bound is still derived from the AGM bound, in terms of the optimal fractional edge covering number of the query. In this work, we make efforts towards bridging this gap. We design an instance-dependent algorithm for the class of α-acyclic join queries. In particular, when the maximum size of the input relations is bounded, its complexity has a closed form in terms of the optimal fractional edge covering number of the query, which is worst-case optimal. Beyond acyclic joins, we find, surprisingly, that the optimal fractional edge covering number does not lead to a tight lower bound. More specifically, for a class of cyclic joins we prove a higher lower bound in terms of the optimal fractional edge packing number of the query, which is matched by existing algorithms and is thus optimal. This new result displays a significant distinction for join evaluation, not only between acyclic and cyclic joins, but also between the fine-grained RAM and coarse-grained MPC models.
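Both quantities contrasted in this abstract are optima of small linear programs over the query's hypergraph. As a concrete illustration (an assumed sketch using scipy, not anything prescribed by the paper), the code below computes the optimal fractional edge covering and edge packing numbers, using the triangle query R(A,B), S(B,C), T(A,C) as an example; both equal 3/2 there.

```python
import numpy as np
from scipy.optimize import linprog

def incidence(vertices, edges):
    """0/1 matrix M with M[v][e] = 1 iff vertex v appears in hyperedge e."""
    return np.array([[1.0 if v in e else 0.0 for e in edges] for v in vertices])

def fractional_edge_cover_number(vertices, edges):
    # minimize sum_e x_e  s.t.  sum_{e containing v} x_e >= 1 for every vertex v,  x >= 0
    M = incidence(vertices, edges)
    res = linprog(c=np.ones(len(edges)), A_ub=-M, b_ub=-np.ones(len(vertices)),
                  bounds=[(0, None)] * len(edges), method="highs")
    return res.fun

def fractional_edge_packing_number(vertices, edges):
    # maximize sum_e x_e  s.t.  sum_{e containing v} x_e <= 1 for every vertex v,  x >= 0
    M = incidence(vertices, edges)
    res = linprog(c=-np.ones(len(edges)), A_ub=M, b_ub=np.ones(len(vertices)),
                  bounds=[(0, None)] * len(edges), method="highs")
    return -res.fun

# Triangle query: both numbers equal 1.5 (each edge gets weight 1/2).
V = ["A", "B", "C"]
E = [{"A", "B"}, {"B", "C"}, {"A", "C"}]
print(fractional_edge_cover_number(V, E), fractional_edge_packing_number(V, E))
```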
{"title":"Cover or Pack: New Upper and Lower Bounds for Massively Parallel Joins","authors":"Xiao Hu","doi":"10.1145/3452021.3458319","DOIUrl":"https://doi.org/10.1145/3452021.3458319","url":null,"abstract":"This paper considers the worst-case complexity of multi-round join evaluation in the Massively Parallel Computation (MPC) model. Unlike the sequential RAM model, in which there is a unified optimal algorithm based on the AGM bound for all join queries, worst-case optimal algorithms have been achieved on a very restrictive class of joins in the MPC model. The only known lower bound is still derived from the AGM bound, in terms of the optimal fractional edge covering number of the query. In this work, we make efforts towards bridging this gap. We design an instance-dependent algorithm for the class of α-acyclic join queries. In particular, when the maximum size of input relations is bounded, this complexity has a closed form in terms of the optimal fractional edge covering number of the query, which is worst-case optimal. Beyond acyclic joins, we surprisingly find that the optimal fractional edge covering number does not lead to a tight lower bound. More specifically, we prove for a class of cyclic joins a higher lower bound in terms of the optimal fractional edge packing number of the query, which is matched by existing algorithms, thus optimal. This new result displays a significant distinction for join evaluation, not only between acyclic and cyclic joins, but also between the fine-grained RAM and coarse-grained MPC model.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"66 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114125004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We show that deciding boundedness (aka FO-rewritability) of monadic single rule datalog programs (sirups) is 2Exp-hard, which matches the upper bound known since 1988 and finally settles a long-standing open problem. We obtain this result as a byproduct of an attempt to classify monadic 'disjunctive sirups'---Boolean conjunctive queries $q$ with unary and binary predicates mediated by a disjunctive rule $T(x) \lor F(x) \leftarrow A(x)$---according to the data complexity of their evaluation. Apart from establishing that deciding FO-rewritability of disjunctive sirups with a dag-shaped $q$ is also 2Exp-hard, we make substantial progress towards obtaining a complete FO/L-hardness dichotomy of disjunctive sirups with ditree-shaped $q$.
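To fix intuition about what a (non-disjunctive) monadic sirup is and what boundedness asks, here is a small sketch (an illustrative example, not from the paper): it evaluates a classic sirup by naive forward chaining and reports how many rounds the fixpoint takes. Boundedness asks whether that number of rounds can be bounded by a constant independent of the database, which is equivalent to the program being FO-rewritable; the sirup below is unbounded.

```python
def evaluate_sirup(A, E):
    """Naive forward chaining for the monadic sirup
         P(x) <- A(x)             (exit rule)
         P(x) <- E(x, y), P(y)    (single recursive rule)
       Returns the fixpoint of P and the number of rounds needed to reach it.
       This sirup is unbounded: a longer E-chain forces more rounds, so no
       constant round bound (and hence no FO rewriting) exists."""
    P, rounds = set(A), 0
    while True:
        derived = {x for (x, y) in E if y in P}
        if derived <= P:
            return P, rounds
        P |= derived
        rounds += 1

A = {0}                                   # exit facts
E = {(i + 1, i) for i in range(5)}        # an E-chain 5 -> 4 -> ... -> 0
print(evaluate_sirup(A, E))               # ({0, 1, 2, 3, 4, 5}, 5)
```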
{"title":"Deciding Boundedness of Monadic Sirups","authors":"S. Kikot, Á. Kurucz, V. Podolskii, M. Zakharyaschev","doi":"10.1145/3452021.3458332","DOIUrl":"https://doi.org/10.1145/3452021.3458332","url":null,"abstract":"We show that deciding boundedness (aka FO-rewritability) of monadic single rule datalog programs (sirups) is 2Exp-hard, which matches the upper bound known since 1988 and finally settles a long-standing open problem. We obtain this result as a byproduct of an attempt to classify monadic 'disjunctive sirups'---Boolean conjunctive queries $q$ with unary and binary predicates mediated by a disjunctive rule $T(x) łor F(x) łeftarrow A(x)$---according to the data complexity of their evaluation. Apart from establishing that deciding FO-rewritability of disjunctive sirups with a dag-shaped $q$ is also 2Exp-hard, we make substantial progress towards obtaining a complete FO/Ł-hardness dichotomy of disjunctive sirups with ditree-shaped $q$.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129397897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider tuple-independent probabilistic databases in a dynamic setting, where tuples can be inserted or deleted. In this context we are interested in efficient data structures for maintaining the query result of Boolean as well as non-Boolean queries. For Boolean queries, we show how the known lifted inference rules can be made dynamic, so that they support single-tuple updates with only a constant number of arithmetic operations. As a consequence, we obtain that the probability of every safe UCQ can be maintained with constant update time. For non-Boolean queries, our task is to enumerate all result tuples ranked by their probability. We develop lifted inference rules for non-Boolean queries, and, based on these rules, provide a dynamic data structure that allows both log-time updates and ranked enumeration with logarithmic delay. As an application, we identify a fragment of non-repeating conjunctive queries that supports log-time updates as well as log-delay ranked enumeration. This characterisation is tight under the OMv-conjecture.
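As a minimal illustration of constant-time maintenance under single-tuple updates, the sketch below handles only the simplest lifted-inference step (an independent "or" over the tuples of one tuple-independent table), not the paper's full rule set; the class name and predicate are assumptions of this example. It keeps the probability that at least one currently present tuple satisfies a fixed condition, using O(1) arithmetic operations per update.

```python
class IndependentOr:
    """Maintains P(some present tuple satisfies the condition) for a
    tuple-independent table under inserts and deletes of single tuples.
    We store the product of (1 - p_t) over qualifying tuples; the answer
    is 1 minus that product, so each update costs O(1) arithmetic."""

    def __init__(self, qualifies):
        self.qualifies = qualifies    # predicate deciding whether a tuple matters
        self.not_prob = 1.0           # product of (1 - p_t) over qualifying tuples

    def insert(self, tup, prob):
        if self.qualifies(tup):
            self.not_prob *= (1.0 - prob)

    def delete(self, tup, prob):
        if self.qualifies(tup):
            self.not_prob /= (1.0 - prob)   # assumes prob < 1; exact arithmetic avoids drift

    def probability(self):
        return 1.0 - self.not_prob

# Boolean query: "some tuple with value > 10 is present".
q = IndependentOr(lambda t: t[0] > 10)
q.insert((12,), 0.5); q.insert((20,), 0.5); q.insert((3,), 0.9)
print(q.probability())   # 0.75  (the value-3 tuple does not qualify)
q.delete((12,), 0.5)
print(q.probability())   # 0.5
```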
{"title":"Probabilistic Databases under Updates: Boolean Query Evaluation and Ranked Enumeration","authors":"Christoph Berkholz, M. Merz","doi":"10.1145/3452021.3458326","DOIUrl":"https://doi.org/10.1145/3452021.3458326","url":null,"abstract":"We consider tuple-independent probabilistic databases in a dynamic setting, where tuples can be inserted or deleted. In this context we are interested in efficient data structures for maintaining the query result of Boolean as well as non-Boolean queries. For Boolean queries, we show how the known lifted inference rules can be made dynamic, so that they support single-tuple updates with only a constant number of arithmetic operations. As a consequence, we obtain that the probability of every safe UCQ can be maintained with constant update time. For non-Boolean queries, our task is to enumerate all result tuples ranked by their probability. We develop lifted inference rules for non-Boolean queries, and, based on these rules, provide a dynamic data structure that allows both log-time updates and ranked enumeration with logarithmic delay. As an application, we identify a fragment of non-repeating conjunctive queries that supports log-time updates as well as log-delay ranked enumeration. This characterisation is tight under the OMv-conjecture.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131539482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Processing tree-structured data in the streaming model is a challenge: capturing regular properties of streamed trees by means of a stack is costly in memory, but falling back to finite-state automata drastically limits the computational power. We propose an intermediate stackless model based on register automata equipped with a single counter, used to maintain the current depth in the tree. We explore the power of this model to validate and query streamed trees. Our main result is an effective characterization of regular path queries (RPQs) that can be evaluated stacklessly---with and without registers. In particular, we confirm the conjectured characterization of tree languages defined by DTDs that are recognizable without registers, by Segoufin and Vianu (2002), in the special case of tree languages defined by means of an RPQ.
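To see what a single counter buys in this setting, here is a toy sketch (an assumed example, not the paper's construction): it consumes a SAX-style stream of open/close events, uses one integer register to track the current depth, validates well-nesting, and answers the fixed-depth path query "count the b elements opened at depth 2", all without a stack.

```python
def stream_query(events, target_tag="b", target_depth=2):
    """events: iterable of ('open', tag) / ('close', tag) pairs.
    A single counter (the current depth) suffices to validate nesting and to
    answer this fixed-depth query, so memory stays constant however deep the tree is."""
    depth, hits = 0, 0
    for kind, tag in events:
        if kind == "open":
            depth += 1
            if tag == target_tag and depth == target_depth:
                hits += 1
        else:
            depth -= 1
            if depth < 0:
                raise ValueError("close event without a matching open")
    if depth != 0:
        raise ValueError("stream ended with unclosed elements")
    return hits

# <r><b><c/></b><b/></r>: the two b's opened at depth 2 are counted.
stream = [("open", "r"), ("open", "b"), ("open", "c"), ("close", "c"),
          ("close", "b"), ("open", "b"), ("close", "b"), ("close", "r")]
print(stream_query(stream))   # -> 2
```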
{"title":"Stackless Processing of Streamed Trees","authors":"Corentin Barloy, Filip Murlak, Charles Paperman","doi":"10.1145/3452021.3458320","DOIUrl":"https://doi.org/10.1145/3452021.3458320","url":null,"abstract":"Processing tree-structured data in the streaming model is a challenge: capturing regular properties of streamed trees by means of a stack is costly in memory, but falling back to finite-state automata drastically limits the computational power. We propose an intermediate stackless model based on register automata equipped with a single counter, used to maintain the current depth in the tree. We explore the power of this model to validate and query streamed trees. Our main result is an effective characterization of regular path queries (RPQs) that can be evaluated stacklessly---with and without registers. In particular, we confirm the conjectured characterization of tree languages defined by DTDs that are recognizable without registers, by Segoufin and Vianu (2002), in the special case of tree languages defined by means of an RPQ.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123929039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This is the companion paper of a talk in the Gems of PODS series; it reviews the development, starting at PODS 1988, of a family of Datalog-like languages with procedural, forward-chaining semantics, providing an alternative to the classical declarative, model-theoretic semantics. These languages also provide a unified formalism that can express important classes of queries, including fixpoint, while, and all computable queries. They can also incorporate updates and nondeterminism in a natural fashion. Datalog variants with forward-chaining semantics have been adopted in a variety of settings, including active databases, production systems, distributed data exchange, and data-driven reactive systems.
{"title":"Datalog Unchained","authors":"V. Vianu","doi":"10.1145/3452021.3458815","DOIUrl":"https://doi.org/10.1145/3452021.3458815","url":null,"abstract":"This is the companion paper of a talk in the Gems of PODS series, that reviews the development, starting at PODS 1988, of a family of Datalog-like languages with procedural, forward chaining semantics, providing an alternative to the classical declarative, model-theoretic semantics. These languages also provide a unified formalism that can express important classes of queries including fixpoint, while, and all computable queries. They can also incorporate in a natural fashion updates and nondeterminism. Datalog variants with forward chaining semantics have been adopted in a variety of settings, including active databases, production systems, distributed data exchange, and data-driven reactive systems.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134061260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Consistent query answering (CQA) aims to deliver meaningful answers when queries are evaluated over inconsistent databases. Such answers must be certainly true in all repairs, which are consistent databases whose difference from the inconsistent one is somehow minimal. Although CQA provides a clean framework for querying inconsistent databases, it is arguably more informative to compute the percentage of repairs in which a candidate answer is true, instead of simply saying that it is true in all repairs, or is false in at least one repair. It should not be surprising, though, that computing this percentage is computationally hard. On the other hand, for practically relevant settings such as conjunctive queries and primary keys, there are data-efficient randomized approximation schemes for approximating this percentage. Our goal is to perform a thorough experimental evaluation and comparison of those approximation schemes. Our analysis provides new insights into which technique is preferable depending on key characteristics of the input, and it further provides evidence that making approximate CQA, as described above, feasible in practice is not an unrealistic goal.
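To make the "percentage of repairs" semantics concrete, here is a naive Monte Carlo sketch for the primary-key setting (an illustrative baseline only; the randomized approximation schemes benchmarked in the paper are more sophisticated). A repair keeps exactly one tuple per key group; repairs are sampled uniformly, and the fraction of samples in which a Boolean query holds estimates the desired percentage. The relation, key, and query are assumptions of this example.

```python
import random
from collections import defaultdict

def sample_repair(table, key):
    """A uniformly random repair under a primary key: keep one tuple per key group."""
    groups = defaultdict(list)
    for t in table:
        groups[key(t)].append(t)
    return [random.choice(group) for group in groups.values()]

def estimate_answer_frequency(table, key, query, samples=10_000):
    """Estimate the fraction of repairs in which the Boolean query is true."""
    hits = sum(query(sample_repair(table, key)) for _ in range(samples))
    return hits / samples

# Inconsistent relation Emp(id, dept) with primary key id: it has two repairs,
# and the query "someone works in HR" is true in exactly one of them.
emp = [(1, "HR"), (1, "IT"), (2, "IT")]
freq = estimate_answer_frequency(emp, key=lambda t: t[0],
                                 query=lambda repair: any(d == "HR" for _, d in repair))
print(round(freq, 2))   # close to 0.5
```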
{"title":"Benchmarking Approximate Consistent Query Answering","authors":"M. Calautti, Marco Console, Andreas Pieris","doi":"10.1145/3452021.3458309","DOIUrl":"https://doi.org/10.1145/3452021.3458309","url":null,"abstract":"Consistent query answering (CQA) aims to deliver meaningful answers when queries are evaluated over inconsistent databases. Such answers must be certainly true in all repairs, which are consistent databases whose difference from the inconsistent one is somehow minimal. Although CQA provides a clean framework for querying inconsistent databases, it is arguably more informative to compute the percentage of repairs in which a candidate answer is true, instead of simply saying that is true in all repairs, or is false in at least one repair. It should not be surprising, though, that computing this percentage is computationally hard. On the other hand, for practically relevant settings such as conjunctive queries and primary keys, there are data-efficient randomized approximation schemes for approximating this percentage. Our goal is to perform a thorough experimental evaluation and comparison of those approximation schemes. Our analysis provides new insights on which technique is indicated depending on key characteristics of the input, and it further provides evidence that making approximate CQA as described above feasible in practice is not an unrealistic goal.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127683956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a type-theoretic framework for data stream processing for real-time decision making, where the desired computation involves a mix of sequential computation, such as smoothing and detection of peaks and surges, and naturally parallel computation, such as relational operations, key-based partitioning, and map-reduce. Our framework unifies sequential (ordered) and relational (unordered) data models. In particular, we define synchronization schemas as types, and series-parallel streams (SPS) as objects of these types. A synchronization schema imposes a hierarchical structure over relational types that succinctly captures ordering and synchronization requirements among different kinds of data items. Series-parallel streams naturally model objects such as relations, sequences, sequences of relations, sets of streams indexed by key values, time-based and event-based windows, and more complex structures obtained by nesting of these. We introduce series-parallel stream transformers (SPST) as a domain-specific language for modular specification of deterministic transformations over such streams. SPSTs provably specify only monotonic transformations allowing streamability, have a modular structure that can be exploited for correct parallel implementation, and are composable allowing specification of complex queries as a pipeline of transformations.
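The following is a speculative sketch of how the hierarchical types described above might be rendered in code (the names and the encoding are assumptions of this illustration, not the paper's formal definitions): leaves are relation-like item types, one kind of internal node composes its children sequentially (ordered), and another indexes an inner stream by a key (unordered and naturally parallel).

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Item:                 # leaf: a relation-like type of unordered data items
    name: str
    fields: Tuple[str, ...]

@dataclass(frozen=True)
class Ordered:              # children are consumed in sequence (sequential composition)
    children: Tuple["Schema", ...]

@dataclass(frozen=True)
class KeyedParallel:        # one inner stream per key value (naturally parallel)
    key: str
    child: "Schema"

Schema = Union[Item, Ordered, KeyedParallel]

# E.g., per-sensor streams, each alternating a batch of readings with a summary.
schema = KeyedParallel(
    key="sensor_id",
    child=Ordered((
        Item("readings", ("time", "value")),
        Item("summary", ("window_end", "avg")),
    )),
)
print(schema)
```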
{"title":"Synchronization Schemas","authors":"R. Alur, Phillip Hilliard, Z. Ives, Konstantinos Kallas, Konstantinos Mamouras, Filip Niksic, C. Stanford, V. Tannen, Anton Xue","doi":"10.1145/3452021.3458317","DOIUrl":"https://doi.org/10.1145/3452021.3458317","url":null,"abstract":"We present a type-theoretic framework for data stream processing for real-time decision making, where the desired computation involves a mix of sequential computation, such as smoothing and detection of peaks and surges, and naturally parallel computation, such as relational operations, key-based partitioning, and map-reduce. Our framework unifies sequential (ordered) and relational (unordered) data models. In particular, we define synchronization schemas as types, and series-parallel streams (SPS) as objects of these types. A synchronization schema imposes a hierarchical structure over relational types that succinctly captures ordering and synchronization requirements among different kinds of data items. Series-parallel streams naturally model objects such as relations, sequences, sequences of relations, sets of streams indexed by key values, time-based and event-based windows, and more complex structures obtained by nesting of these. We introduce series-parallel stream transformers (SPST) as a domain-specific language for modular specification of deterministic transformations over such streams. SPSTs provably specify only monotonic transformations allowing streamability, have a modular structure that can be exploited for correct parallel implementation, and are composable allowing specification of complex queries as a pipeline of transformations.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122819801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}