
Latest Publications: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Consistent Query Answering for Primary Keys and Conjunctive Queries with Negated Atoms
Paraschos Koutris, J. Wijsen
This paper studies query answering on databases that may be inconsistent with respect to primary key constraints. A repair is any consistent database that is obtained by deleting a minimal set of tuples. Given a Boolean query q, the problem CERTAINTY(q) takes a database as input and asks whether q is true in every repair of the database. A significant complexity classification task is to determine, given q, whether CERTAINTY(q) is first-order definable (and thus solvable by a single SQL query). This problem has been extensively studied for self-join-free conjunctive queries. An important extension of this class of queries is to allow negated atoms. It turns out that if negated atoms are allowed, CERTAINTY(q) can express some classical matching problems. This paper studies the existence and construction of first-order definitions for CERTAINTY(q) for q in the class of self-join-free conjunctive queries with negated atoms.
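For intuition, here is a minimal brute-force sketch of what CERTAINTY(q) asks, written by us and not taken from the paper: enumerate every repair of a table whose primary key is violated, and check the Boolean query q in each. The relation layout and the example query are hypothetical.

```python
from itertools import product

# Sketch: decide CERTAINTY(q) by enumerating all repairs of a single table
# R(key, value) whose first attribute is the primary key. Exponential in the
# number of key violations; real algorithms avoid this blow-up.

def repairs(table):
    """Yield every repair: keep exactly one tuple from each key-group."""
    groups = {}
    for t in table:
        groups.setdefault(t[0], []).append(t)
    for choice in product(*groups.values()):
        yield set(choice)

def certain(q, table):
    """True iff the Boolean query q holds in every repair."""
    return all(q(r) for r in repairs(table))

table = [('a', 1), ('a', 2), ('b', 3)]   # key 'a' has two candidate tuples
q = lambda r: ('a', 1) in r              # Boolean query: does R contain ('a', 1)?
print(certain(q, table))                 # False: repair {('a',2),('b',3)} falsifies q
```

When CERTAINTY(q) is first-order definable, this exponential enumeration can be replaced by a single SQL query over the inconsistent database, which is exactly the classification question the paper studies.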
Citations: 16
Blockchains: Past, Present, and Future
Arvind Narayanan
Blockchain technology is assembled from pieces that have long pedigrees in the academic literature, such as linked timestamping, consensus, and proof of work. In this tutorial, I'll begin by summarizing these components and how they fit together in Bitcoin's blockchain design. Then I'll present abstract models of blockchains; such abstractions help us understand and reason about the similarities and differences between the numerous proposed blockchain designs in a succinct way. Here is one such abstraction. Blockchains can be understood in terms of (1) a log of messages: for example, a ledger of financial transactions; (2) the state that summarizes the result of processing the log: for example, a set of account balances; (3) a set of validity rules for messages/state updates: for example, transactions must spend no more than the available balances, must have verifiable signatures, etc; (4) consistency rules that determine whether two views of the log by different participants on the network are consistent with each other. In the second half of the tutorial I'll describe several research directions, focusing on those likely to be of interest to the PODS community. Here are a few examples. Efficient verification of state. A participant might want to verify a statement about a small part of the global state, such as the inclusion of a particular transaction in the blockchain. While the basics have been worked out, and involve techniques such as hash pointers, Merkle trees, and other "authenticated data structures", many interesting questions remain. Reconciling different views of consensus. In the game theory view of blockchains, all players are rational and follow their incentives; there are no honest, faulty, or malicious players. When does this view lead to similar or different predictions compared to the traditional consensus literature? Can we come up with hybrid models that reconcile these assumptions? Scaling and sharding. In traditional designs, the blockchain is fully replicated by every node, leading to massive inefficiency and severely limiting transaction throughput. What are the fundamental limits to scaling, and how can we improve scalability without weakening security? In particular, is it possible to shard the blockchain, that is, partition it among subsets of nodes, given the Byzantine setting?
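The four-part abstraction above can be made concrete in a few lines. The toy ledger below is our own sketch, not code from the tutorial; signatures and the consistency rules of item (4) are omitted.

```python
# (1) log of messages, (2) derived state, (3) validity rules gating appends.
state = {"alice": 10, "bob": 0}            # account balances
log = []                                   # ledger of transactions

def valid(tx, state):
    # validity rule: spend no more than the available balance
    sender, receiver, amount = tx
    return amount > 0 and state.get(sender, 0) >= amount

def append(tx):
    if not valid(tx, state):
        raise ValueError("invalid transaction")
    log.append(tx)
    sender, receiver, amount = tx
    state[sender] -= amount
    state[receiver] = state.get(receiver, 0) + amount

append(("alice", "bob", 7))
print(state)                               # {'alice': 3, 'bob': 7}
```

Item (4) would compare two participants' logs for consistency, which is where the consensus questions discussed in the tutorial enter.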
Citations: 5
Enumeration for FO Queries over Nowhere Dense Graphs
Nicole Schweikardt, L. Segoufin, Alexandre Vigny
We consider the evaluation of first-order queries over classes of databases that are nowhere dense. The notion of nowhere dense classes was introduced by Nesetril and Ossona de Mendez as a formalization of classes of "sparse" graphs and generalizes many well-known classes of graphs, such as classes of bounded degree, bounded tree-width, or bounded expansion. It has recently been shown by Grohe, Kreutzer, and Siebertz that over nowhere dense classes of databases, first-order sentences can be evaluated in pseudo-linear time (pseudo-linear time means that for all ε there exists an algorithm working in time O(n^{1+ε}), where n is the size of the database). For first-order queries of higher arities, we show that over any nowhere dense class of databases, the set of their solutions can be enumerated with constant delay after a pseudo-linear time preprocessing. In the same context, we also show that after a pseudo-linear time preprocessing we can, on input of a tuple, test in constant time whether it is a solution to the query.
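The following schematic sketch, ours and vastly simpler than the paper's machinery, illustrates the evaluation contract itself: one linear preprocessing pass over the database, then a generator that emits each answer with O(1) work between consecutive outputs. The relations and the example query are hypothetical.

```python
# Constant-delay enumeration after linear preprocessing, for the toy query
# "pairs (u, v) such that E(u, v) holds and v is red".

def preprocess(edges, red):
    index = {}                             # linear pass: red out-neighbours
    for u, v in edges:
        if v in red:
            index.setdefault(u, []).append(v)
    return index

def answers(index):
    for u, vs in index.items():            # O(1) work between yields
        for v in vs:
            yield (u, v)

idx = preprocess(edges=[(1, 2), (1, 3), (2, 3)], red={3})
print(list(answers(idx)))                  # [(1, 3), (2, 3)]
```

The paper's contribution is achieving this contract for all first-order queries over nowhere dense classes, where no such direct index is available.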
Citations: 32
Optimal Differentially Private Algorithms for k-Means Clustering
Zhiyi Huang, Jinyan Liu
We consider privacy-preserving k-means clustering. For the objective of minimizing the Wasserstein distance between the output and the optimal solution, we show that there is a polynomial-time (ε, δ)-differentially private algorithm which, for any sufficiently large Φ²-well-separated dataset, outputs k centers that are within Wasserstein distance O(Φ²) from the optimal. This result improves the previous bounds by removing the dependence on ε, the number of centers k, and the dimension d. Further, we prove a matching lower bound: no (ε, δ)-differentially private algorithm can guarantee Wasserstein distance less than Ω(Φ²), and thus our positive result is optimal up to a constant factor. For minimizing the k-means objective when the dimension d is bounded, we propose a polynomial-time private local search algorithm that outputs an αn-additive approximation when the size of the dataset is at least Õ(k^{3/2} · d · ε^{-1} · poly(α^{-1})).
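For orientation, the sketch below shows one noisy Lloyd step, a common ε-differentially-private baseline for k-means and not the algorithm of this paper: cluster sums and counts are perturbed with Laplace noise before the centers are recomputed. Data in [0,1]^d and the even budget split are assumptions of the sketch.

```python
import numpy as np

def noisy_lloyd_step(X, centers, eps, rng):
    """One Lloyd iteration with Laplace noise on per-cluster sums and counts."""
    k, d = centers.shape
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    new_centers = centers.copy()
    for j in range(k):
        pts = X[labels == j]
        # one point changes a cluster sum by <= 1 per coordinate and the
        # count by 1, so scale the noise to sensitivity / (eps / 2)
        noisy_sum = pts.sum(0) + rng.laplace(0, 2 * d / eps, size=d)
        noisy_cnt = len(pts) + rng.laplace(0, 2 / eps)
        if noisy_cnt > 1:
            new_centers[j] = np.clip(noisy_sum / noisy_cnt, 0, 1)
    return new_centers

rng = np.random.default_rng(0)
X = rng.random((500, 2))
print(noisy_lloyd_step(X, centers=rng.random((3, 2)), eps=1.0, rng=rng))
```

The paper's point is precisely that such generic recipes leave an ε, k, and d dependence in the error that its algorithm removes.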
Citations: 32
Reflections on Schema Mappings, Data Exchange, and Metadata Management
Phokion G. Kolaitis
A schema mapping is a high-level specification of the relationship between two database schemas. For the past fifteen years, schema mappings have played an essential role in the modeling and analysis of data exchange, data integration, and related data inter-operability tasks. The aim of this talk is to critically reflect on the body of work carried out to date, describe some of the persisting challenges, and suggest directions for future work. The first part of the talk will focus on schema-mapping languages, especially on the language of GLAV (global-and-local as view) mappings and its two main sublanguages, the language of GAV (global-as-view) mappings and the language of LAV (local-as-view) mappings. After highlighting the fundamental structural properties of these languages, we will discuss how structural properties can actually characterize schema-mapping languages. The second part of the talk will focus on metadata management by considering operators on schema mappings, such as the composition operator and the inverse operator. We will discuss why richer languages are needed to express these operators, and will illustrate some of their uses in schema-mapping evolution. The third and final part of the talk will focus on the derivation of schema mappings from semantic information. In particular, we will discuss a variety of approaches for deriving schema mappings from data examples, including casting the derivation of schema mappings as an optimization problem and as a learning problem.
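For readers unfamiliar with the acronyms, the three languages are standardly written as source-to-target tuple-generating dependencies of the following shapes; these are textbook forms, not formulas from the talk.

```latex
% GLAV: a conjunction of source atoms implies a conjunction of target atoms
\forall \bar{x}\, \bigl( \varphi_S(\bar{x}) \rightarrow \exists \bar{y}\, \psi_T(\bar{x}, \bar{y}) \bigr)
% GAV: the target side is a single atom without existential variables
\forall \bar{x}\, \bigl( \varphi_S(\bar{x}) \rightarrow T(\bar{x}) \bigr)
% LAV: the source side is a single atom
\forall \bar{x}\, \bigl( S(\bar{x}) \rightarrow \exists \bar{y}\, \psi_T(\bar{x}, \bar{y}) \bigr)
```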
Citations: 13
Entity Matching with Active Monotone Classification
Yufei Tao
Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ∈ X × Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the correct verdict cannot be determined without investigation efforts dedicated specifically to the two entities x and y involved. It is therefore important to design an algorithm that asks humans to look at only some pairs, and renders the verdicts on the other pairs automatically with good accuracy. At the core of most (if not all) existing approaches is the following classification problem. The input is a set P of points in R^d, each of which carries a binary label: 0 or 1. A classifier F is a function from R^d to {0, 1}. The objective is to find a classifier that captures the labels of a large number of points in P. In this paper, we cast the problem as an instance of active learning where the goal is to learn a monotone classifier F, namely, F(p) ≥ F(q) holds whenever the coordinate of p is at least that of q on all dimensions. In our formulation, the labels of all points in P are hidden at the beginning. An algorithm A can invoke an oracle, which discloses the label of a point p ∈ P chosen by A. The algorithm may do so repetitively, until it has garnered enough information to produce F. The cost of A is the number of times that the oracle is called. The challenge is to strike a good balance between the cost and the accuracy of the classifier produced. We describe algorithms with non-trivial guarantees on the cost and accuracy simultaneously. We also prove lower bounds that establish the asymptotic optimality of our solutions for a wide range of parameters.
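As a toy illustration of why monotonicity helps, consider the one-dimensional case, where a monotone classifier is simply a threshold; binary search then recovers it with O(log n) oracle calls instead of n. The example is ours, not the paper's algorithm.

```python
def learn_threshold(points, oracle):
    """points: sorted coordinates with a hidden monotone labelling (0s then 1s)."""
    lo, hi, calls = 0, len(points), 0
    while lo < hi:
        mid = (lo + hi) // 2
        calls += 1
        if oracle(points[mid]):            # label 1: threshold at or left of mid
            hi = mid
        else:                              # label 0: threshold right of mid
            lo = mid + 1
    return lo, calls                       # first index labelled 1

points = list(range(100))
oracle = lambda p: p >= 37                 # hidden labelling, played by a human
print(learn_threshold(points, oracle))     # (37, 7): 7 oracle calls, not 100
```

In d dimensions the points form a partial order rather than a chain, which is where the paper's cost-accuracy trade-offs come in.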
Citations: 11
Set Similarity Search for Skewed Data
Samuel McCauley, Jesper W. Mikkelsen, R. Pagh
Set similarity join, as well as the corresponding indexing problem set similarity search, are fundamental primitives for managing noisy or uncertain data. For example, these primitives can be used in data cleaning to identify different representations of the same object. In many cases one can represent an object as a sparse 0-1 vector, or equivalently as the set of nonzero entries in such a vector. A set similarity join can then be used to identify those pairs that have an exceptionally large dot product (or intersection, when viewed as sets). We choose to focus on identifying vectors with large Pearson correlation, but results extend to other similarity measures. In particular, we consider the indexing problem of identifying correlated vectors in a set S of vectors sampled from {0,1}^d. Given a query vector y and a parameter α ∈ (0,1), we need to search for an α-correlated vector x in a data structure representing the vectors of S. This kind of similarity search has been intensely studied in worst-case (non-random data) settings. Existing theoretically well-founded methods for set similarity search are often inferior to heuristics that take advantage of skew in the data distribution, i.e., widely differing frequencies of 1s across the d dimensions. The main contribution of this paper is to analyze the set similarity problem under a random data model that reflects the kind of skewed data distributions seen in practice, allowing theoretical results much stronger than what is possible in worst-case settings. Our indexing data structure is a recursive, data-dependent partitioning of vectors inspired by recent advances in set similarity search. Previous data-dependent methods do not seem to allow us to exploit skew in item frequencies, so we believe that our work sheds further light on the power of data dependence.
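The inverted-index idea that such skew-aware heuristics build on fits in a few lines; this is our simplification, not the paper's recursive data structure. Sets are indexed by their elements, and only pairs sharing an element are scored, so rare elements generate few candidate pairs.

```python
from collections import defaultdict

def similarity_join(sets, threshold):
    """Return (i, j, c) for all pairs with intersection size c >= threshold."""
    index = defaultdict(list)              # element -> ids of sets containing it
    for i, s in enumerate(sets):
        for e in s:
            index[e].append(i)
    scores = defaultdict(int)              # (i, j) -> |intersection|
    for ids in index.values():
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                scores[(ids[a], ids[b])] += 1
    return [(i, j, c) for (i, j), c in scores.items() if c >= threshold]

print(similarity_join([{1, 2, 3}, {2, 3, 4}, {7, 8}], threshold=2))  # [(0, 1, 2)]
```

Frequent elements make the inner loops quadratic, which is exactly the skew phenomenon the paper's random data model is designed to analyze.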
Citations: 9
Worst-Case Optimal Join Algorithms: Techniques, Results, and Open Problems
H. Ngo
Worst-case optimal join algorithms are the class of join algorithms whose runtime matches the worst-case output size of a given join query. While the first provably worst-case optimal join algorithm was discovered relatively recently, the techniques and results surrounding these algorithms grow out of decades of research from a wide range of areas, intimately connecting graph theory, algorithms, information theory, constraint satisfaction, database theory, and geometric inequalities. These ideas are not just paperware: in addition to academic project implementations, two variations of such algorithms are the work-horse join algorithms of commercial database and data analytics engines. This paper aims to be a brief introduction to the design and analysis of worst-case optimal join algorithms. We discuss the key techniques for proving runtime and output size bounds. We particularly focus on the fascinating connection between join algorithms and information theoretic inequalities, and the idea of how one can turn a proof into an algorithm. Finally, we conclude with a representative list of fundamental open problems in this area.
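As a taste of the recipe, here is a compact generic-join sketch, ours and unoptimized, for the triangle query Q(a,b,c) = R(a,b), S(b,c), T(a,c): each variable is bound in turn by intersecting the sets of values the relevant relations allow.

```python
def triangles(R, S, T):
    """Generic-join-style evaluation of R(a,b) JOIN S(b,c) JOIN T(a,c)."""
    out = []
    for a in {x for x, _ in R} & {x for x, _ in T}:      # bind a by intersection
        Rb = {b for x, b in R if x == a}
        for b in Rb & {x for x, _ in S}:                 # bind b by intersection
            Sc = {c for x, c in S if x == b}
            Tc = {c for x, c in T if x == a}
            for c in Sc & Tc:                            # bind c by intersection
                out.append((a, b, c))
    return out

R, S, T = [(1, 2), (1, 3)], [(2, 3), (3, 1)], [(1, 3)]
print(triangles(R, S, T))                                # [(1, 2, 3)]
```

With sorted or hashed indexes each intersection runs in time proportional to its smaller argument, which is what yields the AGM-bound runtime of O(N^{3/2}) for the triangle query.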
Citations: 48
Data Streams with Bounded Deletions
Rajesh Jayaram, David P. Woodruff
Two prevalent models in the data stream literature are the insertion-only and turnstile models. Unfortunately, many important streaming problems require a Θ(log(n)) multiplicative factor more space for turnstile streams than for insertion-only streams. This complexity gap often arises because the underlying frequency vector f is very close to 0, after accounting for all insertions and deletions to items. Signal detection in such streams is difficult, given the large number of deletions. In this work, we propose an intermediate model which, given a parameter α ≥ 1, lower bounds the norm ‖f‖_p by a 1/α-fraction of the L_p mass of the stream had all updates been positive. Here, for a vector f, ‖f‖_p = (∑_{i=1}^{n} |f_i|^p)^{1/p}, and the value of p we choose depends on the application. This gives a fluid medium between insertion-only streams (with α = 1) and turnstile streams (with α = poly(n)), and allows for analysis in terms of α. We show that for streams with this α-property, for many fundamental streaming problems we can replace a O(log(n)) factor in the space usage for algorithms in the turnstile model with a O(log(α)) factor. This is true for identifying heavy hitters, inner product estimation, L_0 estimation, L_1 estimation, L_1 sampling, and support sampling. For each problem, we give matching or nearly matching lower bounds for α-property streams. We note that in practice, many important turnstile data streams are in fact α-property streams for small values of α. For such applications, our results represent significant improvements in efficiency for all the aforementioned problems.
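For concreteness, the Count-Sketch below is the textbook turnstile sketch, not an algorithm from this paper: updates enter with random signs, so deletions are handled natively, and the α-property bounds how much of the inserted mass the deletions can cancel.

```python
import random

class CountSketch:
    """Basic Count-Sketch: unbiased frequency estimates under +/- updates."""
    def __init__(self, width, depth, seed=0):
        rnd = random.Random(seed)
        self.w, self.d = width, depth
        self.tables = [[0] * width for _ in range(depth)]
        self.salts = [rnd.random() for _ in range(depth)]

    def _bucket(self, item, row):
        h = hash((self.salts[row], item))
        return h % self.w, 1 if (h >> 16) & 1 else -1

    def update(self, item, delta):         # delta may be negative (deletion)
        for r in range(self.d):
            col, sign = self._bucket(item, r)
            self.tables[r][col] += sign * delta

    def estimate(self, item):
        ests = sorted(sign * self.tables[r][col]
                      for r in range(self.d)
                      for col, sign in [self._bucket(item, r)])
        return ests[len(ests) // 2]        # median of the d row estimates

cs = CountSketch(width=256, depth=5)
for _ in range(1000): cs.update("x", +1)
for _ in range(400):  cs.update("x", -1)   # turnstile deletions
print(cs.estimate("x"))                    # 600
```

The paper's O(log α) savings come from sketching at a coarser scale than such worst-case turnstile guarantees require.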
Citations: 17
Constant Delay Algorithms for Regular Document Spanners
F. Florenzano, Cristian Riveros, M. Ugarte, Stijn Vansummeren, D. Vrgoc
Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in quick succession, and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant delay enumeration of a spanner's output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable set automata, we also study how it can be applied when the spanner is specified by general variable set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner, providing a fine grained analysis of the classes of document spanners that support efficient enumeration of their results.
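As a miniature of what a regular document spanner computes, the sketch below uses a regex with named capture groups to map a document to a relation of spans. Python's re module is only a stand-in for variable-set automata and gives none of the paper's constant-delay guarantees; the document and pattern are invented.

```python
import re

doc = "Ada Lovelace <ada@example.org>, Alan Turing <alan@example.org>"
spanner = re.compile(r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) <(?P<email>[^>]+)>")

# each match is one output tuple, assigning a span of doc to every variable
for m in spanner.finditer(doc):
    print({v: m.span(v) for v in ("name", "email")})
    # e.g. {'name': (0, 12), 'email': (14, 29)}
```

Counting the number of such tuples without materializing them is the counting problem studied in the paper's final part.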
Citations: 35