Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems: Latest Publications

Algorithmic Techniques for Independent Query Sampling

Pub Date : 2022-06-12 DOI: 10.1145/3517804.3526068

Yufei Tao

Unlike a reporting query that returns all the elements satisfying a predicate, query sampling returns only a sample set of those elements and has long been recognized as an important method in database systems. PODS'14 saw the introduction of independent query sampling (IQS), which extends traditional query sampling with the requirement that the sample outputs of all the queries be mutually independent. The new requirement improves the precision of query estimation, facilitates the execution of randomized algorithms, and enhances the fairness and diversity of query answers. IQS calls for new index structures because conventional indexes are designed to report complete query answers and thus become too expensive for extracting only a few random samples. The phenomenon has created an exciting opportunity to revisit the structure for every reporting query known in computer science. There has been considerable progress since 2014 in this direction. This paper distills the existing solutions into several generic techniques that, when put together, can be utilized to solve a great variety of IQS problems with attractive performance guarantees.
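
To make the contrast with reporting queries concrete, here is a minimal sketch of independent sampling for the simplest case, a one-dimensional range over a static sorted array. It is an illustration under assumptions and is not taken from the paper: the class name `RangeSampler` and its method are hypothetical, and samples are drawn with replacement. Each query costs two binary searches plus one random draw per sample, regardless of how many elements fall in the range, and fresh randomness on every call keeps the outputs of different queries independent.

```python
import bisect
import random


class RangeSampler:
    """Hypothetical index for independent sampling from a 1-D range."""

    def __init__(self, values):
        # Sorting once lets every query locate its range by binary search.
        self.values = sorted(values)

    def sample(self, lo, hi, k, rng=random):
        """Return k samples, drawn with replacement, from the values in [lo, hi]."""
        left = bisect.bisect_left(self.values, lo)
        right = bisect.bisect_right(self.values, hi)
        if left == right:
            return []  # no element satisfies the predicate
        # Each draw uses fresh randomness, so separate calls are independent.
        return [self.values[rng.randrange(left, right)] for _ in range(k)]
```

A reporting query on the same array would need time proportional to the number of qualifying elements, whereas `sample` never touches more than `k` of them.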

Citations: 6

Document Spanners - A Brief Overview of Concepts, Results, and Recent Developments

Pub Date : 2022-06-12 DOI: 10.1145/3517804.3526069

Markus L. Schmid, Nicole Schweikardt

The information extraction framework of document spanners was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, J. ACM 2015) as a formalisation of the query language AQL, which is used in IBM's information extraction engine SystemT. Since 2013, this framework has been investigated in depth by the principles of database management community and beyond. The present paper gives a brief overview of concepts, results, and recent developments concerning document spanners.

Citations: 3

Estimation of the Size of Union of Delphic Sets: Achieving Independence from Stream Size

Pub Date : 2022-06-12 DOI: 10.1145/3517804.3526222

Kuldeep S. Meel, Sourav Chakraborty, N. V. Vinodchandran

Given a family of sets $(S_1, S_2, \ldots, S_M)$ over a universe $\Omega$, estimating the size of their union in the data streaming model is a fundamental computational problem with a wide variety of applications. The holy grail in the field of streaming is to design algorithms that achieve $(\varepsilon, \delta)$-approximation with $\mathrm{poly}(\log|\Omega|, \varepsilon^{-1}, \log \delta^{-1})$ space and update time complexity. Earlier investigations achieve algorithms with the desired space and update time complexity for restricted cases such as singletons (the Distinct Elements problem), one-dimensional ranges, arithmetic progressions, and sub-cubes. However, the techniques used in these works fail for many other simple structured sets. A prominent example is Klee's Measure Problem (KMP), wherein every set $S_i$ is represented by an axis-parallel rectangle in $d$-dimensional space. Despite extensive prior work, the best-known streaming algorithms for many of these cases depend on the size of the stream, and therefore the problem of whether there exists a streaming algorithm for estimating the size of the union of sets with $\mathrm{poly}(\log|\Omega|, \varepsilon^{-1}, \log \delta^{-1})$ space and update time complexity has remained open. In this work, we focus on certain general families of sets called Delphic families (which allow efficient membership, sampling, and cardinality queries). Such families of sets capture several well-known problems, including KMP, test coverage, and hypervolume estimation. The primary contribution of our work is to resolve the above-mentioned open problem for streams over Delphic families. In particular, we design the first streaming algorithm for estimating $|\bigcup_{i=1}^{M} S_i|$ with $\mathrm{poly}(\log|\Omega|, \varepsilon^{-1}, \log \delta^{-1})$ space and update time complexity (independent of $M$, the length of the stream) when each $S_i$ is a member of a Delphic family of sets. We further generalize our results to larger families of sets, called approximate-Delphic families, for which the size of a set can be known approximately but not exactly. Our results resolve two of the open problems listed in Meel, Vinodchandran, Chakraborty (PODS-21).

Citations: 2

Towards Theory for Real-World Data

Pub Date : 2022-06-12 DOI: 10.1145/3517804.3526066

W. Martens

Fundamental research on data manipulation languages is often motivated by the search for balance between desirable properties, such as expressiveness, robustness, compositionality, the existence of efficient algorithms, etc. Real-world data can be helpful for this search in many different respects. Data sets may exhibit common structures that efficient algorithms can exploit. Query logs and schemas can give us an idea of single features that are used very often, or groups of features that are frequently used together. In this sense, they can guide us towards features or fragments of data manipulation languages that are common in practice and may therefore be worthy of deeper study. In other cases, we may even get a glimpse of features that are not well understood by users, which may inspire us to redesign them or develop tools that increase their ease of use. This tutorial aims to provide, first of all, an overview of several practical studies that have been conducted in the areas of tree-structured and graph-structured data, with a focus on cases with strong interaction between analysis of the data and fundamental research. Second, it aims to provide a set of lessons learned after the investigation of some large-scale logs consisting of more than 850 million queries.

Citations: 2

Non-Uniformly Terminating Chase: Size and Complexity

Pub Date : 2022-04-22 DOI: 10.1145/3517804.3524146

M. Calautti, G. Gottlob, Andreas Pieris

The chase procedure, originally introduced for checking implication of database constraints, and later used for computing data exchange solutions, has recently become a central algorithmic tool in rule-based ontological reasoning. In this context, a key problem is non-uniform chase termination: does the chase of a database w.r.t. a rule-based ontology terminate? And if this is the case, what is the size of the result of the chase? We focus on guarded tuple-generating dependencies (TGDs), which form a robust rule-based ontology language, and study the above central questions for the semi-oblivious version of the chase. One of our main findings is that non-uniform semi-oblivious chase termination for guarded TGDs is feasible in polynomial time w.r.t. the database, and the size of the result of the chase (whenever it is finite) is linear w.r.t. the database. Towards our results concerning non-uniform chase termination, we show that basic techniques such as simplification and linearization, originally introduced in the context of ontological query answering, can be safely applied to the chase termination problem.

Citations: 4

Uniform Operational Consistent Query Answering

Pub Date : 2022-04-22 DOI: 10.1145/3517804.3526230

M. Calautti, Ester Livshits, Andreas Pieris, Markus Schneider

Operational consistent query answering (CQA) is a recent framework for CQA, based on revised definitions of repairs and consistent answers, which opens up the possibility of efficient approximations with explicit error guarantees. The main idea is to iteratively apply operations (e.g., fact deletions), starting from an inconsistent database, until we reach a database that is consistent w.r.t. the given set of constraints. This gives us the flexibility of choosing the probability with which we apply an operation, which in turn allows us to calculate the probability of an operational repair, and thus, the probability with which a consistent answer is entailed. A natural way of assigning probabilities to operations is by targeting the uniform probability distribution over a reasonable space such as the set of operational repairs, the set of sequences of operations that lead to an operational repair, and the set of available operations at a certain step of the repairing process. This leads to what we generally call uniform operational CQA. The goal of this work is to perform a data complexity analysis of both exact and approximate uniform operational CQA, focusing on functional dependencies (and subclasses thereof), and conjunctive queries. The main outcome of our analysis (among other positive and negative results) is that uniform operational CQA pushes the efficiency boundaries further by ensuring the existence of efficient approximation schemes in scenarios that go beyond the simple case of primary keys, which seems to be the limit of the classical approach to CQA.
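
The iterative repairing process described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the only constraint is a key on the first position of each fact, the only operation is deleting a single fact involved in a violation, and at each step one available operation is chosen uniformly at random. The function names and the Monte Carlo estimator are hypothetical.

```python
import random


def violating_facts(facts):
    """Facts that share their key (first position) with some other fact."""
    by_key = {}
    for fact in facts:
        by_key.setdefault(fact[0], []).append(fact)
    return [f for group in by_key.values() if len(group) > 1 for f in group]


def random_operational_repair(facts, rng):
    """Delete uniformly chosen violating facts until the key is satisfied."""
    db = set(facts)
    while True:
        # Sorting makes the run reproducible for a fixed seed.
        candidates = sorted(violating_facts(db))
        if not candidates:
            return db
        db.remove(rng.choice(candidates))


def estimate_answer_probability(facts, query, trials=1000, seed=0):
    """Monte Carlo estimate of the probability that `query` holds in a repair."""
    rng = random.Random(seed)
    hits = sum(bool(query(random_operational_repair(facts, rng)))
               for _ in range(trials))
    return hits / trials
```

For a database where `alice` has two conflicting cities and `bob` has one, a query asking whether `bob` lives in his recorded city is estimated at probability 1, while a query about one of `alice`'s cities comes out near one half.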

Citations: 3

The White-Box Adversarial Data Stream Model

Pub Date : 2022-04-19 DOI: 10.1145/3517804.3526228

M. Ajtai, V. Braverman, T. S. Jayram, Sandeep Silwal, Alec Sun, David P. Woodruff, Samson Zhou

There has been a flurry of recent literature studying streaming algorithms for which the input stream is chosen adaptively by a black-box adversary who observes the output of the streaming algorithm at each time step. However, these algorithms fail when the adversary has access to the internal state of the algorithm, rather than just the output of the algorithm. We study streaming algorithms in the white-box adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that nontrivial algorithms are still possible. We first give a randomized algorithm for the $L_1$-heavy hitters problem that outperforms the optimal deterministic Misra-Gries algorithm on long streams. If the white-box adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our $L_1$-heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design is for estimating the number of distinct elements in a stream with insertions and deletions achieving a multiplicative approximation and sublinear space; such an algorithm is impossible for deterministic algorithms. We also give a general technique that translates any two-player deterministic communication lower bound to a lower bound for randomized algorithms robust to a white-box adversary. In particular, our results show that for all $p \ge 0$, there exists a constant $C_p > 1$ such that any $C_p$-approximation algorithm for $F_p$ moment estimation in insertion-only streams with a white-box adversary requires $\Omega(n)$ space for a universe of size $n$. Similarly, there is a constant $C > 1$ such that any $C$-approximation algorithm in an insertion-only stream for matrix rank requires $\Omega(n)$ space with a white-box adversary. These results do not contradict our upper bounds since they assume the adversary has unbounded computational power. Our algorithmic results based on cryptography thus show a separation between computationally bounded and unbounded adversaries. Finally, we prove a lower bound of $\Omega(\log n)$ bits for the fundamental problem of deterministic approximate counting in a stream of 0s and 1s, which holds even if we know how many total stream updates we have seen so far at each point in the stream. Such a lower bound for approximate counting with additional information was previously unknown, and in our context, it shows a separation between multiplayer deterministic maximum communication and the white-box space complexity of a streaming algorithm.
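
For readers unfamiliar with the deterministic baseline named above, the following is a minimal sketch of the classical Misra-Gries summary; it is the textbook algorithm, not the randomized algorithm from the paper. The function name and the parameter `k` are choices made for this illustration. With at most `k - 1` counters over a stream of length `n`, every item occurring more than `n / k` times is retained, and each stored count undercounts the true frequency by at most `n / k`. Because the algorithm is deterministic, revealing its state to an adversary gives away nothing that the adversary could not compute anyway.

```python
def misra_gries(stream, k):
    """Return a summary with at most k - 1 counters for the given stream."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

On the stream `a b a c a a d a` with `k = 3`, the item `a` (5 of 8 occurrences) survives in the summary with a count between its true frequency minus `8 / 3` and its true frequency.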

Citations: 10

Approximately Counting Subgraphs in Data Streams

Pub Date : 2022-03-27 DOI: 10.1145/3517804.3524145

Hendrik Fichtenberger, Pan Peng

Estimating the number of subgraphs in data streams is a fundamental problem that has received great attention in the past decade. In this paper, we give improved streaming algorithms for approximately counting the number of occurrences of an arbitrary subgraph $H$, denoted $\#H$, when the input graph $G$ is represented as a stream of $m$ edges. To obtain our algorithms, we provide a generic transformation that converts constant-round sublinear-time graph algorithms in the query access model to constant-pass sublinear-space graph streaming algorithms. Using this transformation, we obtain the following results.
• We give a 3-pass turnstile streaming algorithm for $(1 \pm \varepsilon)$-approximating $\#H$ in $\tilde{O}(m^{\rho(H)} / (\varepsilon^2 \cdot \#H))$ space, where $\rho(H)$ is the fractional edge-cover of $H$. This improves upon and generalizes a result of McGregor et al. [PODS 2016], who gave a 3-pass insertion-only streaming algorithm for $(1 \pm \varepsilon)$-approximating the number $\#T$ of triangles in $\tilde{O}(m^{3/2} / (\varepsilon^2 \cdot \#T))$ space if the algorithm is given additional oracle access to the degrees.
• We provide a constant-pass streaming algorithm for $(1 \pm \varepsilon)$-approximating $\#K_r$ in $\tilde{O}(m \lambda^{r-2} / (\varepsilon^2 \cdot \#K_r))$ space for any $r \ge 3$, in a graph $G$ with degeneracy $\lambda$, where $K_r$ is a clique on $r$ vertices. This resolves a conjecture by Bera and Seshadhri [PODS 2020].
More generally, our reduction relates the adaptivity of a query algorithm to the pass complexity of a corresponding streaming algorithm, and it is applicable to all algorithms in standard sublinear-time graph query models, e.g., the (augmented) general model.

Citations: 1

A Dichotomy in Consistent Query Answering for Primary Keys and Unary Foreign Keys

Pub Date : 2022-03-25 DOI: 10.1145/3517804.3524157

Miika Hannula, J. Wijsen

Since 2005, significant progress has been made in the problem of Consistent Query Answering (CQA) with respect to primary keys. In this problem, the input is a database instance that may violate one or more primary key constraints. A repair is defined as a maximal subinstance that satisfies all primary keys. Given a Boolean query q, the question then is whether q holds true in every repair. So far, theoretical research in this field has not addressed the combination of primary key and foreign key constraints, despite the importance of referential integrity in database systems. This paper addresses the problem of CQA with respect to both primary keys and foreign keys. In this setting, it is natural to adopt the notion of symmetric-difference repairs, because foreign keys can be repaired by inserting new tuples. We consider the case where foreign keys are unary, and queries are conjunctive queries without self-joins. In this setting, we characterize the boundary between those CQA problems that admit a consistent first-order rewriting, and those that do not.
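
The primary-key part of the definitions above can be illustrated with a brute-force sketch. It assumes a single relation stored as a set of tuples whose first `key_len` positions form the primary key, so that a repair keeps exactly one fact from each group of facts agreeing on the key. Foreign keys and symmetric-difference repairs are not modeled, and the function names are hypothetical. The enumeration is exponential in the number of conflicting groups, which is why a consistent first-order rewriting, when one exists, is so valuable.

```python
from itertools import product


def repairs(facts, key_len):
    """Yield every repair: one fact chosen from each group sharing a key."""
    groups = {}
    for fact in facts:
        groups.setdefault(fact[:key_len], []).append(fact)
    for choice in product(*groups.values()):
        yield set(choice)


def certain(facts, key_len, query):
    """True if the Boolean query holds in every repair."""
    return all(query(repair) for repair in repairs(facts, key_len))
```

With facts `("alice", "paris")`, `("alice", "rome")`, and `("bob", "paris")` keyed on the name, the query "somebody lives in paris" is certain, while "alice lives in paris" is not, because one repair keeps the `rome` fact instead.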

Citations: 5

Efficiently Enumerating Answers to Ontology-Mediated Queries

Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2022-03-17 DOI: 10.1145/3517804.3524166

C. Lutz, Marcin Przybylko

We study the enumeration of answers to ontology-mediated queries (OMQs) where the ontology is a set of guarded TGDs or formulated in the description logic ELI and the query is a conjunctive query (CQ). In addition to the traditional notion of an answer, we propose and study two novel notions of partial answers that can take into account nulls generated by existential quantifiers in the ontology. Our main result is that enumeration of the traditional complete answers and of both kinds of partial answers is possible with linear-time preprocessing and constant delay for OMQs that are both acyclic and free-connex acyclic. We also provide partially matching lower bounds. Similar results are obtained for the related problems of testing a single answer in linear time and of testing multiple answers in constant time after linear time preprocessing. In both cases, the border between tractability and intractability is characterized by similar, but slightly different acyclicity properties.

Citations: 3
