Tight Fine-Grained Bounds for Direct Access on Join Queries
K. Bringmann, Nofar Carmeli, S. Mengel
https://doi.org/10.1145/3517804.3526234
We consider the task of lexicographic direct access to query answers. That is, we want to simulate an array containing the answers of a join query, sorted in a lexicographic order chosen by the user. A recent dichotomy showed for which queries and orders this task can be done in polylogarithmic access time after quasilinear preprocessing, but it does not tell us how much time is required in the cases classified as hard. We determine the preprocessing time needed to achieve polylogarithmic access time for all self-join-free queries and all lexicographic orders. To this end, we propose a decomposition-based general algorithm for direct access on join queries. We then explore its optimality by proving lower bounds on the preprocessing time based on the hardness of a certain online Set-Disjointness problem, which shows that our algorithm's bounds are tight for all lexicographic orders on self-join-free queries. We then prove the hardness of Set-Disjointness based on the Zero-Clique Conjecture, an established conjecture from fine-grained complexity theory. We also show that similar techniques can be used to prove that, for enumerating the answers to Loomis-Whitney joins, it is not possible to significantly improve upon trivially computing all answers during preprocessing. This, in turn, gives further evidence (based on the Zero-Clique Conjecture) for the enumeration hardness of self-join-free cyclic joins with respect to linear preprocessing and constant delay.
Efficient Enumeration for Annotated Grammars
Antoine Amarilli, Louis Jachiet, Martin Muñoz, Cristian Riveros
https://doi.org/10.1145/3517804.3526232
We introduce annotated grammars, an extension of context-free grammars which allows annotations on terminals. Our model extends the standard notion of regular spanners, and is more expressive than the extraction grammars recently introduced by Peterfreund. We study the enumeration problem for annotated grammars: fixing a grammar, and given a string as input, enumerate all annotations of the string that form a word derivable from the grammar. Our first result is an algorithm for unambiguous annotated grammars, which preprocesses the input string in cubic time and enumerates all annotations with output-linear delay. This improves over Peterfreund's result, which needs quintic time preprocessing to achieve this delay bound. We then study how we can reduce the preprocessing time while keeping the same delay bound, by making additional assumptions on the grammar. Specifically, we present a class of grammars which only have one derivation shape for all outputs, for which we can enumerate with quadratic time preprocessing. We also give classes that generalize regular spanners for which linear time preprocessing suffices.
Determinacy of Real Conjunctive Queries. The Boolean Case
J. Kwiecień, J. Marcinkowski, Piotr Ostropolski-Nalewaja
https://doi.org/10.1145/3517804.3524168
In their classical 1993 paper, Chaudhuri and Vardi observed that some fundamental database-theory results and techniques fail to survive when we view query answers as bags (multisets) of tuples rather than as sets of tuples. Disappointingly, almost 30 years later, bag-semantics-based database theory is still in its infancy: we do not even know whether conjunctive query containment is decidable. This is not due to a lack of interest, but because, in the multiset world, everything suddenly becomes discouragingly complicated. In this paper we re-examine, in the bag-semantics scenario, the query determinacy problem, which has recently been intensively studied in the set-semantics scenario. We show that query determinacy (under bag semantics) is decidable for Boolean conjunctive queries and undecidable for unions of such queries (in contrast to the set-semantics scenario, where the UCQ case remains decidable even for unary queries). We also show that, surprisingly, for path queries determinacy under bag semantics coincides with determinacy under set semantics (and is thus decidable).
Randomize the Future: Asymptotically Optimal Locally Private Frequency Estimation Protocol for Longitudinal Data
O. Ohrimenko, Anthony Wirth, Hao Wu
https://doi.org/10.1145/3517804.3526226
Longitudinal data tracking under Local Differential Privacy (LDP) is a challenging task. Baseline solutions that repeatedly invoke a protocol designed for one-time computation lead to linear decay in the privacy or utility guarantee with respect to the number of computations. To avoid this, the recent approach of Erlingsson et al. (2020) exploits the potential sparsity of user data that changes only infrequently. Their protocol targets the fundamental problem of frequency estimation for longitudinal binary data, with ℓ∞ error of O((1/ε) · (log d)^{3/2} · k · √(n · log(d/β))), where ε is the privacy budget, d is the number of time periods, k is the maximum number of changes of user data, and β is the failure probability. Notably, the error bound scales polylogarithmically with d, but linearly with k. In this paper, we break through the linear dependence on k in the estimation error. Our new protocol has error O((1/ε) · log d · √(k · n · log(d/β))), matching the lower bound up to a logarithmic factor. The protocol is an online one that outputs an estimate at each time period. The key breakthrough is a new randomizer for sequential data, FutureRand, with two key features. The first is a composition strategy that correlates the noise across the non-zero elements of the sequence. The second is a pre-computation technique which, by exploiting the symmetry of the input space, enables the randomizer to output results on the fly, without knowing future inputs. Our protocol closes the error gap between existing online and offline algorithms.
Lower Bounds for Sparse Oblivious Subspace Embeddings
Yi Li, Mingmou Liu
https://doi.org/10.1145/3517804.3526224
An oblivious subspace embedding (OSE), characterized by parameters m, n, d, ε, δ, is a random matrix Π ∈ R^{m×n} such that for any d-dimensional subspace T ⊆ R^n, Pr_Π[∀x ∈ T: (1−ε)‖x‖_2 ≤ ‖Πx‖_2 ≤ (1+ε)‖x‖_2] ≥ 1−δ. For ε and δ at most a small constant, we show that any OSE with one nonzero entry in each column must satisfy m = Ω(d^2/(ε^2 δ)), establishing the optimality of the classical Count-Sketch matrix. When an OSE has 1/(9ε) nonzero entries in each column, we show it must hold that m = Ω(ε^{O(δ)} · d^2), improving on the previous Ω(ε^2 · d^2) lower bound due to Nelson and Nguyen (ICALP 2014).
Counting Database Repairs Entailing a Query: The Case of Functional Dependencies
M. Calautti, Ester Livshits, Andreas Pieris, Markus Schneider
https://doi.org/10.1145/3517804.3524147
A key task in the context of consistent query answering is to count the number of repairs that entail the query, with the ultimate goal being a precise data complexity classification. This has been achieved in the case of primary keys and self-join-free conjunctive queries (CQs) via an FP/#P-complete dichotomy. We lift this result to the more general case of functional dependencies (FDs). Another important task in this context is, whenever the counting problem in question is intractable, to classify it as approximable (i.e., the target value can be efficiently approximated with error guarantees via a fully polynomial-time randomized approximation scheme (FPRAS)) or as inapproximable. Although for primary keys and CQs (even with self-joins) the problem is always approximable, we prove that this is not the case for FDs. We show, however, that the class of FDs with a left-hand-side chain forms an island of approximability. We see these results, apart from being interesting in their own right, as crucial steps towards a complete classification of approximate counting of repairs in the case of FDs and self-join-free CQs.
The Complexity of Conjunctive Queries with Degree 2
Matthias Lanzinger
https://doi.org/10.1145/3517804.3524152
It is well known that the tractability of conjunctive query answering can be characterised in terms of treewidth when the problem is restricted to queries of bounded arity. We show that a similar characterisation also exists for classes of queries with unbounded arity and degree 2. To do so, we introduce hypergraph dilutions as an alternative to primal graph minors for studying substructures of hypergraphs. Using dilutions, we observe an analogue of the Excluded Grid Theorem for degree-2 hypergraphs. In consequence, we show that the tractability of conjunctive query answering can be characterised in terms of generalised hypertree width. A similar characterisation is also shown for the corresponding counting problem. We also generalise our main structural result to arbitrary bounded degree and discuss possible paths towards a characterisation of tractable conjunctive query answering in the bounded-degree case.
Truly Perfect Samplers for Data Streams and Sliding Windows
Rajesh Jayaram, David P. Woodruff, Samson Zhou
https://doi.org/10.1145/3517804.3524139
In the G-sampling problem, the goal is to output an index i of a vector f ∈ R^n such that, for all coordinates j ∈ [n], Pr[i = j] = (1 ± ε) · G(f_j)/(∑_{k∈[n]} G(f_k)) + γ, where G: R → R_{≥0} is some non-negative function. If ε = 0 and γ = 1/poly(n), the sampler is called perfect. In the data stream model, f is defined implicitly by a sequence of updates to its coordinates, and the goal is to design such a sampler in small space. Jayaram and Woodruff (FOCS 2018) gave the first perfect Lp samplers in turnstile streams, where G(x) = |x|^p, using polylog(n) space for p ∈ (0, 2]. However, to date, all known sampling algorithms are not truly perfect, since their output distribution is only point-wise γ = 1/poly(n) close to the true distribution. This small error can be significant when samplers are run many times on successive portions of a stream, and can leak potentially sensitive information about the data stream. In this work, we initiate the study of truly perfect samplers, with ε = γ = 0, and comprehensively investigate their complexity in the data stream and sliding window models. We begin by showing that sublinear-space truly perfect sampling is impossible in the turnstile model, by proving a lower bound of Ω(min(n, log(1/γ))) for any G-sampler with point-wise error γ from the true distribution. We then give a general time-efficient sublinear-space framework for developing truly perfect samplers in the insertion-only streaming and sliding window models. As specific applications, our framework addresses Lp sampling for all p > 0 (e.g., Õ(n^{1−1/p}) space for p ≥ 1), concave functions, and a large number of measure functions, including the L1−L2, Fair, Huber, and Tukey estimators. The update time of our truly perfect Lp-samplers is O(1), which is an exponential improvement over the running time of previous perfect Lp-samplers.
High Dimensional Differentially Private Stochastic Optimization with Heavy-tailed Data
Lijie Hu, Shuo Ni, Hanshen Xiao, Di Wang
https://doi.org/10.1145/3517804.3524144
As one of the most fundamental problems in machine learning, statistics and differential privacy, Differentially Private Stochastic Convex Optimization (DP-SCO) has been extensively studied in recent years. However, most previous work can only handle either regular data distributions or irregular data in the low-dimensional case. To better understand the challenges arising from irregular data distributions, in this paper we provide the first study of DP-SCO with heavy-tailed data in the high-dimensional setting. In the first part we focus on the problem over a polytope constraint (such as the ℓ1-norm ball). We show that if the loss function is smooth and its gradient has bounded second-order moment, it is possible to get a (high-probability) error bound (excess population risk) of Õ(log d/(nε)^{1/3}) in the ε-DP model, where n is the sample size and d is the dimension of the underlying space. Next, for LASSO, if the data distribution has bounded fourth-order moments, we improve the bound to Õ(log d/(nε)^{2/5}) in the (ε, δ)-DP model. In the second part of the paper, we study sparse learning with heavy-tailed data. We first revisit the sparse linear model and propose a truncated DP-IHT method whose output achieves an error of Õ((s*^2 log^2 d)/(nε)), where s* is the sparsity of the underlying parameter. Then we study a more general problem over the sparsity (i.e., ℓ0-norm) constraint, and show that it is possible to achieve an error of Õ((s*^{3/2} log d)/(nε)), which is also near-optimal up to a factor of Õ(√s*), if the loss function is smooth and strongly convex.
The Complexity of Boolean Conjunctive Queries with Intersection Joins
Mahmoud Abo Khamis, George Chichirim, Antonia Kormpa, Dan Olteanu
https://doi.org/10.1145/3517804.3524156
Intersection joins over interval data are relevant in spatial and temporal data settings. A set of intervals join if their intersection is non-empty. In the case of point intervals, the intersection join becomes the standard equality join. We establish the complexity of Boolean conjunctive queries with intersection joins by a many-one equivalence to disjunctions of Boolean conjunctive queries with equality joins. The complexity of any query with intersection joins is that of the hardest query with equality joins in the disjunction exhibited by our equivalence. This is captured by a new width measure called the ij-width. We also introduce a new syntactic notion of acyclicity, called iota-acyclicity, to characterise the class of Boolean queries with intersection joins that admit linear-time computation modulo a poly-logarithmic factor in the data size. Iota-acyclicity is for intersection joins what alpha-acyclicity is for equality joins. It sits strictly between gamma-acyclicity and Berge-acyclicity. The intersection-join queries that are not iota-acyclic are at least as hard as the Boolean triangle query with equality joins, which is widely considered not computable in linear time.