Tight Fine-Grained Bounds for Direct Access on Join Queries
K. Bringmann, Nofar Carmeli, S. Mengel
https://doi.org/10.1145/3517804.3526234
We consider the task of lexicographic direct access to query answers. That is, we want to simulate an array containing the answers of a join query, sorted in a lexicographic order chosen by the user. A recent dichotomy showed for which queries and orders this task can be done in polylogarithmic access time after quasilinear preprocessing, but it does not tell us how much time is required in the cases classified as hard. We determine the preprocessing time needed to achieve polylogarithmic access time for all self-join-free queries and all lexicographic orders. To this end, we propose a decomposition-based general algorithm for direct access on join queries. We then explore its optimality by proving lower bounds on the preprocessing time based on the hardness of a certain online Set-Disjointness problem, which shows that our algorithm's bounds are tight for all lexicographic orders on self-join-free queries. We then prove the hardness of Set-Disjointness based on the Zero-Clique Conjecture, an established conjecture from fine-grained complexity theory. We also show that similar techniques can be used to prove that, for enumerating the answers to Loomis-Whitney joins, it is not possible to significantly improve upon trivially computing all answers during preprocessing. This, in turn, gives further evidence (based on the Zero-Clique Conjecture) for the enumeration hardness of self-join-free cyclic joins with respect to linear preprocessing and constant delay.
Efficient Enumeration for Annotated Grammars
Antoine Amarilli, Louis Jachiet, Martin Muñoz, Cristian Riveros
https://doi.org/10.1145/3517804.3526232
We introduce annotated grammars, an extension of context-free grammars which allows annotations on terminals. Our model extends the standard notion of regular spanners, and is more expressive than the extraction grammars recently introduced by Peterfreund. We study the enumeration problem for annotated grammars: fixing a grammar, and given a string as input, enumerate all annotations of the string that form a word derivable from the grammar. Our first result is an algorithm for unambiguous annotated grammars, which preprocesses the input string in cubic time and enumerates all annotations with output-linear delay. This improves over Peterfreund's result, which needs quintic time preprocessing to achieve this delay bound. We then study how we can reduce the preprocessing time while keeping the same delay bound, by making additional assumptions on the grammar. Specifically, we present a class of grammars which only have one derivation shape for all outputs, for which we can enumerate with quadratic time preprocessing. We also give classes that generalize regular spanners for which linear time preprocessing suffices.
Determinacy of Real Conjunctive Queries. The Boolean Case
J. Kwiecień, J. Marcinkowski, Piotr Ostropolski-Nalewaja
https://doi.org/10.1145/3517804.3524168
In their classical 1993 paper, Chaudhuri and Vardi observed that some fundamental database-theory results and techniques fail to survive when we view query answers as bags (multisets) of tuples rather than as sets of tuples. Disappointingly, almost 30 years later, bag-semantics-based database theory is still in its infancy: we do not even know whether conjunctive query containment is decidable. This is not due to a lack of interest, but because, in the multiset world, everything suddenly becomes discouragingly complicated. In this paper we re-examine, in the bag-semantics scenario, the query determinacy problem, which has recently been intensively studied in the set-semantics scenario. We show that query determinacy (under bag semantics) is decidable for Boolean conjunctive queries and undecidable for unions of such queries (in contrast to the set-semantics scenario, where the UCQ case remains decidable even for unary queries). We also show that, surprisingly, for path queries determinacy under bag semantics coincides with determinacy under set semantics (and is thus decidable).
Randomize the Future: Asymptotically Optimal Locally Private Frequency Estimation Protocol for Longitudinal Data
O. Ohrimenko, Anthony Wirth, Hao Wu
https://doi.org/10.1145/3517804.3526226
Longitudinal data tracking under Local Differential Privacy (LDP) is a challenging task. Baseline solutions that repeatedly invoke a protocol designed for one-time computation lead to linear decay in the privacy or utility guarantee with respect to the number of computations. To avoid this, the recent approach of Erlingsson et al. (2020) exploits the potential sparsity of user data that changes only infrequently. Their protocol targets the fundamental problem of frequency estimation for longitudinal binary data, with ℓ∞ error of O((1/ε) · (log d)^{3/2} · k · √(n · log(d/β))), where ε is the privacy budget, d is the number of time periods, k is the maximum number of changes of user data, and β is the failure probability. Notably, the error bound scales polylogarithmically with d, but linearly with k. In this paper, we break through the linear dependence on k in the estimation error. Our new protocol has error O((1/ε) · log d · √(k · n · log(d/β))), matching the lower bound up to a logarithmic factor. The protocol is an online one that outputs an estimate at each time period. The key breakthrough is a new randomizer for sequential data, FutureRand, with two key features. The first is a composition strategy that correlates the noise across the non-zero elements of the sequence. The second is a pre-computation technique which, by exploiting the symmetry of the input space, enables the randomizer to output results on the fly, without knowing future inputs. Our protocol closes the error gap between existing online and offline algorithms.
Lower Bounds for Sparse Oblivious Subspace Embeddings
Yi Li, Mingmou Liu
https://doi.org/10.1145/3517804.3526224
An oblivious subspace embedding (OSE), characterized by parameters m, n, d, ε, δ, is a random matrix Π ∈ R^{m×n} such that for any d-dimensional subspace T ⊆ R^n, Pr_Π[∀x ∈ T: (1−ε)‖x‖_2 ≤ ‖Πx‖_2 ≤ (1+ε)‖x‖_2] ≥ 1−δ. For ε and δ at most a small constant, we show that any OSE with one nonzero entry in each column must satisfy m = Ω(d^2/(ε^2 δ)), establishing the optimality of the classical Count-Sketch matrix. When an OSE has 1/(9ε) nonzero entries in each column, we show it must hold that m = Ω(ε^{O(δ)} · d^2), improving on the previous Ω(ε^2 · d^2) lower bound due to Nelson and Nguyen (ICALP 2014).
Counting Database Repairs Entailing a Query: The Case of Functional Dependencies
M. Calautti, Ester Livshits, Andreas Pieris, Markus Schneider
https://doi.org/10.1145/3517804.3524147
A key task in the context of consistent query answering is to count the number of repairs that entail the query, with the ultimate goal being a precise data complexity classification. This has been achieved in the case of primary keys and self-join-free conjunctive queries (CQs) via an FP/#P-complete dichotomy. We lift this result to the more general case of functional dependencies (FDs). Another important task in this context is, whenever the counting problem in question is intractable, to classify it as approximable (i.e., the target value can be efficiently approximated with error guarantees via a fully polynomial-time randomized approximation scheme (FPRAS)) or as inapproximable. Although for primary keys and CQs (even with self-joins) the problem is always approximable, we prove that this is not the case for FDs. We show, however, that the class of FDs with a left-hand-side chain forms an island of approximability. We see these results, apart from being interesting in their own right, as crucial steps towards a complete classification of approximate counting of repairs in the case of FDs and self-join-free CQs.
The Complexity of Conjunctive Queries with Degree 2
Matthias Lanzinger
https://doi.org/10.1145/3517804.3524152
It is well known that the tractability of conjunctive query answering can be characterised in terms of treewidth when the problem is restricted to queries of bounded arity. We show that a similar characterisation also exists for classes of queries with unbounded arity and degree 2. To do so, we introduce hypergraph dilutions as an alternative to primal graph minors for studying substructures of hypergraphs. Using dilutions, we observe an analogue of the Excluded Grid Theorem for degree-2 hypergraphs. In consequence, we show that the tractability of conjunctive query answering can be characterised in terms of generalised hypertree width. A similar characterisation is also shown for the corresponding counting problem. We also generalise our main structural result to arbitrary bounded degree and discuss possible paths towards a characterisation of tractable conjunctive query answering in the bounded-degree case.
Truly Perfect Samplers for Data Streams and Sliding Windows
Rajesh Jayaram, David P. Woodruff, Samson Zhou
https://doi.org/10.1145/3517804.3524139
In the G-sampling problem, the goal is to output an index i of a vector f ∈ R^n such that, for all coordinates j ∈ [n], Pr[i = j] = (1 ± ε) · G(f_j)/(∑_{k∈[n]} G(f_k)) + γ, where G: R → R_{≥0} is some non-negative function. If ε = 0 and γ = 1/poly(n), the sampler is called perfect. In the data stream model, f is defined implicitly by a sequence of updates to its coordinates, and the goal is to design such a sampler in small space. Jayaram and Woodruff (FOCS 2018) gave the first perfect Lp samplers in turnstile streams, where G(x) = |x|^p, using polylog(n) space for p ∈ (0, 2]. However, to date, all known sampling algorithms are not truly perfect, since their output distribution is only point-wise γ = 1/poly(n) close to the true distribution. This small error can be significant when samplers are run many times on successive portions of a stream, and can leak potentially sensitive information about the data stream. In this work, we initiate the study of truly perfect samplers, with ε = γ = 0, and comprehensively investigate their complexity in the data stream and sliding window models. We begin by showing that sublinear-space truly perfect sampling is impossible in the turnstile model, by proving a lower bound of Ω(min(n, log(1/γ))) for any G-sampler with point-wise error γ from the true distribution. We then give a general time-efficient sublinear-space framework for developing truly perfect samplers in the insertion-only streaming and sliding window models. As specific applications, our framework addresses Lp sampling for all p > 0 (e.g., Õ(n^{1−1/p}) space for p ≥ 1), concave functions, and a large number of measure functions, including the L1−L2, Fair, Huber, and Tukey estimators. The update time of our truly perfect Lp-samplers is O(1), which is an exponential improvement over the running time of previous perfect Lp-samplers.
High Dimensional Differentially Private Stochastic Optimization with Heavy-tailed Data
Lijie Hu, Shuo Ni, Hanshen Xiao, Di Wang
https://doi.org/10.1145/3517804.3524144
As one of the most fundamental problems in machine learning, statistics and differential privacy, Differentially Private Stochastic Convex Optimization (DP-SCO) has been extensively studied in recent years. However, most previous work can only handle either regular data distributions or irregular data in the low-dimensional case. To better understand the challenges arising from irregular data distributions, in this paper we provide the first study of DP-SCO with heavy-tailed data in the high-dimensional setting. In the first part we focus on the problem over a polytope constraint (such as the ℓ1-norm ball). We show that if the loss function is smooth and its gradient has bounded second-order moment, it is possible to get a (high-probability) error bound (excess population risk) of Õ(log d/(nε)^{1/3}) in the ε-DP model, where n is the sample size and d is the dimension of the underlying space. Next, for LASSO, if the data distribution has bounded fourth-order moments, we improve the bound to Õ(log d/(nε)^{2/5}) in the (ε, δ)-DP model. In the second part of the paper, we study sparse learning with heavy-tailed data. We first revisit the sparse linear model and propose a truncated DP-IHT method whose output achieves an error of Õ((s*^2 log^2 d)/(nε)), where s* is the sparsity of the underlying parameter. Then we study a more general problem over the sparsity (i.e., ℓ0-norm) constraint, and show that it is possible to achieve an error of Õ((s*^{3/2} log d)/(nε)), which is also near-optimal up to a factor of Õ(√s*), if the loss function is smooth and strongly convex.
The Complexity of Boolean Conjunctive Queries with Intersection Joins
Mahmoud Abo Khamis, George Chichirim, Antonia Kormpa, Dan Olteanu
https://doi.org/10.1145/3517804.3524156
Intersection joins over interval data are relevant in spatial and temporal data settings. A set of intervals join if their intersection is non-empty. In the case of point intervals, the intersection join becomes the standard equality join. We establish the complexity of Boolean conjunctive queries with intersection joins by a many-one equivalence to disjunctions of Boolean conjunctive queries with equality joins. The complexity of any query with intersection joins is that of the hardest query with equality joins in the disjunction exhibited by our equivalence. This is captured by a new width measure called the ij-width. We also introduce a new syntactic notion of acyclicity, called iota-acyclicity, to characterise the class of Boolean queries with intersection joins that admit linear-time computation modulo a poly-logarithmic factor in the data size. Iota-acyclicity is for intersection joins what alpha-acyclicity is for equality joins. It sits strictly between gamma-acyclicity and Berge-acyclicity. The intersection-join queries that are not iota-acyclic are at least as hard as the Boolean triangle query with equality joins, which is widely considered not computable in linear time.