Query Containment Problem (QCP) is one of the most fundamental decision problems in database query processing and optimization. Complexity of QCP for conjunctive queries has been fully understood since 1970s. But, as Chaudhuri and Vardi noticed in their classical 1993 paper this understanding is based on the assumption that query answers are sets of tuples, and it does not transfer to the situation when multi-set (bag) semantics is considered. Now, 30 years later, decidability of QCP for bag semantics remains an open question, one of the most intriguing open questions in database theory. In this paper we show a series of undecidability results for some generalizations of this problem. We show, for example, that the problem whether, for given two boolean conjunctive queries φ s and φ b , and a linear function F, the inequality F(φ s (D)) =< φ b (D) holds for each database instance D, is undecidable.
查询包含问题(QCP)是数据库查询处理和优化中最基本的决策问题之一。早在 20 世纪 70 年代,人们就已经完全理解了连接查询的 QCP 复杂性。但是,正如 Chaudhuri 和 Vardi 在他们 1993 年的经典论文中所指出的,这种理解是基于查询答案是元组集合的假设,并没有转移到考虑多集合(袋)语义的情况中。30 年后的今天,袋语义的 QCP 可判定性仍是一个未决问题,也是数据库理论中最引人入胜的未决问题之一。 在本文中,我们展示了该问题某些广义化的一系列不可判定性结果。例如,我们证明了这样一个问题:对于给定的两个布尔连接查询φ s 和φ b,以及一个线性函数 F,对于每个数据库实例 D,不等式 F(φ s (D)) =< φ b (D) 是否成立?
{"title":"Bag Semantics Conjunctive Query Containment. Four Small Steps Towards Undecidability.","authors":"Jerzy Marcinkowski, Mateusz Orda","doi":"10.1145/3651604","DOIUrl":"https://doi.org/10.1145/3651604","url":null,"abstract":"Query Containment Problem (QCP) is one of the most fundamental decision problems in database query processing and optimization.\u0000 Complexity of QCP for conjunctive queries has been fully understood since 1970s. But, as Chaudhuri and Vardi noticed in their classical 1993 paper this understanding is based on the assumption that query answers are sets of tuples, and it does not transfer to the situation when multi-set (bag) semantics is considered.\u0000 Now, 30 years later, decidability of QCP for bag semantics remains an open question, one of the most intriguing open questions in database theory.\u0000 \u0000 In this paper we show a series of undecidability results for some generalizations of this problem. We show, for example, that the problem whether, for given two boolean conjunctive queries φ\u0000 s\u0000 and φ\u0000 b\u0000 , and a linear function F, the inequality F(φ\u0000 s\u0000 (D)) =< φ\u0000 b\u0000 (D) holds for each database instance D, is undecidable.\u0000","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 92","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140990833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Víctor Gutiérrez-Basulto, Albert Gutowski, Yazmín Ibáñez-García, Filip Murlak
With multiple graph database systems on the market and a new Graph Query Language standard on the horizon, it is time to revisit some classic static analysis problems. Query containment, arguably the workhorse of static analysis, has already received a lot of attention in the context of graph databases, but not so in the presence of schemas. We aim to change this. Because there is no universal agreement yet on what graph schemas should be, we rely on an abstract formalism borrowed from the knowledge representation community: we assume that schemas are expressed in a description logic (DL). We identify a suitable DL that capture both basic constraints on the labels of incident nodes and edges, and more refined schema features such as participation, cardinality, and unary key constraints. Basing upon, and extending, the rich body of work on DLs, we solve the containment modulo schema problem for unions of conjunctive regular path queries (UCRPQs) and schemas whose descriptions do not mix inverses and counting. For two-way UCRPQs (UC2RPQs) we solve the problem under additional assumptions that tend to hold in practice: we restrict the use of concatenation in queries and participation constraints in schemas.
{"title":"Containment of Graph Queries Modulo Schema","authors":"Víctor Gutiérrez-Basulto, Albert Gutowski, Yazmín Ibáñez-García, Filip Murlak","doi":"10.1145/3651140","DOIUrl":"https://doi.org/10.1145/3651140","url":null,"abstract":"With multiple graph database systems on the market and a new Graph Query Language standard on the horizon, it is time to revisit some classic static analysis problems. Query containment, arguably the workhorse of static analysis, has already received a lot of attention in the context of graph databases, but not so in the presence of schemas. We aim to change this. Because there is no universal agreement yet on what graph schemas should be, we rely on an abstract formalism borrowed from the knowledge representation community: we assume that schemas are expressed in a description logic (DL). We identify a suitable DL that capture both basic constraints on the labels of incident nodes and edges, and more refined schema features such as participation, cardinality, and unary key constraints. Basing upon, and extending, the rich body of work on DLs, we solve the containment modulo schema problem for unions of conjunctive regular path queries (UCRPQs) and schemas whose descriptions do not mix inverses and counting. For two-way UCRPQs (UC2RPQs) we solve the problem under additional assumptions that tend to hold in practice: we restrict the use of concatenation in queries and participation constraints in schemas.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 35","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140990636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Local community search (LCS) finds a community in a given graph G local to a set R of seed nodes by optimizing an objective function. The objective function f(S) for an induced subgraph S encodes the set inclusion criteria of R to a classic community measurement of S such as the conductance and the density. An ideal algorithm for optimizing f(S) is strongly local, that is, the complexity is dependent on R as opposed to G. This paper formulates a general form of objective functions for LCS using configurations and then focuses on a set C of density-based configurations, each corresponding to a density-based LCS objective function. The paper has two main results. i) A constructive classification of C: a configuration in C has a strongly local algorithm for optimizing its corresponding objective function if and only if it is in C L ⊆ C. ii) A linear programming-based general solution for density-based LCS that is strongly local and practically efficient. This solution is different from the existing strongly local LCS algorithms, which are all based on flow networks.
局部群落搜索(LCS)通过优化目标函数,在给定图 G 中找到种子节点集 R 的局部群落。诱导子图 S 的目标函数 f(S) 将 R 的包含标准集编码为 S 的经典群落测量值,如传导率和密度。优化 f(S) 的理想算法是强局部算法,即复杂度取决于 R 而非 G。本文利用配置提出了 LCS 目标函数的一般形式,然后将重点放在一组基于密度的配置 C 上,每个配置对应一个基于密度的 LCS 目标函数。本文有两个主要结果:i) C 的构造分类:C 中的配置具有强局部算法来优化其相应的目标函数,当且仅当它在 C L ⊆ C 中。该方案不同于现有的强局部 LCS 算法,后者均基于流量网络。
{"title":"On Density-based Local Community Search","authors":"Yizhou Dai, Miao Qiao, Rong-Hua Li","doi":"10.1145/3651589","DOIUrl":"https://doi.org/10.1145/3651589","url":null,"abstract":"\u0000 Local community search (LCS) finds a community in a given graph G local to a set R of seed nodes by optimizing an objective function. The objective function f(S) for an induced subgraph S encodes the set inclusion criteria of R to a classic community measurement of S such as the conductance and the density. An ideal algorithm for optimizing f(S) is strongly local, that is, the complexity is dependent on R as opposed to G. This paper formulates a general form of objective functions for LCS using configurations and then focuses on a set C of density-based configurations, each corresponding to a density-based LCS objective function. The paper has two main results. i) A constructive classification of C: a configuration in C has a strongly local algorithm for optimizing its corresponding objective function if and only if it is in C\u0000 L\u0000 ⊆ C. ii) A linear programming-based general solution for density-based LCS that is strongly local and practically efficient. This solution is different from the existing strongly local LCS algorithms, which are all based on flow networks.\u0000","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 27","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140991914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Calautti, Ester Livshits, Andreas Pieris, Markus Schneider
Operational consistent query answering (CQA) is a recent framework for CQA based on revised definitions of repairs, which are built by applying a sequence of operations (e.g., fact deletions) starting from an inconsistent database until we reach a database that is consistent w.r.t. the given set of constraints. It has been recently shown that there is an efficient approximation for computing the percentage of repairs that entail a given query when we focus on primary keys, conjunctive queries, and assuming the query is fixed (i.e., in data complexity). However, it has been left open whether such an approximation exists when the query is part of the input (i.e., in combined complexity). We show that this is the case when we focus on self-join-free conjunctive queries of bounded generelized hypertreewidth. We also show that it is unlikely that efficient approximation schemes exist once we give up one of the adopted syntactic restrictions, i.e., self-join-freeness or bounding the generelized hypertreewidth. Towards the desired approximation, we introduce a counting complexity class, called SpanTL, show that each problem in it admits an efficient approximation scheme by using a recent approximability result about tree automata, and then place the problem of interest in SpanTL.
{"title":"Combined Approximations for Uniform Operational Consistent Query Answering","authors":"M. Calautti, Ester Livshits, Andreas Pieris, Markus Schneider","doi":"10.1145/3651600","DOIUrl":"https://doi.org/10.1145/3651600","url":null,"abstract":"Operational consistent query answering (CQA) is a recent framework for CQA based on revised definitions of repairs, which are built by applying a sequence of operations (e.g., fact deletions) starting from an inconsistent database until we reach a database that is consistent w.r.t. the given set of constraints. It has been recently shown that there is an efficient approximation for computing the percentage of repairs that entail a given query when we focus on primary keys, conjunctive queries, and assuming the query is fixed (i.e., in data complexity). However, it has been left open whether such an approximation exists when the query is part of the input (i.e., in combined complexity). We show that this is the case when we focus on self-join-free conjunctive queries of bounded generelized hypertreewidth. We also show that it is unlikely that efficient approximation schemes exist once we give up one of the adopted syntactic restrictions, i.e., self-join-freeness or bounding the generelized hypertreewidth. Towards the desired approximation, we introduce a counting complexity class, called SpanTL, show that each problem in it admits an efficient approximation scheme by using a recent approximability result about tree automata, and then place the problem of interest in SpanTL.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140993686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We are excited to announce the first issue dedicated to the PODS research track of the Proceedings of the ACM on Management of Data, or PACMMOD, journal. In its current form, this new journal hosts a SIGMOD and a PODS research track. The PODS research track aims to provide a solid scientific basis for methods, techniques, and solutions for the data management challenges that continually arise in our data-driven society. Articles for the PODS track of PACMMOD present principled contributions to modeling, application, system building, and both theoretical and experimental validation in the context of data management. Such articles might be based, among others, on establishing theoretical results, developing new concepts and frameworks that deserve further exploration, providing experimental work that sheds light on the scientific foundations of the discipline, or a rigorous analysis of both widely used and recently developed industry artifacts. At a time when computer science is increasingly data centric, it is essential to promote an active exchange of tools and techniques between principles of database systems and other communities focused on data management. The PODS track thus pays special attention to those papers that help in the urgent process of integrating data management techniques within broader computer science. Articles published in this track will be invited for presentation to the ACM Symposium on Principles of Database Systems (PODS), which is held jointly with SIGMOD each year.
{"title":"PACMMOD Volume 2 Issue 2: Editorial","authors":"F. Geerts, Wim Martens, Matthias Niewerth","doi":"10.1145/3651136","DOIUrl":"https://doi.org/10.1145/3651136","url":null,"abstract":"We are excited to announce the first issue dedicated to the PODS research track of the Proceedings of the ACM on Management of Data, or PACMMOD, journal. In its current form, this new journal hosts a SIGMOD and a PODS research track. The PODS research track aims to provide a solid scientific basis for methods, techniques, and solutions for the data management challenges that continually arise in our data-driven society. Articles for the PODS track of PACMMOD present principled contributions to modeling, application, system building, and both theoretical and experimental validation in the context of data management. Such articles might be based, among others, on establishing theoretical results, developing new concepts and frameworks that deserve further exploration, providing experimental work that sheds light on the scientific foundations of the discipline, or a rigorous analysis of both widely used and recently developed industry artifacts. At a time when computer science is increasingly data centric, it is essential to promote an active exchange of tools and techniques between principles of database systems and other communities focused on data management. The PODS track thus pays special attention to those papers that help in the urgent process of integrating data management techniques within broader computer science. Articles published in this track will be invited for presentation to the ACM Symposium on Principles of Database Systems (PODS), which is held jointly with SIGMOD each year.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 29","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140992771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper studies how to use fast matrix multiplication to speed up query processing. As observed, computing a two-table join and then projecting away the join attribute is essentially the Boolean matrix multiplication problem, which can be significantly improved with fast matrix multiplication. Moving beyond this basic two-table query, we introduce output-sensitive algorithms for general join-project queries using fast matrix multiplication. These algorithms have achieved a polynomially large improvement over the classic Yannakakis framework. To the best of our knowledge, this is the first theoretical improvement for general acyclic join-project queries since 1981.
{"title":"Fast Matrix Multiplication for Query Processing","authors":"Xiao Hu","doi":"10.1145/3651599","DOIUrl":"https://doi.org/10.1145/3651599","url":null,"abstract":"This paper studies how to use fast matrix multiplication to speed up query processing. As observed, computing a two-table join and then projecting away the join attribute is essentially the Boolean matrix multiplication problem, which can be significantly improved with fast matrix multiplication. Moving beyond this basic two-table query, we introduce output-sensitive algorithms for general join-project queries using fast matrix multiplication. These algorithms have achieved a polynomially large improvement over the classic Yannakakis framework. To the best of our knowledge, this is the first theoretical improvement for general acyclic join-project queries since 1981.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 45","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140992777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study the design and analysis of parallel join algorithms in a topology-aware computational model. In this model, the network is modeled as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints and the link bandwidth. The computation proceeds in synchronous rounds and the cost of each round is measured as the maximum cost over all the edges in the network. Our main result is an asymptotically optimal join algorithm over symmetric tree topologies. The algorithm generalizes prior topology-aware protocols for set intersection and cartesian product to a binary join over an arbitrary input distribution with possible data skew.
{"title":"Topology-aware Parallel Joins","authors":"Xiao Hu, Paraschos Koutris","doi":"10.1145/3651598","DOIUrl":"https://doi.org/10.1145/3651598","url":null,"abstract":"We study the design and analysis of parallel join algorithms in a topology-aware computational model. In this model, the network is modeled as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints and the link bandwidth. The computation proceeds in synchronous rounds and the cost of each round is measured as the maximum cost over all the edges in the network. Our main result is an asymptotically optimal join algorithm over symmetric tree topologies. The algorithm generalizes prior topology-aware protocols for set intersection and cartesian product to a binary join over an arbitrary input distribution with possible data skew.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 12","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140993901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Query optimizers have a limited arsenal of techniques for optimizing nested queries. In this paper, we develop a new approach for query optimization based on quantifier elimination. Quantifier elimination is a well-established tool for proving the decidability of logical theories. Here, however, we show that it can be turned into an effective query optimization technique that may yield asymptotic improvements in query processing efficiency. In addition, the technique establishes a foundation for certain well-known but previously little-understood aggregation based techniques for optimizing nested queries.
{"title":"Query Optimization by Quantifier Elimination","authors":"Christoph Koch, Peter Lindner","doi":"10.1145/3651607","DOIUrl":"https://doi.org/10.1145/3651607","url":null,"abstract":"Query optimizers have a limited arsenal of techniques for optimizing nested queries. In this paper, we develop a new approach for query optimization based on quantifier elimination. Quantifier elimination is a well-established tool for proving the decidability of logical theories. Here, however, we show that it can be turned into an effective query optimization technique that may yield asymptotic improvements in query processing efficiency. In addition, the technique establishes a foundation for certain well-known but previously little-understood aggregation based techniques for optimizing nested queries.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 17","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140993968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yu Cheng, Max Li, Honghao Lin, Zi-Yi Tai, David P. Woodruff, Jason Zhang
In this paper, we consider two fundamental cut approximation problems on large graphs. We prove new lower bounds for both problems that are optimal up to logarithmic factors. The first problem is approximating cuts in balanced directed graphs. In this problem, we want to build a data structure that can provide (1 ± ε)-approximation of cut values on a graph with n vertices. For arbitrary directed graphs, such a data structure requires Ω(n 2 ) bits even for constant ε. To circumvent this, recent works study β-balanced graphs, meaning that for every directed cut, the total weight of edges in one direction is at most β times the total weight in the other direction. We consider the for-each model, where the goal is to approximate each cut with constant probability, and the for-all model, where all cuts must be preserved simultaneously. We improve the previous Ømega(n √β/ε) lower bound in the for-each model to ~Ω (n √β /ε) and we improve the previous Ω(n β/ε) lower bound in the for-all model to Ω(n β/ε 2 ). This resolves the main open questions of (Cen et al., ICALP, 2021). The second problem is approximating the global minimum cut in a local query model, where we can only access the graph via degree, edge, and adjacency queries. We prove an ΩL(min m, m/ε 2 k R) lower bound for this problem, which improves the previous ΩL(m/k R) lower bound, where m is the number of edges, k is the minimum cut size, and we seek a (1+ε)-approximation. In addition, we show that existing upper bounds with minor modifications match our lower bound up to logarithmic factors.
{"title":"Tight Lower Bounds for Directed Cut Sparsification and Distributed Min-Cut","authors":"Yu Cheng, Max Li, Honghao Lin, Zi-Yi Tai, David P. Woodruff, Jason Zhang","doi":"10.1145/3651148","DOIUrl":"https://doi.org/10.1145/3651148","url":null,"abstract":"In this paper, we consider two fundamental cut approximation problems on large graphs. We prove new lower bounds for both problems that are optimal up to logarithmic factors.\u0000 \u0000 The first problem is approximating cuts in balanced directed graphs. In this problem, we want to build a data structure that can provide (1 ± ε)-approximation of cut values on a graph with n vertices. For arbitrary directed graphs, such a data structure requires Ω(n\u0000 2\u0000 ) bits even for constant ε. To circumvent this, recent works study β-balanced graphs, meaning that for every directed cut, the total weight of edges in one direction is at most β times the total weight in the other direction. We consider the for-each model, where the goal is to approximate each cut with constant probability, and the for-all model, where all cuts must be preserved simultaneously. We improve the previous Ømega(n √β/ε) lower bound in the for-each model to ~Ω (n √β /ε) and we improve the previous Ω(n β/ε) lower bound in the for-all model to Ω(n β/ε\u0000 2\u0000 ). This resolves the main open questions of (Cen et al., ICALP, 2021).\u0000 \u0000 \u0000 The second problem is approximating the global minimum cut in a local query model, where we can only access the graph via degree, edge, and adjacency queries. We prove an ΩL(min m, m/ε\u0000 2\u0000 k R) lower bound for this problem, which improves the previous ΩL(m/k R) lower bound, where m is the number of edges, k is the minimum cut size, and we seek a (1+ε)-approximation. In addition, we show that existing upper bounds with minor modifications match our lower bound up to logarithmic factors.\u0000","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 20","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140990526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we study streaming algorithms that minimize the number of changes made to their internal state (i.e., memory contents). While the design of streaming algorithms typically focuses on minimizing space and update time, these metrics fail to capture the asymmetric costs, inherent in modern hardware and database systems, of reading versus writing to memory. In fact, most streaming algorithms write to their memory on every update, which is undesirable when writing is significantly more expensive than reading. This raises the question of whether streaming algorithms with small space and number of memory writes are possible. We first demonstrate that, for the fundamental F p moment estimation problem with p ≥ 1, any streaming algorithm that achieves a constant factor approximation must make Ω(n 1-1/p ) internal state changes, regardless of how much space it uses. Perhaps surprisingly, we show that this lower bound can be matched by an algorithm which also has near-optimal space complexity. Specifically, we give a (1+ε)-approximation algorithm for F p moment estimation that use a near-optimal ~O ε (n 1-1/p ) number of state changes, while simultaneously achieving near-optimal space, i.e., for p∈[1,2), our algorithm uses poly(log n,1/ε) bits of space for, while for p>2, the algorithm uses ~O ε (n 1-1/p ) space. We similarly design streaming algorithms that are simultaneously near-optimal in both space complexity and the number of state changes for the heavy-hitters problem, sparse support recovery, and entropy estimation. Our results demonstrate that an optimal number of state changes can be achieved without sacrificing space complexity.
本文研究的流式算法能最大限度地减少对内部状态(即内存内容)的更改次数。虽然流算法的设计通常侧重于最小化空间和更新时间,但这些指标未能捕捉到现代硬件和数据库系统固有的读取内存与写入内存的不对称成本。事实上,大多数流式算法在每次更新时都会向内存写入数据,当写入数据的成本明显高于读取数据时,这种做法是不可取的。这就提出了一个问题:写入内存的空间和次数较少的流式算法是否可行? 我们首先证明,对于 p ≥ 1 的基本 F p 矩估计问题,无论使用多少空间,任何能实现常数因子逼近的流算法都必须进行 Ω(n 1-1/p ) 内部状态变化。也许令人惊讶的是,我们证明了这种算法也能达到这个下限,而且空间复杂度接近最优。具体来说,我们给出了一种 (1+ε)-Approximation 算法,用于 F p 矩估计,该算法使用了接近最优的 ~O ε (n 1-1/p ) 状态变化次数,同时实现了接近最优的空间,即对于 p∈[1,2], 我们的算法使用了 poly(log n,1/ε) 位空间,而对于 p>2, 该算法使用了 ~O ε (n 1-1/p ) 空间。我们还设计了类似的流算法,这些算法同时在重载问题、稀疏支持恢复和熵估计的空间复杂度和状态变化次数上接近最优。我们的结果表明,可以在不牺牲空间复杂度的情况下实现最佳状态变化次数。
{"title":"Streaming Algorithms with Few State Changes","authors":"Rajesh Jayaram, David P. Woodruff, Samson Zhou","doi":"10.1145/3651145","DOIUrl":"https://doi.org/10.1145/3651145","url":null,"abstract":"In this paper, we study streaming algorithms that minimize the number of changes made to their internal state (i.e., memory contents). While the design of streaming algorithms typically focuses on minimizing space and update time, these metrics fail to capture the asymmetric costs, inherent in modern hardware and database systems, of reading versus writing to memory. In fact, most streaming algorithms write to their memory on every update, which is undesirable when writing is significantly more expensive than reading. This raises the question of whether streaming algorithms with small space and number of memory writes are possible.\u0000 \u0000 We first demonstrate that, for the fundamental F\u0000 p\u0000 moment estimation problem with p ≥ 1, any streaming algorithm that achieves a constant factor approximation must make Ω(n\u0000 1-1/p\u0000 ) internal state changes, regardless of how much space it uses. Perhaps surprisingly, we show that this lower bound can be matched by an algorithm which also has near-optimal space complexity. Specifically, we give a (1+ε)-approximation algorithm for F\u0000 p\u0000 moment estimation that use a near-optimal ~O\u0000 ε\u0000 (n\u0000 1-1/p\u0000 ) number of state changes, while simultaneously achieving near-optimal space, i.e., for p∈[1,2), our algorithm uses poly(log n,1/ε) bits of space for, while for p>2, the algorithm uses ~O\u0000 ε\u0000 (n\u0000 1-1/p\u0000 ) space. We similarly design streaming algorithms that are simultaneously near-optimal in both space complexity and the number of state changes for the heavy-hitters problem, sparse support recovery, and entropy estimation. Our results demonstrate that an optimal number of state changes can be achieved without sacrificing space complexity.\u0000","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 83","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140991085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}