Proceedings of the ACM on Management of Data: Latest Articles

Bag Semantics Conjunctive Query Containment. Four Small Steps Towards Undecidability.

Proceedings of the ACM on Management of Data

Pub Date : 2024-05-10 DOI: 10.1145/3651604

Jerzy Marcinkowski, Mateusz Orda

The Query Containment Problem (QCP) is one of the most fundamental decision problems in database query processing and optimization. The complexity of QCP for conjunctive queries has been fully understood since the 1970s. But, as Chaudhuri and Vardi noticed in their classical 1993 paper, this understanding is based on the assumption that query answers are sets of tuples, and it does not transfer to the situation where multiset (bag) semantics is considered. Now, 30 years later, the decidability of QCP for bag semantics remains an open question, one of the most intriguing in database theory. In this paper we show a series of undecidability results for some generalizations of this problem. We show, for example, that it is undecidable whether, for two given boolean conjunctive queries $\varphi_s$ and $\varphi_b$ and a linear function $F$, the inequality $F(\varphi_s(D)) \le \varphi_b(D)$ holds for each database instance $D$.
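
To make the inequality in the abstract concrete, here is a minimal sketch that evaluates a boolean conjunctive query under bag semantics by counting satisfying variable assignments (homomorphisms) over a single instance. The queries `phi_s` and `phi_b`, the function `F`, and the instance `D` are hypothetical choices made for illustration, and `D` is assumed to be a plain set instance. Checking one instance says nothing about all instances, which is the undecidable part.

```python
from itertools import product

def count_homomorphisms(query, db):
    """Bag-semantics value of a boolean conjunctive query on db.

    query: list of atoms (relation_name, (var, ...)).
    db: dict mapping relation_name -> set of tuples.
    Returns the number of variable assignments that satisfy every atom.
    """
    variables = sorted({v for _, args in query for v in args})
    domain = sorted({c for tuples in db.values() for t in tuples for c in t})
    count = 0
    for values in product(domain, repeat=len(variables)):
        h = dict(zip(variables, values))
        if all(tuple(h[v] for v in args) in db.get(rel, set())
               for rel, args in query):
            count += 1
    return count

# Hypothetical queries: phi_s counts edges, phi_b counts directed 2-paths.
phi_s = [("E", ("x", "y"))]
phi_b = [("E", ("x", "y")), ("E", ("y", "z"))]

def F(n):
    """A hypothetical linear function."""
    return 2 * n

D = {"E": {(1, 2), (2, 3), (2, 4)}}
s = count_homomorphisms(phi_s, D)
b = count_homomorphisms(phi_b, D)
print(s, b, F(s) <= b)  # prints: 3 2 False
```

On this instance the inequality fails, so `D` is a counterexample for these particular choices.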

Citations: 0

Containment of Graph Queries Modulo Schema

Proceedings of the ACM on Management of Data

Pub Date : 2024-05-10 DOI: 10.1145/3651140

Víctor Gutiérrez-Basulto, Albert Gutowski, Yazmín Ibáñez-García, Filip Murlak

With multiple graph database systems on the market and a new Graph Query Language standard on the horizon, it is time to revisit some classic static analysis problems. Query containment, arguably the workhorse of static analysis, has already received a lot of attention in the context of graph databases, but not so in the presence of schemas. We aim to change this. Because there is no universal agreement yet on what graph schemas should be, we rely on an abstract formalism borrowed from the knowledge representation community: we assume that schemas are expressed in a description logic (DL). We identify a suitable DL that captures both basic constraints on the labels of incident nodes and edges, and more refined schema features such as participation, cardinality, and unary key constraints. Building upon, and extending, the rich body of work on DLs, we solve the containment modulo schema problem for unions of conjunctive regular path queries (UCRPQs) and schemas whose descriptions do not mix inverses and counting. For two-way UCRPQs (UC2RPQs) we solve the problem under additional assumptions that tend to hold in practice: we restrict the use of concatenation in queries and participation constraints in schemas.

Citations: 0

On Density-based Local Community Search

Proceedings of the ACM on Management of Data

Pub Date : 2024-05-10 DOI: 10.1145/3651589

Yizhou Dai, Miao Qiao, Rong-Hua Li

Local community search (LCS) finds a community in a given graph G local to a set R of seed nodes by optimizing an objective function. The objective function f(S) for an induced subgraph S encodes the set inclusion criteria of R to a classic community measurement of S such as the conductance and the density. An ideal algorithm for optimizing f(S) is strongly local, that is, the complexity is dependent on R as opposed to G. This paper formulates a general form of objective functions for LCS using configurations and then focuses on a set C of density-based configurations, each corresponding to a density-based LCS objective function. The paper has two main results. i) A constructive classification of C: a configuration in C has a strongly local algorithm for optimizing its corresponding objective function if and only if it is in $C_L \subseteq C$. ii) A linear programming-based general solution for density-based LCS that is strongly local and practically efficient. This solution is different from the existing strongly local LCS algorithms, which are all based on flow networks.
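
The two classic community measurements named in the abstract can be sketched as follows. This only computes density and conductance of a candidate node set on a small undirected graph. It does not model the paper's configurations or its seed-set inclusion criteria, and the function names and example graph are assumptions made for illustration.

```python
def induced_density(edges, S):
    """Density |E(S)| / |S| of the subgraph induced by node set S."""
    S = set(S)
    inside = sum(1 for u, v in edges if u in S and v in S)
    return inside / len(S)

def conductance(edges, S):
    """Cut size divided by the smaller of vol(S) and vol(V \\ S)."""
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    vol_S = sum((u in S) + (v in S) for u, v in edges)  # sum of degrees in S
    vol_rest = 2 * len(edges) - vol_S
    return cut / min(vol_S, vol_rest)

# Hypothetical undirected graph: a triangle {1, 2, 3} with a tail 3-4-5.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
S = {1, 2, 3}
print(induced_density(edges, S))  # 1.0
print(conductance(edges, S))      # one cut edge over volume 3
```

Both functions scan every edge of the graph, so they are not strongly local in the abstract's sense; they only define the quantities being optimized.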

Citations: 0

Combined Approximations for Uniform Operational Consistent Query Answering

Proceedings of the ACM on Management of Data

Pub Date : 2024-05-10 DOI: 10.1145/3651600

M. Calautti, Ester Livshits, Andreas Pieris, Markus Schneider

Operational consistent query answering (CQA) is a recent framework for CQA based on revised definitions of repairs, which are built by applying a sequence of operations (e.g., fact deletions) starting from an inconsistent database until we reach a database that is consistent w.r.t. the given set of constraints. It has been recently shown that there is an efficient approximation for computing the percentage of repairs that entail a given query when we focus on primary keys and conjunctive queries, and assume the query is fixed (i.e., in data complexity). However, it has been left open whether such an approximation exists when the query is part of the input (i.e., in combined complexity). We show that this is the case when we focus on self-join-free conjunctive queries of bounded generalized hypertreewidth. We also show that it is unlikely that efficient approximation schemes exist once we give up one of the adopted syntactic restrictions, i.e., self-join-freeness or bounding the generalized hypertreewidth. Towards the desired approximation, we introduce a counting complexity class, called SpanTL, show that each problem in it admits an efficient approximation scheme by using a recent approximability result about tree automata, and then place the problem of interest in SpanTL.

Citations: 0

PACMMOD Volume 2 Issue 2: Editorial

Proceedings of the ACM on Management of Data

Pub Date : 2024-05-10 DOI: 10.1145/3651136

F. Geerts, Wim Martens, Matthias Niewerth

We are excited to announce the first issue dedicated to the PODS research track of the Proceedings of the ACM on Management of Data, or PACMMOD, journal. In its current form, this new journal hosts a SIGMOD and a PODS research track. The PODS research track aims to provide a solid scientific basis for methods, techniques, and solutions for the data management challenges that continually arise in our data-driven society. Articles for the PODS track of PACMMOD present principled contributions to modeling, application, system building, and both theoretical and experimental validation in the context of data management. Such articles might be based, among others, on establishing theoretical results, developing new concepts and frameworks that deserve further exploration, providing experimental work that sheds light on the scientific foundations of the discipline, or a rigorous analysis of both widely used and recently developed industry artifacts. At a time when computer science is increasingly data centric, it is essential to promote an active exchange of tools and techniques between principles of database systems and other communities focused on data management. The PODS track thus pays special attention to those papers that help in the urgent process of integrating data management techniques within broader computer science. Articles published in this track will be invited for presentation to the ACM Symposium on Principles of Database Systems (PODS), which is held jointly with SIGMOD each year.

Citations: 0

Fast Matrix Multiplication for Query Processing

Proceedings of the ACM on Management of Data

Pub Date : 2024-05-10 DOI: 10.1145/3651599

Xiao Hu

This paper studies how to use fast matrix multiplication to speed up query processing. As observed, computing a two-table join and then projecting away the join attribute is essentially the Boolean matrix multiplication problem, which can be significantly improved with fast matrix multiplication. Moving beyond this basic two-table query, we introduce output-sensitive algorithms for general join-project queries using fast matrix multiplication. These algorithms have achieved a polynomially large improvement over the classic Yannakakis framework. To the best of our knowledge, this is the first theoretical improvement for general acyclic join-project queries since 1981.
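
The observation that a two-table join followed by projecting away the join attribute is Boolean matrix multiplication can be sketched as below. The relations `R(a, b)` and `S(b, c)`, the shared domain, and the helper names are hypothetical. The matrix product here is the naive cubic one, standing in for the place where a fast matrix multiplication routine would be used; it is not the paper's output-sensitive algorithm.

```python
def join_project(R, S):
    """Compute pi_{a,c}(R(a,b) JOIN S(b,c)) with a hash join on b."""
    by_b = {}
    for b, c in S:
        by_b.setdefault(b, set()).add(c)
    return {(a, c) for a, b in R for c in by_b.get(b, ())}

def to_matrix(rel, rows, cols):
    """Encode a binary relation as a Boolean matrix."""
    return [[(r, c) in rel for c in cols] for r in rows]

def boolean_matmul(A, B):
    """Naive Boolean matrix product (placeholder for a fast routine)."""
    inner, width = len(B), len(B[0])
    return [[any(row[t] and B[t][j] for t in range(inner))
             for j in range(width)] for row in A]

R = {(0, 0), (0, 1), (1, 1)}
S = {(0, 2), (1, 0), (1, 2)}
dom = [0, 1, 2]

C = boolean_matmul(to_matrix(R, dom, dom), to_matrix(S, dom, dom))
from_matrix = {(a, c) for a in dom for c in dom if C[a][c]}
print(from_matrix == join_project(R, S))  # True
```

Entry `C[a][c]` is true exactly when some `b` has `(a, b)` in `R` and `(b, c)` in `S`, which is the definition of the projected join.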

Citations: 0

Topology-aware Parallel Joins

Proceedings of the ACM on Management of Data

Pub Date : 2024-05-10 DOI: 10.1145/3651598

Xiao Hu, Paraschos Koutris

We study the design and analysis of parallel join algorithms in a topology-aware computational model. In this model, the network is modeled as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints and the link bandwidth. The computation proceeds in synchronous rounds and the cost of each round is measured as the maximum cost over all the edges in the network. Our main result is an asymptotically optimal join algorithm over symmetric tree topologies. The algorithm generalizes prior topology-aware protocols for set intersection and cartesian product to a binary join over an arbitrary input distribution with possible data skew.
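
The cost model described above can be sketched in a few lines. The abstract only says that the edge cost depends on the data transferred and the link bandwidth; the concrete choice of data divided by bandwidth, the node names, and the numbers below are assumptions made for illustration.

```python
def round_cost(transfers, bandwidth):
    """Cost of one synchronous round: the maximum cost over all edges.

    transfers: dict (u, v) -> amount of data sent over directed edge (u, v).
    bandwidth: dict (u, v) -> bandwidth of that edge.
    Assumed edge cost: data / bandwidth (a hypothetical cost function).
    """
    return max((transfers.get(e, 0) / bw for e, bw in bandwidth.items()),
               default=0.0)

def total_cost(rounds, bandwidth):
    """Sum of the per-round costs over a sequence of rounds."""
    return sum(round_cost(t, bandwidth) for t in rounds)

# Hypothetical symmetric tree: compute nodes a and b under a switch s.
bandwidth = {("a", "s"): 10, ("s", "a"): 10, ("b", "s"): 2, ("s", "b"): 2}
transfers = {("a", "s"): 40, ("s", "b"): 40}  # a ships 40 units to b via s
print(round_cost(transfers, bandwidth))  # 20.0
```

The slow link into `b` dominates the round, which is the kind of effect a topology-aware algorithm has to plan around.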

Citations: 0

Query Optimization by Quantifier Elimination

Proceedings of the ACM on Management of Data

Pub Date : 2024-05-10 DOI: 10.1145/3651607

Christoph Koch, Peter Lindner

Query optimizers have a limited arsenal of techniques for optimizing nested queries. In this paper, we develop a new approach for query optimization based on quantifier elimination. Quantifier elimination is a well-established tool for proving the decidability of logical theories. Here, however, we show that it can be turned into an effective query optimization technique that may yield asymptotic improvements in query processing efficiency. In addition, the technique establishes a foundation for certain well-known but previously little-understood aggregation-based techniques for optimizing nested queries.
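
A textbook instance of the idea, not taken from the paper: over a linear order, the nested condition "for all s in S, r > s" is equivalent to the quantifier-free condition "r > max(S)" (and holds trivially when S is empty). The sketch below contrasts the nested evaluation with the aggregate-based one. The relation contents and function names are hypothetical.

```python
def nested(R, S):
    """Nested evaluation: for each r, test the universal quantifier over S."""
    return [r for r in R if all(r > s for s in S)]

def eliminated(R, S):
    """Same query after eliminating the quantifier via a MAX aggregate."""
    if not S:
        return list(R)  # a universal condition over an empty set is true
    m = max(S)          # one pass over S
    return [r for r in R if r > m]

R = [5, 1, 9, 7]
S = [3, 6, 2]
print(nested(R, S), eliminated(R, S))  # [9, 7] [9, 7]
```

The nested form does work proportional to |R| times |S| in the worst case, while the rewritten form does one pass over each input. This is the flavor of asymptotic improvement and of the aggregation connection that the abstract mentions.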

Citations: 0

Tight Lower Bounds for Directed Cut Sparsification and Distributed Min-Cut

Proceedings of the ACM on Management of Data

Pub Date : 2024-05-10 DOI: 10.1145/3651148

Yu Cheng, Max Li, Honghao Lin, Zi-Yi Tai, David P. Woodruff, Jason Zhang

In this paper, we consider two fundamental cut approximation problems on large graphs. We prove new lower bounds for both problems that are optimal up to logarithmic factors. The first problem is approximating cuts in balanced directed graphs. In this problem, we want to build a data structure that can provide a $(1 \pm \varepsilon)$-approximation of cut values on a graph with $n$ vertices. For arbitrary directed graphs, such a data structure requires $\Omega(n^2)$ bits even for constant $\varepsilon$. To circumvent this, recent works study $\beta$-balanced graphs, meaning that for every directed cut, the total weight of edges in one direction is at most $\beta$ times the total weight in the other direction. We consider the for-each model, where the goal is to approximate each cut with constant probability, and the for-all model, where all cuts must be preserved simultaneously. We improve the previous $\Omega(n\sqrt{\beta/\varepsilon})$ lower bound in the for-each model to $\tilde{\Omega}(n\sqrt{\beta}/\varepsilon)$, and we improve the previous $\Omega(n\beta/\varepsilon)$ lower bound in the for-all model to $\Omega(n\beta/\varepsilon^2)$. This resolves the main open questions of (Cen et al., ICALP, 2021). The second problem is approximating the global minimum cut in a local query model, where we can only access the graph via degree, edge, and adjacency queries. We prove an $\Omega(\min\{m, m/(\varepsilon^2 k)\})$ lower bound for this problem, which improves the previous $\Omega(m/k)$ lower bound, where $m$ is the number of edges, $k$ is the minimum cut size, and we seek a $(1+\varepsilon)$-approximation. In addition, we show that existing upper bounds with minor modifications match our lower bound up to logarithmic factors.
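
The definition of a balanced directed graph can be checked by brute force on a tiny example. The sketch below enumerates every cut, so it is exponential in the number of vertices and only illustrates the definition; it is not a sparsifier or any algorithm from the paper. The graph, weights, and function names are hypothetical.

```python
from itertools import combinations

def cut_weight(edges, S):
    """Total weight of edges leaving S. edges: dict (u, v) -> weight."""
    return sum(w for (u, v), w in edges.items() if u in S and v not in S)

def is_beta_balanced(nodes, edges, beta):
    """True if every directed cut weighs at most beta times its reverse."""
    nodes = list(nodes)
    for size in range(1, len(nodes)):
        for subset in combinations(nodes, size):
            S = set(subset)
            T = set(nodes) - S
            # Both S and its complement are enumerated, so checking this
            # single direction covers both directions of every cut.
            if cut_weight(edges, S) > beta * cut_weight(edges, T):
                return False
    return True

# Hypothetical directed cycle 0 -> 1 -> 2 -> 0 with one heavy edge.
edges = {(0, 1): 3, (1, 2): 1, (2, 0): 1}
print(is_beta_balanced(range(3), edges, 3))  # True
print(is_beta_balanced(range(3), edges, 2))  # False
```

The cut around vertex 0 has weight 3 outgoing and 1 incoming, so the graph is 3-balanced but not 2-balanced.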

Citations: 0

Streaming Algorithms with Few State Changes

Proceedings of the ACM on Management of Data

Pub Date : 2024-05-10 DOI: 10.1145/3651145

Rajesh Jayaram, David P. Woodruff, Samson Zhou

In this paper, we study streaming algorithms that minimize the number of changes made to their internal state (i.e., memory contents). While the design of streaming algorithms typically focuses on minimizing space and update time, these metrics fail to capture the asymmetric costs, inherent in modern hardware and database systems, of reading versus writing to memory. In fact, most streaming algorithms write to their memory on every update, which is undesirable when writing is significantly more expensive than reading. This raises the question of whether streaming algorithms with small space and a small number of memory writes are possible. We first demonstrate that, for the fundamental $F_p$ moment estimation problem with $p \ge 1$, any streaming algorithm that achieves a constant factor approximation must make $\Omega(n^{1-1/p})$ internal state changes, regardless of how much space it uses. Perhaps surprisingly, we show that this lower bound can be matched by an algorithm which also has near-optimal space complexity. Specifically, we give a $(1+\varepsilon)$-approximation algorithm for $F_p$ moment estimation that uses a near-optimal $\tilde{O}_\varepsilon(n^{1-1/p})$ number of state changes, while simultaneously achieving near-optimal space, i.e., for $p \in [1,2)$, our algorithm uses $\mathrm{poly}(\log n, 1/\varepsilon)$ bits of space, while for $p > 2$, the algorithm uses $\tilde{O}_\varepsilon(n^{1-1/p})$ space. We similarly design streaming algorithms that are simultaneously near-optimal in both space complexity and the number of state changes for the heavy-hitters problem, sparse support recovery, and entropy estimation. Our results demonstrate that an optimal number of state changes can be achieved without sacrificing space complexity.
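
To illustrate what "few state changes" means, the sketch below compares an exact counter, which writes to memory on every update, with the classic Morris approximate counter, which writes only when a random test succeeds. The Morris counter is a well-known technique used here purely as an illustration of the metric; it is not the algorithm from the paper. The stream length and seed are arbitrary choices.

```python
import random

def exact_counter(stream_length):
    """Exact counting: the stored value changes on every update."""
    count, state_changes = 0, 0
    for _ in range(stream_length):
        count += 1
        state_changes += 1
    return count, state_changes

def morris_counter(stream_length, rng):
    """Morris counter: store an exponent x, increment it w.p. 2**-x.

    Returns the estimate 2**x - 1 and the number of memory writes,
    which equals the final value of x.
    """
    x, state_changes = 0, 0
    for _ in range(stream_length):
        if rng.random() < 2.0 ** -x:
            x += 1
            state_changes += 1
    return 2 ** x - 1, state_changes

n = 10_000
exact, exact_writes = exact_counter(n)
estimate, morris_writes = morris_counter(n, random.Random(0))
print(exact_writes, morris_writes, estimate)
```

A single Morris counter has high variance, so the estimate is coarse, but its number of writes grows roughly logarithmically with the stream length while the exact counter writes n times.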

Citations: 0
