Peter Lindner, Sachin Basil John, Christoph Koch, Dan Suciu
We investigate an approximation algorithm for various aggregate queries on partially materialized data cubes. Data cubes are interpreted as probability distributions, and cuboids from a partial materialization populate the terms of a series expansion of the target query distribution. Unknown terms in the expansion are simply assumed to be 0 in order to recover an approximate query result. We identify this method as a variant of approaches known from other fields of science, namely the Bahadur representation and, more generally, (biased) Fourier expansions of Boolean functions. Existing literature indicates a rich but intricate theoretical landscape. Focusing on the data cube application, we start by investigating worst-case error bounds. We build upon prior work to obtain provably optimal materialization strategies with respect to query workloads. In addition, we propose a new heuristic method governing materialization decisions. Finally, we show that well-approximated queries are guaranteed to have well-approximated roll-ups.
{"title":"The Moments Method for Approximate Data Cube Queries","authors":"Peter Lindner, Sachin Basil John, Christoph Koch, D. Suciu","doi":"10.1145/3651147","DOIUrl":"https://doi.org/10.1145/3651147","url":null,"abstract":"We investigate an approximation algorithm for various aggregate queries on partially materialized data cubes. Data cubes are interpreted as probability distributions, and cuboids from a partial materialization populate the terms of a series expansion of the target query distribution. Unknown terms in the expansion are just assumed to be 0 in order to recover an approximate query result. We identify this method as a variant of related approaches from other fields of science, that is, the Bahadur representation and, more generally, (biased) Fourier expansions of Boolean functions. Existing literature indicates a rich but intricate theoretical landscape. Focusing on the data cube application, we start by investigating worst-case error bounds. We build upon prior work to obtain provably optimal materialization strategies with respect to query workloads. In addition, we propose a new heuristic method governing materialization decisions. Finally, we show that well-approximated queries are guaranteed to have well-approximated roll-ups.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 23","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140993290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael A. Bender, Martín Farach-Colton, Michael T. Goodrich, Hanna Komlós
A data structure is history independent if its internal representation reveals nothing about the history of operations beyond what can be determined from the current contents of the data structure. History independence is typically viewed as a security or privacy guarantee, with the intent being to minimize risks incurred by a security breach or audit. Despite widespread advances in history independence, there is an important data-structural primitive that previous work has been unable to replace with an equivalent history-independent alternative---dynamic partitioning. In dynamic partitioning, we are given a dynamic set S of ordered elements and a size-parameter B, and the objective is to maintain a partition of S into ordered groups, each of size Θ(B). Dynamic partitioning is important throughout computer science, with applications to B-tree rebalancing, write-optimized dictionaries, log-structured merge trees, other external-memory indexes, geometric and spatial data structures, cache-oblivious data structures, and order-maintenance data structures. The lack of a history-independent dynamic-partitioning primitive has meant that designers of history-independent data structures have had to resort to complex alternatives. In this paper, we achieve history-independent dynamic partitioning. Our algorithm runs asymptotically optimally against an oblivious adversary, processing each insert/delete with O(1) operations in expectation and O(B log N / log log N) operations with high probability, where N is the size of the set.
{"title":"History-Independent Dynamic Partitioning: Operation-Order Privacy in Ordered Data Structures","authors":"Michael A. Bender, Martín Farach-Colton, Michael T. Goodrich, Hanna Komlós","doi":"10.1145/3651609","DOIUrl":"https://doi.org/10.1145/3651609","url":null,"abstract":"A data structure is history independent if its internal representation reveals nothing about the history of operations beyond what can be determined from the current contents of the data structure. History independence is typically viewed as a security or privacy guarantee, with the intent being to minimize risks incurred by a security breach or audit. Despite widespread advances in history independence, there is an important data-structural primitive that previous work has been unable to replace with an equivalent history-independent alternative---dynamic partitioning. In dynamic partitioning, we are given a dynamic set S of ordered elements and a size-parameter B, and the objective is to maintain a partition of S into ordered groups, each of size Θ(B). Dynamic partitioning is important throughout computer science, with applications to B-tree rebalancing, write-optimized dictionaries, log-structured merge trees, other external-memory indexes, geometric and spatial data structures, cache-oblivious data structures, and order-maintenance data structures. The lack of a history-independent dynamic-partitioning primitive has meant that designers of history-independent data structures have had to resort to complex alternatives. In this paper, we achieve history-independent dynamic partitioning. Our algorithm runs asymptotically optimally against an oblivious adversary, processing each insert/delete with O(1) operations in expectation and O(B log N/loglog N) with high probability in set size N.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 10","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140991400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christoph Dorn, Haikal Pribadi
Relational data modeling can often be restrictive as it provides no direct facility for modeling polymorphic types, reified relations, multi-valued attributes, and other common high-level structures in data. This creates many challenges in data modeling and engineering tasks, and has led to the rise of more flexible NoSQL databases, such as graph and document databases. In the absence of structured schemas, however, we can neither express nor validate the intention of data models, making long-term maintenance of databases substantially more difficult. To resolve this dilemma, we argue that, parallel to the role of classical predicate logic for relational algebra, contemporary foundations of mathematics rooted in type theory can guide us in the development of powerful new high-level data models and query languages. To this end, we introduce a new polymorphic entity-relation-attribute (PERA) data model, grounded in type-theoretic principles and accessible through classical conceptual modeling, with a near-natural query language: TypeQL. We illustrate the syntax of TypeQL as well as its denotation in the PERA model, formalize our model as an algebraic theory with dependent types, and describe its stratified semantics.
{"title":"TypeQL: A Type-Theoretic & Polymorphic Query Language","authors":"Christoph Dorn, Haikal Pribadi","doi":"10.1145/3651611","DOIUrl":"https://doi.org/10.1145/3651611","url":null,"abstract":"Relational data modeling can often be restrictive as it provides no direct facility for modeling polymorphic types, reified relations, multi-valued attributes, and other common high-level structures in data. This creates many challenges in data modeling and engineering tasks, and has led to the rise of more flexible NoSQL databases, such as graph and document databases. In the absence of structured schemas, however, we can neither express nor validate the intention of data models, making long-term maintenance of databases substantially more difficult. To resolve this dilemma, we argue that, parallel to the role of classical predicate logic for relational algebra, contemporary foundations of mathematics rooted in type theory can guide us in the development of powerful new high-level data models and query languages. To this end, we introduce a new polymorphic entity-relation-attribute (PERA) data model, grounded in type-theoretic principles and accessible through classical conceptual modeling, with a near-natural query language: TypeQL. We illustrate the syntax of TypeQL as well as its denotation in the PERA model, formalize our model as an algebraic theory with dependent types, and describe its stratified semantics.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 24","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140992582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Elena Gribelyuk, Pachara Sawettamalya, Hongxun Wu, Huacheng Yu
Estimating the ε-approximate quantiles or ranks of a stream is a fundamental task in data monitoring. Given a stream x_1, ..., x_n from a universe U with a total order, an additive-error quantile sketch M allows us to approximate the rank of any query y ∈ U up to additive εn error. In 2001, Greenwald and Khanna gave a deterministic algorithm (the GK sketch) that solves the ε-approximate quantile estimation problem using O(ε^{-1} log(εn)) space; recently, this algorithm was shown to be optimal by Cormode and Veselý in 2020. However, due to the intricacy of the GK sketch and its analysis, over-simplified versions of the algorithm are implemented in practical applications, often without any known theoretical guarantees. In fact, it has remained an open question whether the GK sketch can be simplified while maintaining the optimal space bound. In this paper, we resolve this open question by giving a simplified deterministic algorithm that stores at most (2 + o(1)) ε^{-1} log(εn) elements and solves the additive-error quantile estimation problem; as a side benefit, our algorithm achieves a smaller constant factor than the (11/2) ε^{-1} log(εn) space bound of the original GK sketch. Our algorithm features an easier analysis and still achieves the same optimal asymptotic space complexity as the original GK sketch. Lastly, our simplification enables an efficient data structure implementation, with a worst-case runtime of O(log(1/ε) + log log(εn)) per element for the ordinary ε-approximate quantile estimation problem. Also, for the related "weighted" quantile estimation problem, we give efficient data structures for our simplified algorithm which guarantee a worst-case per-element runtime of O(log(1/ε) + log log(ε W_n / w_min)), improving over the previous upper bound of Assadi et al. (2023).
{"title":"Simple & Optimal Quantile Sketch: Combining Greenwald-Khanna with Khanna-Greenwald","authors":"Elena Gribelyuk, Pachara Sawettamalya, Hongxun Wu, Huacheng Yu","doi":"10.1145/3651610","DOIUrl":"https://doi.org/10.1145/3651610","url":null,"abstract":"Estimating the ε-approximate quantiles or ranks of a stream is a fundamental task in data monitoring. Given a stream x_1,..., x_n from a universe mathcalU with total order, an additive-error quantile sketch mathcalM allows us to approximate the rank of any query yin mathcalU up to additive ε n error. In 2001, Greenwald and Khanna gave a deterministic algorithm (GK sketch) that solves the ε-approximate quantiles estimation problem using O(ε^-1 łog(ε n)) space citegreenwald2001space ; recently, this algorithm was shown to be optimal by Cormode and Vesleý in 2020 citecormode2020tight. However, due to the intricacy of the GK sketch and its analysis, over-simplified versions of the algorithm are implemented in practical applications, often without any known theoretical guarantees. In fact, it has remained an open question whether the GK sketch can be simplified while maintaining the optimal space bound. In this paper, we resolve this open question by giving a simplified deterministic algorithm that stores at most (2 + o(1))ε^-1 łog (ε n) elements and solves the additive-error quantile estimation problem; as a side benefit, our algorithm achieves a smaller constant factor than the frac11 2 ε^-1 łog(ε n) space bound in the original GK sketch~citegreenwald2001space. Our algorithm features an easier analysis and still achieves the same optimal asymptotic space complexity as the original GK sketch. Lastly, our simplification enables an efficient data structure implementation, with a worst-case runtime of O(łog(1/ε) + łog łog (ε n)) per-element for the ordinary ε-approximate quantile estimation problem. Also, for the related \"weighted'' quantile estimation problem, we give efficient data structures for our simplified algorithm which guarantee a worst-case per-element runtime of O(łog(1/ε) + łog łog (ε W_n/w_textrmmin )), achieving an improvement over the previous upper bound of citeassadi2023generalizing.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140991975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bart Bogaerts, Maxime Jakubowski, Jan Van den Bussche
Instance-based provenance is an explanation for a query result in the form of a subinstance of the database. We investigate different desiderata one may want to impose on these subinstances. Concretely, we consider seven basic postulates for provenance. Six of them relate subinstances to provenance polynomials, three-valued semantics, and Halpern-Pearl causality. Determinism of the provenance mechanism is the seventh basic postulate. Moreover, we consider the postulate of minimality, which can be imposed with respect to any set of basic postulates. Our main technical contribution is an analysis and characterisation of which combinations of postulates are jointly satisfiable. Our main conceptual contribution is an approach to instance-based provenance through three-valued instances, which makes it applicable to first-order logic queries involving negation.
{"title":"Postulates for Provenance: Instance-based provenance for first-order logic","authors":"Bart Bogaerts, Maxime Jakubowski, Jan Van den Bussche","doi":"10.1145/3651596","DOIUrl":"https://doi.org/10.1145/3651596","url":null,"abstract":"Instance-based provenance is an explanation for a query result in the form of a subinstance of the database. We investigate different desiderata one may want to impose on these subinstances. Concretely we consider seven basic postulates for provenance. Six of them relate subinstances to provenance polynomials, three-valued semantics, and Halpern-Pearl causality. Determinism of the provenance mechanism is the seventh basic postulate. Moreover, we consider the postulate of minimality, which can be imposed with respect to any set of basic postulates. Our main technical contribution is an analysis and characterisation of which combinations of postulates are jointly satisfiable. Our main conceptual contribution is an approach to instance-based provenance through three-valued instances, which makes it applicable to first-order logic queries involving negation.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140990512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Austen Z. Fan, Paraschos Koutris, Hangdong Zhao
In this paper, we ask the following question: given a Boolean Conjunctive Query (CQ), what is the smallest circuit that computes the provenance polynomial of the query over a given semiring? We answer this question by giving upper and lower bounds. Notably, we show that any circuit F that computes a CQ over the tropical semiring must have size log |F| ≥ (1-ε) · da-entw for any ε > 0, where da-entw is the degree-aware entropic width of the query. We show a circuit construction that matches this bound when the semiring is idempotent. The techniques we use combine several central notions in database theory: provenance polynomials, tree decompositions, and disjunctive Datalog programs. We extend our results to lower and upper bounds for formulas (i.e., circuits where each gate has outdegree one), and to bounds for non-Boolean CQs.
{"title":"Tight Bounds of Circuits for Sum-Product Queries","authors":"Austen Z. Fan, Paraschos Koutris, Hangdong Zhao","doi":"10.1145/3651588","DOIUrl":"https://doi.org/10.1145/3651588","url":null,"abstract":"In this paper, we ask the following question: given a Boolean Conjunctive Query (CQ), what is the smallest circuit that computes the provenance polynomial of the query over a given semiring? We answer this question by giving upper and lower bounds. Notably, it is shown that any circuit F that computes a CQ over the tropical semiring must have size log |F| ≥ (1-ε) · da-entw for any ε >0, where da-entw is the degree-aware entropic width of the query. We show a circuit construction that matches this bound when the semiring is idempotent. The techniques we use combine several central notions in database theory: provenance polynomials, tree decompositions, and disjunctive Datalog programs. We extend our results to lower and upper bounds for formulas (i.e., circuits where each gate has outdegree one), and to bounds for non-Boolean CQs.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 10","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140993058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Aiswarya, Diego Calvanese, Francesco Di Cosmo, Marco Montali
We study verification of reachability properties over Communicating Datalog Programs (CDPs), which are networks of relational nodes connected through unordered channels and running Datalog-like computations. Each node manipulates a local state database (DB), depending on incoming messages and additional input DBs from external services. Decidability of verification for CDPs has so far been established only under boundedness assumptions on the state and channel sizes, showing at the same time undecidability of reachability for unbounded states with only two unary relations or unbounded channels with a single binary relation. The goal of this paper is to study the open case of CDPs with bounded states and unbounded channels, under the assumption that channels carry unary relations only. We discuss the significance of the resulting model and prove the decidability of verification of variants of reachability, captured in fragments of first-order CTL. We do so through a novel reduction to coverability problems in a class of high-level Petri Nets that manipulate unordered data identifiers. We study the tightness of our results, showing that minor generalizations of the considered reachability properties yield undecidability of verification, both for CDPs and the corresponding Petri Net model.
{"title":"Verification of Unary Communicating Datalog Programs","authors":"C. Aiswarya, D. Calvanese, Francesco Di Cosmo, M. Montali","doi":"10.1145/3651590","DOIUrl":"https://doi.org/10.1145/3651590","url":null,"abstract":"We study verification of reachability properties over Communicating Datalog Programs (CDPs), which are networks of relational nodes connected through unordered channels and running Datalog-like computations. Each node manipulates a local state database (DB), depending on incoming messages and additional input DBs from external services. Decidability of verification for CDPs has so far been established only under boundedness assumptions on the state and channel sizes, showing at the same time undecidability of reachability for unbounded states with only two unary relations or unbounded channels with a single binary relation. The goal of this paper is to study the open case of CDPs with bounded states and unbounded channels, under the assumption that channels carry unary relations only. We discuss the significance of the resulting model and prove the decidability of verification of variants of reachability, captured in fragments of first-order CTL. We do so through a novel reduction to coverability problems in a class of high-level Petri Nets that manipulate unordered data identifiers. We study the tightness of our results, showing that minor generalizations of the considered reachability properties yield undecidability of verification, both for CDPs and the corresponding Petri Net model.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140990213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Pavan, Sourav Chakraborty, N. V. Vinodchandran, Kuldeep S. Meel
In today's digital age, it is becoming increasingly prevalent to retain digital footprints in the cloud indefinitely. Nonetheless, there is a valid argument that entities should have the authority to decide whether their personal data remains within a specific database or is expunged. Indeed, nations across the globe are increasingly enacting legislation to uphold the "Right To Be Forgotten" for individuals. Investigating computational challenges, including the formalization and implementation of this notion, is crucial due to its relevance in the domains of data privacy and management. This work introduces a new streaming model: the 'Right to be Forgotten Data Streaming Model' (RFDS model). The main feature of this model is that any element in the stream has the right to have its history removed from the stream. Formally, the input is a stream of updates of the form (a, Δ), where Δ ∈ {+, ⊥} and a is an element from a universe U. When the update Δ = + occurs, the frequency of a, denoted f_a, is incremented to f_a + 1. When the update Δ = ⊥ occurs, f_a is set to 0. This feature, which represents the forget request, distinguishes the present model from existing data streaming models. This work systematically investigates computational challenges that arise while incorporating the notion of the right to be forgotten. Our initial considerations reveal that even estimating F_1 (the sum of the frequencies of elements) of the stream is a non-trivial problem in this model. Based on these initial investigations, we focus on a modified model, which we call α-RFDS, where forget operations are limited to at most an α fraction. In this modified model, we focus on estimating F_0 (the number of distinct elements) and F_1. We present algorithms and establish almost-matching lower bounds on the space complexity for these computational tasks.
{"title":"On the Feasibility of Forgetting in Data Streams","authors":"A. Pavan, Sourav Chakraborty, N. V. Vinodchandran, Kuldeep S. Meel","doi":"10.1145/3651603","DOIUrl":"https://doi.org/10.1145/3651603","url":null,"abstract":"In today's digital age, it is becoming increasingly prevalent to retain digital footprints in the cloud indefinitely. Nonetheless, there is a valid argument that entities should have the authority to decide whether their personal data remains within a specific database or is expunged. Indeed, nations across the globe are increasingly enacting legislation to uphold the \"Right To Be Forgotten\" for individuals. Investigating computational challenges, including the formalization and implementation of this notion, is crucial due to its relevance in the domains of data privacy and management.\u0000 \u0000 This work introduces a new streaming model: the 'Right to be Forgotten Data Streaming Model' (RFDS model). The main feature of this model is that any element in the stream has the right to have its history removed from the stream. Formally, the input is a stream of updates of the form (a, Δ) where Δ ∈ {+, ⊥} and a is an element from a universe U. When the update Δ=+ occurs, the frequency of a, denoted as f\u0000 a\u0000 , is incremented to f\u0000 a\u0000 +1. When the update Δ=⊥, occurs, f\u0000 a\u0000 is set to 0. This feature, which represents the forget request, distinguishes the present model from existing data streaming models.\u0000 \u0000 \u0000 This work systematically investigates computational challenges that arise while incorporating the notion of the right to be forgotten. Our initial considerations reveal that even estimating F\u0000 1\u0000 (sum of the frequencies of elements) of the stream is a non-trivial problem in this model. Based on the initial investigations, we focus on a modified model which we call α-RFDS where we limit the number of forget operations to be at most α fraction. In this modified model, we focus on estimating F\u0000 0\u0000 (number of distinct elements) and F\u0000 1\u0000 . We present algorithms and establish almost-matching lower bounds on the space complexity for these computational tasks.\u0000","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 98","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140991585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panagiotis Sioulas, Ioannis Mytilinis, Anastasia Ailamaki
Interactive applications require processing tens to hundreds of concurrent analytical queries within tight time constraints. In such setups, where high concurrency causes contention, work-sharing databases are critical for improving scalability and for bounding the increase in response time. However, as such databases share data access using full scans and expensive shared filters, they suffer from a data-access bottleneck that jeopardizes interactivity. We present SH2O: a novel data-access operator that addresses the data-access bottleneck of work-sharing databases. SH2O is based on the idea that an access pattern based on judiciously selected multidimensional ranges can replace a set of shared filters. To exploit the idea in an efficient and scalable manner, SH2O uses a three-tier approach: i) it uses spatial indices to efficiently access the ranges without overfetching, ii) it uses an optimizer to choose which filters to replace such that it maximizes the cost-benefit of index accesses, and iii) it exploits partitioning schemes and independently accesses each data partition to reduce the number of filters in the access pattern. Furthermore, we propose a tuning strategy that chooses a partitioning and indexing scheme that minimizes SH2O's cost for a target workload. Our evaluation shows a speedup of 1.8-22.2× for batches of hundreds of data-access-bound queries.
{"title":"SH2O: Efficient Data Access for Work-Sharing Databases","authors":"Panagiotis Sioulas, Ioannis Mytilinis, Anastasia Ailamaki","doi":"10.1145/3617340","DOIUrl":"https://doi.org/10.1145/3617340","url":null,"abstract":"Interactive applications require processing tens to hundreds of concurrent analytical queries within tight time constraints. In such setups, where high concurrency causes contention, work-sharing databases are critical for improving scalability and for bounding the increase in response time. However, as such databases share data access using full scans and expensive shared filters, they suffer from a data-access bottleneck that jeopardizes interactivity. We present SH2O: a novel data-access operator that addresses the data-access bottleneck of work-sharing databases. SH2O is based on the idea that an access pattern based on judiciously selected multidimensional ranges can replace a set of shared filters. To exploit the idea in an efficient and scalable manner, SH2O uses a three-tier approach: i) it uses spatial indices to efficiently access the ranges without overfetching, ii) it uses an optimizer to choose which filters to replace such that it maximizes cost-benefit for index accesses, and iii) it exploits partitioning schemes and independently accesses each data partition to reduce the number of filters in the access pattern. Furthermore, we propose a tuning strategy that chooses a partitioning and indexing scheme that minimizes SH2O's cost for a target workload. Our evaluation shows a speedup of 1.8-22.2 for batches of hundreds of data-access-bound queries.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"34 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136282515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wensheng Luo, Qiaoyuan Yang, Yixiang Fang, Xu Zhou
As an important cohesive subgraph model in bipartite graphs, the (α, β)-core (a.k.a. bi-core) has found a wide spectrum of real-world applications, such as product recommendation, fraudster detection, and community search. In these applications, the bipartite graphs are often large and dynamic, where vertices and edges are inserted and deleted frequently, so it is costly to recompute (α, β)-cores from scratch when the graph has changed. Recently, a few works have attempted to study how to maintain (α, β)-cores in the dynamic bipartite graph, but their performance is still far from perfect, due to the huge size of graphs and their frequent changes. To alleviate this issue, in this paper we present efficient (α, β)-core maintenance algorithms over bipartite graphs. We first introduce a novel concept, called bi-core numbers, for the vertices of bipartite graphs. Based on this concept, we theoretically analyze the effect of inserting and deleting edges on the changes of vertices' bi-core numbers, which can be further used to narrow down the scope of the updates, thereby reducing the computational redundancy. We then propose efficient (α, β)-core maintenance algorithms for handling the edge insertion and edge deletion respectively, by exploiting the above theoretical analysis results. Finally, extensive experimental evaluations are performed on both real and synthetic datasets, and the results show that our proposed algorithms are up to two orders of magnitude faster than the state-of-the-art approaches.
{"title":"Efficient Core Maintenance in Large Bipartite Graphs","authors":"Wensheng Luo, Qiaoyuan Yang, Yixiang Fang, Xu Zhou","doi":"10.1145/3617329","DOIUrl":"https://doi.org/10.1145/3617329","url":null,"abstract":"As an important cohesive subgraph model in bipartite graphs, the (α, β)-core (a.k.a. bi-core) has found a wide spectrum of real-world applications, such as product recommendation, fraudster detection, and community search. In these applications, the bipartite graphs are often large and dynamic, where vertices and edges are inserted and deleted frequently, so it is costly to recompute (α, β)-cores from scratch when the graph has changed. Recently, a few works have attempted to study how to maintain (α, β)-cores in the dynamic bipartite graph, but their performance is still far from perfect, due to the huge size of graphs and their frequent changes. To alleviate this issue, in this paper we present efficient (α, β)-core maintenance algorithms over bipartite graphs. We first introduce a novel concept, called bi-core numbers, for the vertices of bipartite graphs. Based on this concept, we theoretically analyze the effect of inserting and deleting edges on the changes of vertices' bi-core numbers, which can be further used to narrow down the scope of the updates, thereby reducing the computational redundancy. We then propose efficient (α, β)-core maintenance algorithms for handling the edge insertion and edge deletion respectively, by exploiting the above theoretical analysis results. Finally, extensive experimental evaluations are performed on both real and synthetic datasets, and the results show that our proposed algorithms are up to two orders of magnitude faster than the state-of-the-art approaches.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"35 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136281449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}