Peter Lindner, Sachin Basil John, Christoph Koch, Dan Suciu
We investigate an approximation algorithm for various aggregate queries on partially materialized data cubes. Data cubes are interpreted as probability distributions, and cuboids from a partial materialization populate the terms of a series expansion of the target query distribution. Unknown terms in the expansion are simply assumed to be 0 in order to recover an approximate query result. We identify this method as a variant of approaches known from other fields of science, namely the Bahadur representation and, more generally, (biased) Fourier expansions of Boolean functions. Existing literature indicates a rich but intricate theoretical landscape. Focusing on the data cube application, we start by investigating worst-case error bounds. We build upon prior work to obtain provably optimal materialization strategies with respect to query workloads. In addition, we propose a new heuristic method governing materialization decisions. Finally, we show that well-approximated queries are guaranteed to have well-approximated roll-ups.
{"title":"The Moments Method for Approximate Data Cube Queries","authors":"Peter Lindner, Sachin Basil John, Christoph Koch, D. Suciu","doi":"10.1145/3651147","DOIUrl":"https://doi.org/10.1145/3651147","url":null,"abstract":"We investigate an approximation algorithm for various aggregate queries on partially materialized data cubes. Data cubes are interpreted as probability distributions, and cuboids from a partial materialization populate the terms of a series expansion of the target query distribution. Unknown terms in the expansion are just assumed to be 0 in order to recover an approximate query result. We identify this method as a variant of related approaches from other fields of science, that is, the Bahadur representation and, more generally, (biased) Fourier expansions of Boolean functions. Existing literature indicates a rich but intricate theoretical landscape. Focusing on the data cube application, we start by investigating worst-case error bounds. We build upon prior work to obtain provably optimal materialization strategies with respect to query workloads. In addition, we propose a new heuristic method governing materialization decisions. Finally, we show that well-approximated queries are guaranteed to have well-approximated roll-ups.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 23","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140993290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael A. Bender, Martín Farach-Colton, Michael T. Goodrich, Hanna Komlós
A data structure is history independent if its internal representation reveals nothing about the history of operations beyond what can be determined from the current contents of the data structure. History independence is typically viewed as a security or privacy guarantee, with the intent being to minimize risks incurred by a security breach or audit. Despite widespread advances in history independence, there is an important data-structural primitive that previous work has been unable to replace with an equivalent history-independent alternative---dynamic partitioning. In dynamic partitioning, we are given a dynamic set S of ordered elements and a size-parameter B, and the objective is to maintain a partition of S into ordered groups, each of size Θ(B). Dynamic partitioning is important throughout computer science, with applications to B-tree rebalancing, write-optimized dictionaries, log-structured merge trees, other external-memory indexes, geometric and spatial data structures, cache-oblivious data structures, and order-maintenance data structures. The lack of a history-independent dynamic-partitioning primitive has meant that designers of history-independent data structures have had to resort to complex alternatives. In this paper, we achieve history-independent dynamic partitioning. Our algorithm runs asymptotically optimally against an oblivious adversary, processing each insert/delete with O(1) operations in expectation and O(B log N / log log N) operations with high probability, where N is the size of the set.
{"title":"History-Independent Dynamic Partitioning: Operation-Order Privacy in Ordered Data Structures","authors":"Michael A. Bender, Martín Farach-Colton, Michael T. Goodrich, Hanna Komlós","doi":"10.1145/3651609","DOIUrl":"https://doi.org/10.1145/3651609","url":null,"abstract":"A data structure is history independent if its internal representation reveals nothing about the history of operations beyond what can be determined from the current contents of the data structure. History independence is typically viewed as a security or privacy guarantee, with the intent being to minimize risks incurred by a security breach or audit. Despite widespread advances in history independence, there is an important data-structural primitive that previous work has been unable to replace with an equivalent history-independent alternative---dynamic partitioning. In dynamic partitioning, we are given a dynamic set S of ordered elements and a size-parameter B, and the objective is to maintain a partition of S into ordered groups, each of size Θ(B). Dynamic partitioning is important throughout computer science, with applications to B-tree rebalancing, write-optimized dictionaries, log-structured merge trees, other external-memory indexes, geometric and spatial data structures, cache-oblivious data structures, and order-maintenance data structures. The lack of a history-independent dynamic-partitioning primitive has meant that designers of history-independent data structures have had to resort to complex alternatives. In this paper, we achieve history-independent dynamic partitioning. Our algorithm runs asymptotically optimally against an oblivious adversary, processing each insert/delete with O(1) operations in expectation and O(B log N/loglog N) with high probability in set size N.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 10","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140991400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christoph Dorn, Haikal Pribadi
Relational data modeling can often be restrictive as it provides no direct facility for modeling polymorphic types, reified relations, multi-valued attributes, and other common high-level structures in data. This creates many challenges in data modeling and engineering tasks, and has led to the rise of more flexible NoSQL databases, such as graph and document databases. In the absence of structured schemas, however, we can neither express nor validate the intention of data models, making long-term maintenance of databases substantially more difficult. To resolve this dilemma, we argue that, parallel to the role of classical predicate logic for relational algebra, contemporary foundations of mathematics rooted in type theory can guide us in the development of powerful new high-level data models and query languages. To this end, we introduce a new polymorphic entity-relation-attribute (PERA) data model, grounded in type-theoretic principles and accessible through classical conceptual modeling, with a near-natural query language: TypeQL. We illustrate the syntax of TypeQL as well as its denotation in the PERA model, formalize our model as an algebraic theory with dependent types, and describe its stratified semantics.
{"title":"TypeQL: A Type-Theoretic & Polymorphic Query Language","authors":"Christoph Dorn, Haikal Pribadi","doi":"10.1145/3651611","DOIUrl":"https://doi.org/10.1145/3651611","url":null,"abstract":"Relational data modeling can often be restrictive as it provides no direct facility for modeling polymorphic types, reified relations, multi-valued attributes, and other common high-level structures in data. This creates many challenges in data modeling and engineering tasks, and has led to the rise of more flexible NoSQL databases, such as graph and document databases. In the absence of structured schemas, however, we can neither express nor validate the intention of data models, making long-term maintenance of databases substantially more difficult. To resolve this dilemma, we argue that, parallel to the role of classical predicate logic for relational algebra, contemporary foundations of mathematics rooted in type theory can guide us in the development of powerful new high-level data models and query languages. To this end, we introduce a new polymorphic entity-relation-attribute (PERA) data model, grounded in type-theoretic principles and accessible through classical conceptual modeling, with a near-natural query language: TypeQL. We illustrate the syntax of TypeQL as well as its denotation in the PERA model, formalize our model as an algebraic theory with dependent types, and describe its stratified semantics.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 24","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140992582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Elena Gribelyuk, Pachara Sawettamalya, Hongxun Wu, Huacheng Yu
Estimating the ε-approximate quantiles or ranks of a stream is a fundamental task in data monitoring. Given a stream x_1, ..., x_n from a universe U with a total order, an additive-error quantile sketch M allows us to approximate the rank of any query y ∈ U up to additive εn error. In 2001, Greenwald and Khanna gave a deterministic algorithm (the GK sketch) that solves the ε-approximate quantile estimation problem using O(ε^{-1} log(εn)) space; recently, this algorithm was shown to be optimal by Cormode and Veselý in 2020. However, due to the intricacy of the GK sketch and its analysis, over-simplified versions of the algorithm are implemented in practical applications, often without any known theoretical guarantees. In fact, it has remained an open question whether the GK sketch can be simplified while maintaining the optimal space bound. In this paper, we resolve this open question by giving a simplified deterministic algorithm that stores at most (2 + o(1)) ε^{-1} log(εn) elements and solves the additive-error quantile estimation problem; as a side benefit, our algorithm achieves a smaller constant factor than the (11/2) ε^{-1} log(εn) space bound of the original GK sketch. Our algorithm features an easier analysis and still achieves the same optimal asymptotic space complexity as the original GK sketch. Lastly, our simplification enables an efficient data structure implementation, with a worst-case runtime of O(log(1/ε) + log log(εn)) per element for the ordinary ε-approximate quantile estimation problem. Also, for the related "weighted" quantile estimation problem, we give efficient data structures for our simplified algorithm which guarantee a worst-case per-element runtime of O(log(1/ε) + log log(ε W_n / w_min)), improving over the previous upper bound of Assadi et al. (2023).
{"title":"Simple & Optimal Quantile Sketch: Combining Greenwald-Khanna with Khanna-Greenwald","authors":"Elena Gribelyuk, Pachara Sawettamalya, Hongxun Wu, Huacheng Yu","doi":"10.1145/3651610","DOIUrl":"https://doi.org/10.1145/3651610","url":null,"abstract":"Estimating the ε-approximate quantiles or ranks of a stream is a fundamental task in data monitoring. Given a stream x_1,..., x_n from a universe mathcalU with total order, an additive-error quantile sketch mathcalM allows us to approximate the rank of any query yin mathcalU up to additive ε n error. In 2001, Greenwald and Khanna gave a deterministic algorithm (GK sketch) that solves the ε-approximate quantiles estimation problem using O(ε^-1 łog(ε n)) space citegreenwald2001space ; recently, this algorithm was shown to be optimal by Cormode and Vesleý in 2020 citecormode2020tight. However, due to the intricacy of the GK sketch and its analysis, over-simplified versions of the algorithm are implemented in practical applications, often without any known theoretical guarantees. In fact, it has remained an open question whether the GK sketch can be simplified while maintaining the optimal space bound. In this paper, we resolve this open question by giving a simplified deterministic algorithm that stores at most (2 + o(1))ε^-1 łog (ε n) elements and solves the additive-error quantile estimation problem; as a side benefit, our algorithm achieves a smaller constant factor than the frac11 2 ε^-1 łog(ε n) space bound in the original GK sketch~citegreenwald2001space. Our algorithm features an easier analysis and still achieves the same optimal asymptotic space complexity as the original GK sketch. Lastly, our simplification enables an efficient data structure implementation, with a worst-case runtime of O(łog(1/ε) + łog łog (ε n)) per-element for the ordinary ε-approximate quantile estimation problem. Also, for the related \"weighted'' quantile estimation problem, we give efficient data structures for our simplified algorithm which guarantee a worst-case per-element runtime of O(łog(1/ε) + łog łog (ε W_n/w_textrmmin )), achieving an improvement over the previous upper bound of citeassadi2023generalizing.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140991975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bart Bogaerts, Maxime Jakubowski, Jan Van den Bussche
Instance-based provenance is an explanation for a query result in the form of a subinstance of the database. We investigate different desiderata one may want to impose on these subinstances. Concretely, we consider seven basic postulates for provenance. Six of them relate subinstances to provenance polynomials, three-valued semantics, and Halpern-Pearl causality. Determinism of the provenance mechanism is the seventh basic postulate. Moreover, we consider the postulate of minimality, which can be imposed with respect to any set of basic postulates. Our main technical contribution is an analysis and characterisation of which combinations of postulates are jointly satisfiable. Our main conceptual contribution is an approach to instance-based provenance through three-valued instances, which makes it applicable to first-order logic queries involving negation.
{"title":"Postulates for Provenance: Instance-based provenance for first-order logic","authors":"Bart Bogaerts, Maxime Jakubowski, Jan Van den Bussche","doi":"10.1145/3651596","DOIUrl":"https://doi.org/10.1145/3651596","url":null,"abstract":"Instance-based provenance is an explanation for a query result in the form of a subinstance of the database. We investigate different desiderata one may want to impose on these subinstances. Concretely we consider seven basic postulates for provenance. Six of them relate subinstances to provenance polynomials, three-valued semantics, and Halpern-Pearl causality. Determinism of the provenance mechanism is the seventh basic postulate. Moreover, we consider the postulate of minimality, which can be imposed with respect to any set of basic postulates. Our main technical contribution is an analysis and characterisation of which combinations of postulates are jointly satisfiable. Our main conceptual contribution is an approach to instance-based provenance through three-valued instances, which makes it applicable to first-order logic queries involving negation.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140990512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Austen Z. Fan, Paraschos Koutris, Hangdong Zhao
In this paper, we ask the following question: given a Boolean Conjunctive Query (CQ), what is the smallest circuit that computes the provenance polynomial of the query over a given semiring? We answer this question by giving upper and lower bounds. Notably, we show that any circuit F that computes a CQ over the tropical semiring must have size log |F| ≥ (1-ε) · da-entw for any ε > 0, where da-entw is the degree-aware entropic width of the query. We show a circuit construction that matches this bound when the semiring is idempotent. The techniques we use combine several central notions in database theory: provenance polynomials, tree decompositions, and disjunctive Datalog programs. We extend our results to lower and upper bounds for formulas (i.e., circuits where each gate has outdegree one), and to bounds for non-Boolean CQs.
{"title":"Tight Bounds of Circuits for Sum-Product Queries","authors":"Austen Z. Fan, Paraschos Koutris, Hangdong Zhao","doi":"10.1145/3651588","DOIUrl":"https://doi.org/10.1145/3651588","url":null,"abstract":"In this paper, we ask the following question: given a Boolean Conjunctive Query (CQ), what is the smallest circuit that computes the provenance polynomial of the query over a given semiring? We answer this question by giving upper and lower bounds. Notably, it is shown that any circuit F that computes a CQ over the tropical semiring must have size log |F| ≥ (1-ε) · da-entw for any ε >0, where da-entw is the degree-aware entropic width of the query. We show a circuit construction that matches this bound when the semiring is idempotent. The techniques we use combine several central notions in database theory: provenance polynomials, tree decompositions, and disjunctive Datalog programs. We extend our results to lower and upper bounds for formulas (i.e., circuits where each gate has outdegree one), and to bounds for non-Boolean CQs.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 10","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140993058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Aiswarya, Diego Calvanese, Francesco Di Cosmo, Marco Montali
We study verification of reachability properties over Communicating Datalog Programs (CDPs), which are networks of relational nodes connected through unordered channels and running Datalog-like computations. Each node manipulates a local state database (DB), depending on incoming messages and additional input DBs from external services. Decidability of verification for CDPs has so far been established only under boundedness assumptions on the state and channel sizes, showing at the same time undecidability of reachability for unbounded states with only two unary relations or unbounded channels with a single binary relation. The goal of this paper is to study the open case of CDPs with bounded states and unbounded channels, under the assumption that channels carry unary relations only. We discuss the significance of the resulting model and prove the decidability of verification of variants of reachability, captured in fragments of first-order CTL. We do so through a novel reduction to coverability problems in a class of high-level Petri Nets that manipulate unordered data identifiers. We study the tightness of our results, showing that minor generalizations of the considered reachability properties yield undecidability of verification, both for CDPs and the corresponding Petri Net model.
{"title":"Verification of Unary Communicating Datalog Programs","authors":"C. Aiswarya, D. Calvanese, Francesco Di Cosmo, M. Montali","doi":"10.1145/3651590","DOIUrl":"https://doi.org/10.1145/3651590","url":null,"abstract":"We study verification of reachability properties over Communicating Datalog Programs (CDPs), which are networks of relational nodes connected through unordered channels and running Datalog-like computations. Each node manipulates a local state database (DB), depending on incoming messages and additional input DBs from external services. Decidability of verification for CDPs has so far been established only under boundedness assumptions on the state and channel sizes, showing at the same time undecidability of reachability for unbounded states with only two unary relations or unbounded channels with a single binary relation. The goal of this paper is to study the open case of CDPs with bounded states and unbounded channels, under the assumption that channels carry unary relations only. We discuss the significance of the resulting model and prove the decidability of verification of variants of reachability, captured in fragments of first-order CTL. We do so through a novel reduction to coverability problems in a class of high-level Petri Nets that manipulate unordered data identifiers. We study the tightness of our results, showing that minor generalizations of the considered reachability properties yield undecidability of verification, both for CDPs and the corresponding Petri Net model.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140990213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Pavan, Sourav Chakraborty, N. V. Vinodchandran, Kuldeep S. Meel
In today's digital age, it is becoming increasingly prevalent to retain digital footprints in the cloud indefinitely. Nonetheless, there is a valid argument that entities should have the authority to decide whether their personal data remains within a specific database or is expunged. Indeed, nations across the globe are increasingly enacting legislation to uphold the "Right To Be Forgotten" for individuals. Investigating computational challenges, including the formalization and implementation of this notion, is crucial due to its relevance in the domains of data privacy and management. This work introduces a new streaming model: the 'Right to be Forgotten Data Streaming Model' (RFDS model). The main feature of this model is that any element in the stream has the right to have its history removed from the stream. Formally, the input is a stream of updates of the form (a, Δ), where Δ ∈ {+, ⊥} and a is an element from a universe U. When the update Δ = + occurs, the frequency of a, denoted f_a, is incremented to f_a + 1. When the update Δ = ⊥ occurs, f_a is set to 0. This feature, which represents the forget request, distinguishes the present model from existing data streaming models. This work systematically investigates computational challenges that arise while incorporating the notion of the right to be forgotten. Our initial considerations reveal that even estimating F_1 (the sum of the frequencies of elements) of the stream is a non-trivial problem in this model. Based on these initial investigations, we focus on a modified model, which we call α-RFDS, where forget operations are limited to at most an α fraction. In this modified model, we focus on estimating F_0 (the number of distinct elements) and F_1. We present algorithms and establish almost-matching lower bounds on the space complexity for these computational tasks.
{"title":"On the Feasibility of Forgetting in Data Streams","authors":"A. Pavan, Sourav Chakraborty, N. V. Vinodchandran, Kuldeep S. Meel","doi":"10.1145/3651603","DOIUrl":"https://doi.org/10.1145/3651603","url":null,"abstract":"In today's digital age, it is becoming increasingly prevalent to retain digital footprints in the cloud indefinitely. Nonetheless, there is a valid argument that entities should have the authority to decide whether their personal data remains within a specific database or is expunged. Indeed, nations across the globe are increasingly enacting legislation to uphold the \"Right To Be Forgotten\" for individuals. Investigating computational challenges, including the formalization and implementation of this notion, is crucial due to its relevance in the domains of data privacy and management.\u0000 \u0000 This work introduces a new streaming model: the 'Right to be Forgotten Data Streaming Model' (RFDS model). The main feature of this model is that any element in the stream has the right to have its history removed from the stream. Formally, the input is a stream of updates of the form (a, Δ) where Δ ∈ {+, ⊥} and a is an element from a universe U. When the update Δ=+ occurs, the frequency of a, denoted as f\u0000 a\u0000 , is incremented to f\u0000 a\u0000 +1. When the update Δ=⊥, occurs, f\u0000 a\u0000 is set to 0. This feature, which represents the forget request, distinguishes the present model from existing data streaming models.\u0000 \u0000 \u0000 This work systematically investigates computational challenges that arise while incorporating the notion of the right to be forgotten. Our initial considerations reveal that even estimating F\u0000 1\u0000 (sum of the frequencies of elements) of the stream is a non-trivial problem in this model. Based on the initial investigations, we focus on a modified model which we call α-RFDS where we limit the number of forget operations to be at most α fraction. In this modified model, we focus on estimating F\u0000 0\u0000 (number of distinct elements) and F\u0000 1\u0000 . We present algorithms and establish almost-matching lower bounds on the space complexity for these computational tasks.\u0000","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":" 98","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140991585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panagiotis Sioulas, Ioannis Mytilinis, Anastasia Ailamaki
Interactive applications require processing tens to hundreds of concurrent analytical queries within tight time constraints. In such setups, where high concurrency causes contention, work-sharing databases are critical for improving scalability and for bounding the increase in response time. However, as such databases share data access using full scans and expensive shared filters, they suffer from a data-access bottleneck that jeopardizes interactivity. We present SH2O: a novel data-access operator that addresses the data-access bottleneck of work-sharing databases. SH2O is based on the idea that an access pattern based on judiciously selected multidimensional ranges can replace a set of shared filters. To exploit the idea in an efficient and scalable manner, SH2O uses a three-tier approach: i) it uses spatial indices to efficiently access the ranges without overfetching, ii) it uses an optimizer to choose which filters to replace such that it maximizes the cost-benefit of index accesses, and iii) it exploits partitioning schemes and independently accesses each data partition to reduce the number of filters in the access pattern. Furthermore, we propose a tuning strategy that chooses a partitioning and indexing scheme that minimizes SH2O's cost for a target workload. Our evaluation shows a speedup of 1.8-22.2× for batches of hundreds of data-access-bound queries.
{"title":"SH2O: Efficient Data Access for Work-Sharing Databases","authors":"Panagiotis Sioulas, Ioannis Mytilinis, Anastasia Ailamaki","doi":"10.1145/3617340","DOIUrl":"https://doi.org/10.1145/3617340","url":null,"abstract":"Interactive applications require processing tens to hundreds of concurrent analytical queries within tight time constraints. In such setups, where high concurrency causes contention, work-sharing databases are critical for improving scalability and for bounding the increase in response time. However, as such databases share data access using full scans and expensive shared filters, they suffer from a data-access bottleneck that jeopardizes interactivity. We present SH2O: a novel data-access operator that addresses the data-access bottleneck of work-sharing databases. SH2O is based on the idea that an access pattern based on judiciously selected multidimensional ranges can replace a set of shared filters. To exploit the idea in an efficient and scalable manner, SH2O uses a three-tier approach: i) it uses spatial indices to efficiently access the ranges without overfetching, ii) it uses an optimizer to choose which filters to replace such that it maximizes cost-benefit for index accesses, and iii) it exploits partitioning schemes and independently accesses each data partition to reduce the number of filters in the access pattern. Furthermore, we propose a tuning strategy that chooses a partitioning and indexing scheme that minimizes SH2O's cost for a target workload. Our evaluation shows a speedup of 1.8-22.2 for batches of hundreds of data-access-bound queries.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"34 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136282515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wensheng Luo, Qiaoyuan Yang, Yixiang Fang, Xu Zhou
As an important cohesive subgraph model in bipartite graphs, the (α, β)-core (a.k.a. bi-core) has found a wide spectrum of real-world applications, such as product recommendation, fraudster detection, and community search. In these applications, the bipartite graphs are often large and dynamic, where vertices and edges are inserted and deleted frequently, so it is costly to recompute (α, β)-cores from scratch when the graph has changed. Recently, a few works have attempted to study how to maintain (α, β)-cores in the dynamic bipartite graph, but their performance is still far from perfect, due to the huge size of graphs and their frequent changes. To alleviate this issue, in this paper we present efficient (α, β)-core maintenance algorithms over bipartite graphs. We first introduce a novel concept, called bi-core numbers, for the vertices of bipartite graphs. Based on this concept, we theoretically analyze the effect of inserting and deleting edges on the changes of vertices' bi-core numbers, which can be further used to narrow down the scope of the updates, thereby reducing the computational redundancy. We then propose efficient (α, β)-core maintenance algorithms for handling the edge insertion and edge deletion respectively, by exploiting the above theoretical analysis results. Finally, extensive experimental evaluations are performed on both real and synthetic datasets, and the results show that our proposed algorithms are up to two orders of magnitude faster than the state-of-the-art approaches.
{"title":"Efficient Core Maintenance in Large Bipartite Graphs","authors":"Wensheng Luo, Qiaoyuan Yang, Yixiang Fang, Xu Zhou","doi":"10.1145/3617329","DOIUrl":"https://doi.org/10.1145/3617329","url":null,"abstract":"As an important cohesive subgraph model in bipartite graphs, the (α, β)-core (a.k.a. bi-core) has found a wide spectrum of real-world applications, such as product recommendation, fraudster detection, and community search. In these applications, the bipartite graphs are often large and dynamic, where vertices and edges are inserted and deleted frequently, so it is costly to recompute (α, β)-cores from scratch when the graph has changed. Recently, a few works have attempted to study how to maintain (α, β)-cores in the dynamic bipartite graph, but their performance is still far from perfect, due to the huge size of graphs and their frequent changes. To alleviate this issue, in this paper we present efficient (α, β)-core maintenance algorithms over bipartite graphs. We first introduce a novel concept, called bi-core numbers, for the vertices of bipartite graphs. Based on this concept, we theoretically analyze the effect of inserting and deleting edges on the changes of vertices' bi-core numbers, which can be further used to narrow down the scope of the updates, thereby reducing the computational redundancy. We then propose efficient (α, β)-core maintenance algorithms for handling the edge insertion and edge deletion respectively, by exploiting the above theoretical analysis results. Finally, extensive experimental evaluations are performed on both real and synthetic datasets, and the results show that our proposed algorithms are up to two orders of magnitude faster than the state-of-the-art approaches.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"35 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136281449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}