Shay Gershtein, Uri Avron, Ido Guy, Tova Milo, Slava Novgorodov
Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility.
In our model, the input is a set of n weighted item sets that the tree would ideally contain as categories. Each category, rather than having to match its corresponding input set perfectly, is only required to exceed a given threshold under a given similarity function. The goal is to produce a tree that maximizes the total weight of the sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to; this bound is the source of the problem's hardness, as initially each item may be contained in an arbitrary number of input sets.
For this model, we prove inapproximability bounds, of order \(\tilde{\Theta}(\sqrt{n})\) or \(\tilde{\Theta}(n)\), for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where each category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees as well as the application of practical exact solvers. We further provide efficient algorithms with much improved approximation guarantees for practical special cases where the cardinalities of the input sets, or the number of input sets each item belongs to, are not too large. Finally, we also generalize our results to DAG-based and non-hierarchical categorization.
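As a toy illustration of the identical-match special case, the sketch below greedily selects input sets subject to the per-item membership bound k. The greedy rule, set contents, and weights are illustrative assumptions; this is not one of the algorithms analyzed in the paper.

```python
# Hedged sketch: greedy selection of weighted item sets under a bound k on how
# many chosen sets an item may belong to (tree structure ignored for simplicity).
from collections import Counter

def greedy_select(weighted_sets, k):
    """weighted_sets: list of (weight, frozenset of items). Returns chosen sets."""
    membership = Counter()  # item -> number of chosen sets containing it
    chosen = []
    # Consider heavier sets first; keep a set only if every item respects the bound.
    for w, s in sorted(weighted_sets, key=lambda ws: -ws[0]):
        if all(membership[x] < k for x in s):
            chosen.append((w, s))
            for x in s:
                membership[x] += 1
    return chosen

sets_ = [(5, frozenset({"a", "b"})), (4, frozenset({"b", "c"})), (3, frozenset({"a", "c"}))]
picked = greedy_select(sets_, k=1)
# With k = 1 these overlapping sets are mutually exclusive; the greedy keeps the heaviest.
```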
Automated Category Tree Construction: Hardness Bounds and Algorithms. ACM Transactions on Database Systems, 2024-05-09. doi:10.1145/3664283
Nofar Carmeli, Martin Grohe, Benny Kimelfeld, Ester Livshits, Muhammad Tibi
A common interpretation of soft constraints penalizes the database for every violation of every constraint, where the penalty is the cost (weight) of the constraint. A computational challenge is that of finding an optimal subset: a collection of database tuples that minimizes the total penalty when each tuple has a cost of being excluded. When the constraints are strict (i.e., have an infinite cost), this subset is a “cardinality repair” of an inconsistent database; in soft interpretations, this subset corresponds to a “most probable world” of a probabilistic database, a “most likely intention” of a probabilistic unclean database, and so on. Within the class of functional dependencies, the complexity of finding a cardinality repair is thoroughly understood. Yet, very little is known about the complexity of finding an optimal subset for the more general soft semantics. The work described in this manuscript makes significant progress in that direction. In addition to general insights about the hardness and approximability of the problem, we present algorithms for two special cases (and some generalizations thereof): a single functional dependency, and a bipartite matching. The latter is the problem of finding an optimal “almost matching” of a bipartite graph where a penalty is paid for every lost edge and every violation of monogamy. For these special cases, we also investigate the complexity of additional computational tasks that arise when the soft constraints are used as a means to represent a probabilistic database via a factor graph, as in the case of a probabilistic unclean database.
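The soft semantics can be made concrete with a brute-force sketch for a single functional dependency X → Y: minimize the total of exclusion costs for dropped tuples plus a penalty for every kept violating pair. The exhaustive search, tuple values, and costs are illustrative; the paper's algorithms for this case are far more efficient.

```python
# Hedged sketch: brute-force "optimal subset" under one soft FD X -> Y.
from itertools import chain, combinations

def optimal_subset(tuples, exclusion_cost, fd_penalty):
    """tuples: list of (x, y) pairs. Dropping tuple i costs exclusion_cost[i];
    every kept pair with equal X but different Y pays fd_penalty."""
    n = len(tuples)
    best, best_cost = set(), float("inf")
    for kept in chain.from_iterable(combinations(range(n), r) for r in range(n + 1)):
        cost = sum(exclusion_cost[i] for i in range(n) if i not in kept)
        for i, j in combinations(kept, 2):
            if tuples[i][0] == tuples[j][0] and tuples[i][1] != tuples[j][1]:
                cost += fd_penalty
        if cost < best_cost:
            best, best_cost = set(kept), cost
    return best, best_cost

relation = [("x1", "a"), ("x1", "b"), ("x2", "a")]
best, cost = optimal_subset(relation, [1, 1, 1], fd_penalty=3)
# Keeping everything pays the violation penalty (3); dropping one conflicting tuple costs 1.
```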
Database Repairing with Soft Functional Dependencies. ACM Transactions on Database Systems, 2024-03-04. doi:10.1145/3651156
This paper presents SUDAF, a declarative framework that allows users to write UDAFs (User-Defined Aggregate Functions) as mathematical expressions and use them in SQL statements. SUDAF rewrites partial aggregates of UDAFs using built-in aggregate functions and supports efficient dynamic caching and reuse of partial aggregates. Our experiments show that rewriting UDAFs using built-in functions can significantly speed up queries with UDAFs, and that the proposed sharing approach can yield up to two orders of magnitude improvement in query execution time. The paper also studies an extension of SUDAF to support sharing partial results between arbitrary queries with UDAFs. We show a connection with the problem of query rewriting using views and introduce a new class of rewritings, called SUDAF rewritings, which enables the use of views whose aggregate functions differ from those used in the input query. We investigate the underlying rewriting-checking and rewriting-existence problems. Our main technical result is a reduction of these problems to, respectively, rewriting-checking and rewriting-existence over so-called aggregate candidates, a class of rewritings that has been deeply investigated in the literature.
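A classic instance of this kind of rewriting is variance, whose partial aggregates (COUNT, SUM, SUM of squares) are expressible with built-in functions and merge pointwise, so per-chunk results can be cached and shared. This is a minimal sketch of the idea, not SUDAF's actual machinery or API.

```python
# Illustrative only: variance rewritten as mergeable partial aggregates.
def partials(xs):
    # (COUNT, SUM, SUM of squares) -- each expressible with built-in aggregates.
    return (len(xs), sum(xs), sum(x * x for x in xs))

def merge(p, q):
    # Partials combine pointwise, so cached per-chunk results can be reused.
    return tuple(a + b for a, b in zip(p, q))

def variance(p):
    n, s, ss = p
    return ss / n - (s / n) ** 2

chunk1, chunk2 = [1.0, 2.0], [3.0, 4.0]
v = variance(merge(partials(chunk1), partials(chunk2)))  # variance of [1, 2, 3, 4]
```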
Chao Zhang, Farouk Toumani. Sharing Queries with Nonequivalent User-Defined Aggregate Functions. ACM Transactions on Database Systems, 2024-02-24. doi:10.1145/3649133
We present the theoretical foundations and first experimental study of a new approach in centrality measures for graph data. The main principle is straightforward: the more relevant subgraphs around a vertex, the more central it is in the network. We formalize the notion of “relevant subgraphs” by choosing a family of subgraphs that, given a graph G and a vertex v, assigns a subset of connected subgraphs of G that contain v. Any such family defines a measure of centrality by counting the number of subgraphs assigned to the vertex, i.e., a vertex will be more important for the network if it belongs to more subgraphs in the family. We show several examples of this approach. In particular, we propose the All-Subgraphs (All-Trees) centrality, a centrality measure that considers every subgraph (tree). We study fundamental properties of families of subgraphs that guarantee desirable properties of the resulting centrality measure. Interestingly, All-Subgraphs and All-Trees satisfy all these properties, showing their robustness as centrality notions. To conclude the theoretical analysis, we study the computational complexity of counting certain families of subgraphs and show a linear time algorithm to compute the All-Subgraphs and All-Trees centrality for graphs with bounded treewidth. Finally, we implemented these algorithms and computed these measures over more than one hundred real-world networks. With this data, we present an empirical comparison between well-known centrality measures and those proposed in this work.
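A brute-force rendering of this counting idea, restricted here to induced subgraphs as a simplification of the All-Subgraphs measure, might look like the following; the graph and names are illustrative, and the enumeration is exponential (the paper gives a linear-time algorithm for bounded treewidth).

```python
# Hedged sketch: count connected induced subgraphs containing a vertex v.
from itertools import combinations

def connected(vertices, edges):
    vertices = set(vertices)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        # Follow edges whose endpoints both lie inside the candidate vertex set.
        stack.extend(w for a, b in edges for w in (a, b)
                     if {a, b} <= vertices and u in (a, b) and w not in seen)
    return seen == vertices

def subgraph_centrality(v, nodes, edges):
    # The more connected (induced) subgraphs contain v, the more central v is.
    return sum(1
               for r in range(1, len(nodes) + 1)
               for subset in combinations(nodes, r)
               if v in subset and connected(subset, edges))

edges = [("a", "b"), ("b", "c")]               # a path graph a - b - c
score_b = subgraph_centrality("b", ["a", "b", "c"], edges)
# The middle vertex of the path belongs to more connected subgraphs than an endpoint.
```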
Sebastián Bugedo, Cristian Riveros, Jorge Salas. A family of centrality measures for graph data based on subgraphs. ACM Transactions on Database Systems, 2024-02-23. doi:10.1145/3649134
David Tench, Evan West, Victor Zhang, Michael A. Bender, Abiyaz Chowdhury, Daniel Delayo, J. Ahmed Dellas, Martín Farach-Colton, Tyler Seip, Kenny Zhang
Finding the connected components of a graph is a fundamental problem with uses throughout computer science and engineering. The task of computing connected components becomes more difficult when graphs are very large, or when they are dynamic, meaning the edge set changes over time subject to a stream of edge insertions and deletions. A natural approach to computing the connected components problem on a large, dynamic graph stream is to buy enough RAM to store the entire graph. However, the requirement that the graph fit in RAM is an inherent limitation of this approach and is prohibitive for very large graphs. Thus, there is an unmet need for systems that can process dense dynamic graphs, especially when those graphs are larger than available RAM.
We present a new high-performance streaming graph-processing system for computing the connected components of a graph. This system, which we call GraphZeppelin, uses new linear sketching data structures (CubeSketch) to solve the streaming connected components problem and as a result requires space asymptotically smaller than the space required for a lossless representation of the graph. GraphZeppelin is optimized for massive dense graphs: GraphZeppelin can process millions of edge updates (both insertions and deletions) per second, even when the underlying graph is far too large to fit in available RAM. As a result, GraphZeppelin vastly increases the scale of graphs that can be processed.
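The cancellation property behind linear graph sketches can be illustrated with plain XOR: each vertex keeps the XOR of the identifiers of its incident edges, so edges internal to a vertex set cancel when the set's sketches are combined, leaving a crossing edge. This toy uses a naive deterministic encoding and is not CubeSketch itself, which also handles multiple crossing edges probabilistically and supports deletions by XOR-ing ids back out.

```python
# Hedged sketch of the XOR-cancellation idea underlying linear graph sketches.
def edge_id(u, v):
    # Encode an undirected edge as a single integer (toy encoding, assumes v < 2**16).
    a, b = sorted((u, v))
    return (a << 16) | b

def vertex_sketch(u, edges):
    # XOR of the ids of edges incident to u; a deletion would XOR the id out again.
    s = 0
    for a, b in edges:
        if u in (a, b):
            s ^= edge_id(a, b)
    return s

edges = [(1, 2), (2, 3), (3, 4)]
# Combining sketches over S = {1, 2, 3}: internal edges appear twice and cancel,
# so the single cut edge (3, 4) survives.
cut = vertex_sketch(1, edges) ^ vertex_sketch(2, edges) ^ vertex_sketch(3, edges)
```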
GraphZeppelin: How to Find Connected Components (Even When Graphs Are Dense, Dynamic, and Massive). ACM Transactions on Database Systems, 2024-02-20. doi:10.1145/3643846
Adriane Chapman, Luca Lauro, Paolo Missier, Riccardo Torlone
Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and its effect on the data explained. In this framework, we aim to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used within a data science setting, in this work we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library to produce, in a semi-automatic way, complete provenance documents that account for the entire pipeline. We report on the ability of that reference implementation to capture provenance in real ML benchmark pipelines and over TPC-DI synthetic data. We finally show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange.
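Element-level provenance for a simple filter operator can be sketched as follows; the mapping representation is an illustrative assumption, far simpler than the PROV templates the paper defines.

```python
# Hedged sketch: a preprocessing operator instrumented to record which input
# element produced which output element (fine-grained provenance).
def filter_with_provenance(rows, predicate):
    """Apply a row filter and record (input position, output position) pairs."""
    out, prov = [], []
    for i, row in enumerate(rows):
        if predicate(row):
            prov.append((i, len(out)))  # output row len(out) derives from input row i
            out.append(row)
    return out, prov

rows = [{"age": 17}, {"age": 42}, {"age": None}]
clean, prov = filter_with_provenance(
    rows, lambda r: r["age"] is not None and r["age"] >= 18)
# prov records that output row 0 came from input row 1; dropped rows leave no pair.
```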
成功的数据驱动科学需要复杂的数据工程管道来清理、转换和改变数据,为机器学习做准备,而只有当管道中的每一步都有理有据,并能解释其对数据的影响时,才能取得稳健的结果。在这个框架中,我们的目标是为数据科学家提供设施,让他们深入了解从原始输入到准备用于学习的训练集这一过程中的每一步是如何影响数据的。从数据科学环境中常用的一组可扩展的数据准备操作符开始,我们在这项工作中提出了一种出处管理基础架构,用于生成、存储和查询非常细化的数据转换记录,尽可能在数据集内的单个元素级别上进行。然后,通过对一组核心数据科学预处理操作符的正式定义,我们推导出了一种出处语义,该语义由一系列以 PROV(一种数据出处的标准模型)表达的模板所体现。以这些模板为参考,我们的出处生成算法可以推广到任何具有可观测输入/输出对的操作符。我们提供了应用级出处捕获库的原型实现,以半自动的方式生成完整的出处文档,说明整个流水线的情况。我们报告了该参考实现在实际 ML 基准管道和 TCP-DI 合成数据中捕获出处的能力。最后,我们展示了如何利用收集到的出处来回答一系列出处基准查询,这些查询是数据科学堆栈交换(Data Science Stack Exchange)上一些常见管道检查问题的基础。
Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance. ACM Transactions on Database Systems, 2024-02-09. doi:10.1145/3644385
Diego Arroyuelo, Adrián Gómez-Brandón, Aidan Hogan, Gonzalo Navarro, Juan Reutter, Javiel Rojas-Ledesma, Adrián Soto
We present an indexing scheme for triple-based graphs that supports join queries in worst-case optimal (wco) time within compact space. This scheme, called a ring, regards each triple as a cyclic string of length 3. Each rotation of the triples is lexicographically sorted and the values of the last attribute are stored as a column, so we obtain the order of the next column by stably re-sorting the triples by its attribute. We show that, by representing the columns with a compact data structure called a wavelet tree, this ordering enables forward and backward navigation between columns without needing pointers. These wavelet trees further support wco join algorithms and cardinality estimations for query planning. While traditional data structures such as B-Trees, tries, etc., require 6 index orders to support all possible wco joins over triples, we can use one ring to index them all. This ring replaces the graph and uses only sublinear extra space, thus supporting wco joins in almost no space beyond storing the graph itself. Experiments querying a large graph (Wikidata) in memory show that the ring offers nearly the best overall query times while using only a small fraction of the space required by several state-of-the-art approaches.
We then turn our attention to some theoretical results for indexing tables of arity d higher than 3 in a way that supports wco joins. While a single ring of length d no longer suffices to cover all d! orders, we need far fewer rings to index them all: O(2^d) rings with a small constant. For example, we need 5 rings instead of 120 orders for d = 5. We show that our rings become a particular case of what we dub order graphs, whose nodes are attribute orders and where stably sorting by some attribute leads us from one order to another, thereby inducing an edge labeled by the attribute. The index is then the set of columns associated with the edges, and a set of rings is just one possible graph shape. We show that other shapes, such as a single ring instead of several rings of length d, can lead to even smaller indexes, and that other, more general shapes are also possible. For example, we handle d = 5 attributes within space equivalent to 4 rings.
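The ring's key mechanical step described above, that stably re-sorting by the next attribute turns one lexicographic rotation into the next, can be checked on toy triples (the data is illustrative; the actual index stores only wavelet-tree columns, not materialized sort orders):

```python
# Hedged sketch: stable re-sorting moves between rotations of triple order.
triples = [("s1", "p1", "o2"), ("s1", "p2", "o1"), ("s2", "p1", "o1")]

spo = sorted(triples)                      # lexicographic S, P, O order
# Python's sort is stable, so re-sorting by O alone preserves the S, P order
# among equal O values -- yielding exactly the O, S, P rotation.
osp = sorted(spo, key=lambda t: t[2])
```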
The Ring: Worst-Case Optimal Joins in Graph Databases using (Almost) No Extra Space. ACM Transactions on Database Systems, 2024-02-08. doi:10.1145/3644824
Sabah Currim, Richard T. Snodgrass, Young-Kyoon Suh
The query optimization phase within a database management system (DBMS) ostensibly finds the fastest query execution plan from a potentially large set of enumerated plans, all of which correctly compute the same result of the specified query. Sometimes the cost-based optimizer selects a slower plan, for a variety of reasons. Previous work has focused on increasing the performance of specific components, often a single operator, within an individual DBMS. However, that does not address the fundamental question: from where does this suboptimality arise, across DBMSes generally? In particular, the contribution of each of many possible factors to DBMS suboptimality is currently unknown. To identify the root causes of DBMS suboptimality, we first introduce the notion of empirical suboptimality of a query plan chosen by the DBMS, indicated by the existence of a query plan that performs more efficiently than the chosen plan, for the same query. A crucial aspect is that this can be measured externally to the DBMS, and thus does not require access to its source code. We then propose a novel predictive model to explain the relationship between various factors in query optimization and empirical suboptimality. Our model associates suboptimality with the factors of complexity of the schema, of the underlying data on which the query is evaluated, of the query itself, and of the DBMS optimizer. The model also characterizes concomitant interactions among these factors. This model induces a number of specific hypotheses that were tested on multiple DBMSes. We performed a series of experiments that examined the plans for thousands of queries run on four popular DBMSes. We tested the model on over a million of these query executions, using correlational analysis, regression analysis, and causal analysis, specifically Structural Equation Modeling (SEM). 
We observed that the dependent construct of empirical suboptimality prevalence correlates positively with nine specific constructs, characterizing four identified factors, which together explain much of the variance in suboptimality across two extensive benchmarks and these disparate DBMSes. This predictive model shows that it is the common aspects of these DBMSes that predict suboptimality, not the particulars embedded in the inordinate complexity of each of these DBMSes. This paper thus provides a new methodology to study mature query optimizers, identifies underlying DBMS-independent causes for the observed suboptimality, and quantifies the relative contribution of each of these causes. This work thus provides a roadmap for fundamental improvements of cost-based query optimizers.
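The key measurement — empirical suboptimality — is defined externally to the DBMS: a chosen plan is suboptimal if some alternative plan for the same query runs faster. A minimal sketch of that metric, assuming measured wall-clock times (the function name and ratio formulation are illustrative, not the paper's exact metric):

```python
# Hypothetical sketch of the empirical-suboptimality idea: compare the
# optimizer's chosen plan against alternative plans for the same query,
# using only externally measured runtimes (no DBMS source access needed).
def empirical_suboptimality(chosen_time, alternative_times):
    """Ratio of the chosen plan's runtime to the fastest observed runtime.

    Returns 1.0 when no alternative plan beats the chosen plan; values
    above 1.0 indicate empirical suboptimality.
    """
    best = min([chosen_time] + list(alternative_times))
    return chosen_time / best
```

For example, a chosen plan taking 10s while some hinted alternative takes 5s yields a ratio of 2.0, flagging the choice as empirically suboptimal.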
"Identifying the Root Causes of DBMS Suboptimality". Sabah Currim, Richard T. Snodgrass, Young-Kyoon Suh. ACM Transactions on Database Systems, published 2024-01-10. DOI: 10.1145/3636425.
This paper proposes a notion of parametric simulation to link entities across a relational database \(\mathcal{D}\) and a graph G. Taking functions and thresholds for measuring vertex closeness, path associations, and important properties as parameters, parametric simulation identifies tuples t in \(\mathcal{D}\) and vertices v in G that refer to the same real-world entity, based on both topological and semantic matching. We develop machine learning methods to learn the parameter functions and thresholds. We show that parametric simulation can be computed in quadratic time, by providing such an algorithm. Moreover, we develop an incremental algorithm for parametric simulation; we show that the incremental algorithm is bounded relative to its batch counterpart, i.e., it incurs the minimum cost for incrementalizing the batch algorithm. Putting these together, we develop HER, a parallel system to check whether (t, v) makes a match, find all vertex matches of t in G, and compute all matches across \(\mathcal{D}\) and G, all in quadratic time; moreover, HER supports incremental computation of these in response to updates to \(\mathcal{D}\) and G. Using real-life and synthetic data, we empirically verify that HER is accurate, with an F-measure of 0.94 on average, and is able to scale with database \(\mathcal{D}\) and graph G for both batch and incremental computations.
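The parametric flavor of the matching — similarity functions and thresholds supplied as parameters rather than hard-coded — can be sketched in a few lines. This is a toy illustration only (the names `vertex_matches` and `overlap` are assumptions; the actual HER system learns its parameter functions and uses topological as well as semantic signals):

```python
# Toy illustration of threshold-parameterized matching (hypothetical API,
# not the HER system): keep every vertex whose learned score against the
# tuple clears the learned threshold.
def vertex_matches(t, vertices, score, threshold):
    """All vertices v that match tuple t under the given parameter
    function `score` and cut-off `threshold`."""
    return [v for v in vertices if score(t, v) >= threshold]

# A trivial stand-in score: number of shared attribute values.
def overlap(t, v):
    return len(set(t) & set(v))
```

Swapping in a different `score` or `threshold` changes which pairs count as referring to the same real-world entity, which is the sense in which the simulation is "parametric".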
"Linking Entities across Relations and Graphs". Wenfei Fan, Ping Lu, Kehan Pang, Ruochun Jin. ACM Transactions on Database Systems, published 2024-01-03. DOI: 10.1145/3639363.
Georg Gottlob, Matthias Lanzinger, Cem Okulmus, Reinhard Pichler
Various classic reasoning problems with natural hypergraph representations are known to be tractable if a hypertree decomposition (HD) of low width exists. The resulting algorithms are attractive for practical use in fields like databases and constraint satisfaction. However, algorithmic use of HDs relies on the difficult task of first computing a decomposition of the hypergraph underlying a given problem instance, which is then used to guide the algorithm for this particular instance. The performance of purely sequential methods for computing HDs is inherently limited, yet the problem is, theoretically, amenable to parallelisation. In this paper we propose the first algorithm for computing hypertree decompositions that is well-suited for parallelisation. The newly proposed algorithm log-k-decomp requires only a logarithmic number of recursion levels and additionally allows for highly parallelised pruning of the search space by restriction to so-called balanced separators. We provide a detailed experimental evaluation over the HyperBench benchmark and demonstrate that log-k-decomp outperforms the current state of the art significantly.
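The logarithmic recursion depth claimed for log-k-decomp follows from restricting to balanced separators: each recursive call handles at most half of the remaining hypergraph, so the number of levels is logarithmic in the instance size. A minimal sketch of that counting argument (illustrative only; the real algorithm recurses on hypergraph components, not a bare edge count):

```python
# Illustrative sketch: if every balanced separator splits the remaining
# problem so each part has at most half the edges, the recursion bottoms
# out after a logarithmic number of levels.
def recursion_depth(num_edges):
    """Levels needed when each level at least halves the subproblem."""
    depth = 0
    while num_edges > 1:
        num_edges //= 2  # a balanced separator halves the subproblem
        depth += 1
    return depth
```

This is what makes the search amenable to parallelisation: the work at each of the O(log n) levels can be pruned and explored independently.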
"Fast Parallel Hypertree Decompositions in Logarithmic Recursion Depth". Georg Gottlob, Matthias Lanzinger, Cem Okulmus, Reinhard Pichler. ACM Transactions on Database Systems, published 2023-12-30. DOI: 10.1145/3638758.