Shay Gershtein, Uri Avron, Ido Guy, Tova Milo, Slava Novgorodov
Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility.
In our model, the input is a set of n weighted item sets that the tree would ideally contain as categories. Each category, rather than having to match its corresponding input set perfectly, is only required to exceed a given threshold under a given similarity function. The goal is to produce a tree that maximizes the total weight of the sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to; this bound is the source of the problem's hardness, as initially each item may be contained in an arbitrary number of input sets.
For this model, we prove inapproximability bounds, of order \(\tilde{\Theta}(\sqrt{n})\) or \(\tilde{\Theta}(n)\), for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where each category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees as well as the application of practical exact solvers. We further provide efficient algorithms with much improved approximation guarantees for practical special cases where the cardinalities of the input sets, or the number of input sets each item belongs to, are not too large. Finally, we also generalize our results to DAG-based and non-hierarchical categorization.
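As a toy illustration of the identical-match special case, the sketch below greedily selects input sets subject to the per-item membership bound k. The greedy rule, set contents, and weights are illustrative assumptions; this is not one of the algorithms analyzed in the paper.

```python
# Hedged sketch: greedy selection of weighted item sets under a bound k on how
# many chosen sets an item may belong to (tree structure ignored for simplicity).
from collections import Counter

def greedy_select(weighted_sets, k):
    """weighted_sets: list of (weight, frozenset of items). Returns chosen sets."""
    membership = Counter()  # item -> number of chosen sets containing it
    chosen = []
    # Consider heavier sets first; keep a set only if every item respects the bound.
    for w, s in sorted(weighted_sets, key=lambda ws: -ws[0]):
        if all(membership[x] < k for x in s):
            chosen.append((w, s))
            for x in s:
                membership[x] += 1
    return chosen

sets_ = [(5, frozenset({"a", "b"})), (4, frozenset({"b", "c"})), (3, frozenset({"a", "c"}))]
picked = greedy_select(sets_, k=1)
# With k = 1 these overlapping sets are mutually exclusive; the greedy keeps the heaviest.
```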
Automated Category Tree Construction: Hardness Bounds and Algorithms. ACM Transactions on Database Systems, 2024-05-09. doi:10.1145/3664283
Nofar Carmeli, Martin Grohe, Benny Kimelfeld, Ester Livshits, Muhammad Tibi
A common interpretation of soft constraints penalizes the database for every violation of every constraint, where the penalty is the cost (weight) of the constraint. A computational challenge is that of finding an optimal subset: a collection of database tuples that minimizes the total penalty when each tuple has a cost of being excluded. When the constraints are strict (i.e., have an infinite cost), this subset is a “cardinality repair” of an inconsistent database; in soft interpretations, this subset corresponds to a “most probable world” of a probabilistic database, a “most likely intention” of a probabilistic unclean database, and so on. Within the class of functional dependencies, the complexity of finding a cardinality repair is thoroughly understood. Yet, very little is known about the complexity of finding an optimal subset for the more general soft semantics. The work described in this manuscript makes significant progress in that direction. In addition to general insights about the hardness and approximability of the problem, we present algorithms for two special cases (and some generalizations thereof): a single functional dependency, and a bipartite matching. The latter is the problem of finding an optimal “almost matching” of a bipartite graph where a penalty is paid for every lost edge and every violation of monogamy. For these special cases, we also investigate the complexity of additional computational tasks that arise when the soft constraints are used as a means to represent a probabilistic database via a factor graph, as in the case of a probabilistic unclean database.
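The soft semantics can be made concrete with a brute-force sketch for a single functional dependency X → Y: minimize the total of exclusion costs for dropped tuples plus a penalty for every kept violating pair. The exhaustive search, tuple values, and costs are illustrative; the paper's algorithms for this case are far more efficient.

```python
# Hedged sketch: brute-force "optimal subset" under one soft FD X -> Y.
from itertools import chain, combinations

def optimal_subset(tuples, exclusion_cost, fd_penalty):
    """tuples: list of (x, y) pairs. Dropping tuple i costs exclusion_cost[i];
    every kept pair with equal X but different Y pays fd_penalty."""
    n = len(tuples)
    best, best_cost = set(), float("inf")
    for kept in chain.from_iterable(combinations(range(n), r) for r in range(n + 1)):
        cost = sum(exclusion_cost[i] for i in range(n) if i not in kept)
        for i, j in combinations(kept, 2):
            if tuples[i][0] == tuples[j][0] and tuples[i][1] != tuples[j][1]:
                cost += fd_penalty
        if cost < best_cost:
            best, best_cost = set(kept), cost
    return best, best_cost

relation = [("x1", "a"), ("x1", "b"), ("x2", "a")]
best, cost = optimal_subset(relation, [1, 1, 1], fd_penalty=3)
# Keeping everything pays the violation penalty (3); dropping one conflicting tuple costs 1.
```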
Database Repairing with Soft Functional Dependencies. ACM Transactions on Database Systems, 2024-03-04. doi:10.1145/3651156
This paper presents SUDAF, a declarative framework that allows users to write UDAFs (User-Defined Aggregate Functions) as mathematical expressions and use them in SQL statements. SUDAF rewrites partial aggregates of UDAFs using built-in aggregate functions and supports efficient dynamic caching and reuse of partial aggregates. Our experiments show that rewriting UDAFs using built-in functions can significantly speed up queries with UDAFs, and that the proposed sharing approach can yield up to two orders of magnitude improvement in query execution time. The paper also studies an extension of SUDAF to support sharing partial results between arbitrary queries with UDAFs. We show a connection with the problem of query rewriting using views and introduce a new class of rewritings, called SUDAF rewritings, which enables the use of views whose aggregate functions differ from those used in the input query. We investigate the underlying rewriting-checking and rewriting-existence problems. Our main technical result is a reduction of these problems to, respectively, rewriting-checking and rewriting-existence over so-called aggregate candidates, a class of rewritings that has been deeply investigated in the literature.
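A classic instance of this kind of rewriting is variance, whose partial aggregates (COUNT, SUM, SUM of squares) are expressible with built-in functions and merge pointwise, so per-chunk results can be cached and shared. This is a minimal sketch of the idea, not SUDAF's actual machinery or API.

```python
# Illustrative only: variance rewritten as mergeable partial aggregates.
def partials(xs):
    # (COUNT, SUM, SUM of squares) -- each expressible with built-in aggregates.
    return (len(xs), sum(xs), sum(x * x for x in xs))

def merge(p, q):
    # Partials combine pointwise, so cached per-chunk results can be reused.
    return tuple(a + b for a, b in zip(p, q))

def variance(p):
    n, s, ss = p
    return ss / n - (s / n) ** 2

chunk1, chunk2 = [1.0, 2.0], [3.0, 4.0]
v = variance(merge(partials(chunk1), partials(chunk2)))  # variance of [1, 2, 3, 4]
```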
Chao Zhang, Farouk Toumani. Sharing Queries with Nonequivalent User-Defined Aggregate Functions. ACM Transactions on Database Systems, 2024-02-24. doi:10.1145/3649133
We present the theoretical foundations and first experimental study of a new approach in centrality measures for graph data. The main principle is straightforward: the more relevant subgraphs around a vertex, the more central it is in the network. We formalize the notion of “relevant subgraphs” by choosing a family of subgraphs that, given a graph G and a vertex v, assigns a subset of connected subgraphs of G that contain v. Any such family defines a measure of centrality by counting the number of subgraphs assigned to the vertex, i.e., a vertex will be more important for the network if it belongs to more subgraphs in the family. We show several examples of this approach. In particular, we propose the All-Subgraphs (All-Trees) centrality, a centrality measure that considers every subgraph (tree). We study fundamental properties of families of subgraphs that guarantee desirable properties of the resulting centrality measure. Interestingly, All-Subgraphs and All-Trees satisfy all these properties, showing their robustness as centrality notions. To conclude the theoretical analysis, we study the computational complexity of counting certain families of subgraphs and show a linear time algorithm to compute the All-Subgraphs and All-Trees centrality for graphs with bounded treewidth. Finally, we implemented these algorithms and computed these measures over more than one hundred real-world networks. With this data, we present an empirical comparison between well-known centrality measures and those proposed in this work.
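A brute-force rendering of this counting idea, restricted here to induced subgraphs as a simplification of the All-Subgraphs measure, might look like the following; the graph and names are illustrative, and the enumeration is exponential (the paper gives a linear-time algorithm for bounded treewidth).

```python
# Hedged sketch: count connected induced subgraphs containing a vertex v.
from itertools import combinations

def connected(vertices, edges):
    vertices = set(vertices)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        # Follow edges whose endpoints both lie inside the candidate vertex set.
        stack.extend(w for a, b in edges for w in (a, b)
                     if {a, b} <= vertices and u in (a, b) and w not in seen)
    return seen == vertices

def subgraph_centrality(v, nodes, edges):
    # The more connected (induced) subgraphs contain v, the more central v is.
    return sum(1
               for r in range(1, len(nodes) + 1)
               for subset in combinations(nodes, r)
               if v in subset and connected(subset, edges))

edges = [("a", "b"), ("b", "c")]               # a path graph a - b - c
score_b = subgraph_centrality("b", ["a", "b", "c"], edges)
# The middle vertex of the path belongs to more connected subgraphs than an endpoint.
```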
Sebastián Bugedo, Cristian Riveros, Jorge Salas. A family of centrality measures for graph data based on subgraphs. ACM Transactions on Database Systems, 2024-02-23. doi:10.1145/3649134
David Tench, Evan West, Victor Zhang, Michael A. Bender, Abiyaz Chowdhury, Daniel Delayo, J. Ahmed Dellas, Martín Farach-Colton, Tyler Seip, Kenny Zhang
Finding the connected components of a graph is a fundamental problem with uses throughout computer science and engineering. The task of computing connected components becomes more difficult when graphs are very large, or when they are dynamic, meaning the edge set changes over time subject to a stream of edge insertions and deletions. A natural approach to computing the connected components problem on a large, dynamic graph stream is to buy enough RAM to store the entire graph. However, the requirement that the graph fit in RAM is an inherent limitation of this approach and is prohibitive for very large graphs. Thus, there is an unmet need for systems that can process dense dynamic graphs, especially when those graphs are larger than available RAM.
We present a new high-performance streaming graph-processing system for computing the connected components of a graph. This system, which we call GraphZeppelin, uses new linear sketching data structures (CubeSketch) to solve the streaming connected components problem and as a result requires space asymptotically smaller than the space required for a lossless representation of the graph. GraphZeppelin is optimized for massive dense graphs: GraphZeppelin can process millions of edge updates (both insertions and deletions) per second, even when the underlying graph is far too large to fit in available RAM. As a result, GraphZeppelin vastly increases the scale of graphs that can be processed.
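The cancellation property behind linear graph sketches can be illustrated with plain XOR: each vertex keeps the XOR of the identifiers of its incident edges, so edges internal to a vertex set cancel when the set's sketches are combined, leaving a crossing edge. This toy uses a naive deterministic encoding and is not CubeSketch itself, which also handles multiple crossing edges probabilistically and supports deletions by XOR-ing ids back out.

```python
# Hedged sketch of the XOR-cancellation idea underlying linear graph sketches.
def edge_id(u, v):
    # Encode an undirected edge as a single integer (toy encoding, assumes v < 2**16).
    a, b = sorted((u, v))
    return (a << 16) | b

def vertex_sketch(u, edges):
    # XOR of the ids of edges incident to u; a deletion would XOR the id out again.
    s = 0
    for a, b in edges:
        if u in (a, b):
            s ^= edge_id(a, b)
    return s

edges = [(1, 2), (2, 3), (3, 4)]
# Combining sketches over S = {1, 2, 3}: internal edges appear twice and cancel,
# so the single cut edge (3, 4) survives.
cut = vertex_sketch(1, edges) ^ vertex_sketch(2, edges) ^ vertex_sketch(3, edges)
```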
GraphZeppelin: How to Find Connected Components (Even When Graphs Are Dense, Dynamic, and Massive). ACM Transactions on Database Systems, 2024-02-20. doi:10.1145/3643846
Adriane Chapman, Luca Lauro, Paolo Missier, Riccardo Torlone
Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and its effect on the data explained. In this framework, we aim to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used within a data science setting, in this work we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library to produce, in a semi-automatic way, complete provenance documents that account for the entire pipeline. We report on the ability of that reference implementation to capture provenance in real ML benchmark pipelines and over TPC-DI synthetic data. We finally show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange.
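Element-level provenance for a simple filter operator can be sketched as follows; the mapping representation is an illustrative assumption, far simpler than the PROV templates the paper defines.

```python
# Hedged sketch: a preprocessing operator instrumented to record which input
# element produced which output element (fine-grained provenance).
def filter_with_provenance(rows, predicate):
    """Apply a row filter and record (input position, output position) pairs."""
    out, prov = [], []
    for i, row in enumerate(rows):
        if predicate(row):
            prov.append((i, len(out)))  # output row len(out) derives from input row i
            out.append(row)
    return out, prov

rows = [{"age": 17}, {"age": 42}, {"age": None}]
clean, prov = filter_with_provenance(
    rows, lambda r: r["age"] is not None and r["age"] >= 18)
# prov records that output row 0 came from input row 1; dropped rows leave no pair.
```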
成功的数据驱动科学需要复杂的数据工程管道来清理、转换和改变数据,为机器学习做准备,而只有当管道中的每一步都有理有据,并能解释其对数据的影响时,才能取得稳健的结果。在这个框架中,我们的目标是为数据科学家提供设施,让他们深入了解从原始输入到准备用于学习的训练集这一过程中的每一步是如何影响数据的。从数据科学环境中常用的一组可扩展的数据准备操作符开始,我们在这项工作中提出了一种出处管理基础架构,用于生成、存储和查询非常细化的数据转换记录,尽可能在数据集内的单个元素级别上进行。然后,通过对一组核心数据科学预处理操作符的正式定义,我们推导出了一种出处语义,该语义由一系列以 PROV(一种数据出处的标准模型)表达的模板所体现。以这些模板为参考,我们的出处生成算法可以推广到任何具有可观测输入/输出对的操作符。我们提供了应用级出处捕获库的原型实现,以半自动的方式生成完整的出处文档,说明整个流水线的情况。我们报告了该参考实现在实际 ML 基准管道和 TCP-DI 合成数据中捕获出处的能力。最后,我们展示了如何利用收集到的出处来回答一系列出处基准查询,这些查询是数据科学堆栈交换(Data Science Stack Exchange)上一些常见管道检查问题的基础。
Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance. ACM Transactions on Database Systems, 2024-02-09. doi:10.1145/3644385
Diego Arroyuelo, Adrián Gómez-Brandón, Aidan Hogan, Gonzalo Navarro, Juan Reutter, Javiel Rojas-Ledesma, Adrián Soto
We present an indexing scheme for triple-based graphs that supports join queries in worst-case optimal (wco) time within compact space. This scheme, called a ring, regards each triple as a cyclic string of length 3. Each rotation of the triples is lexicographically sorted and the values of the last attribute are stored as a column, so we obtain the order of the next column by stably re-sorting the triples by its attribute. We show that, by representing the columns with a compact data structure called a wavelet tree, this ordering enables forward and backward navigation between columns without needing pointers. These wavelet trees further support wco join algorithms and cardinality estimations for query planning. While traditional data structures such as B-Trees, tries, etc., require 6 index orders to support all possible wco joins over triples, we can use one ring to index them all. This ring replaces the graph and uses only sublinear extra space, thus supporting wco joins in almost no space beyond storing the graph itself. Experiments querying a large graph (Wikidata) in memory show that the ring offers nearly the best overall query times while using only a small fraction of the space required by several state-of-the-art approaches.
We then turn our attention to some theoretical results for indexing tables of arity d higher than 3 in a way that supports wco joins. While a single ring of length d no longer suffices to cover all d! orders, we need far fewer rings to index them all: O(2^d) rings with a small constant. For example, we need 5 rings instead of 120 orders for d = 5. We show that our rings become a particular case of what we dub order graphs, whose nodes are attribute orders and where stably sorting by some attribute leads us from one order to another, thereby inducing an edge labeled by the attribute. The index is then the set of columns associated with the edges, and a set of rings is just one possible graph shape. We show that other shapes, such as a single ring instead of several rings of length d, can lead to even smaller indexes, and that other, more general shapes are also possible. For example, we handle d = 5 attributes within space equivalent to 4 rings.
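The ring's key mechanical step described above, that stably re-sorting by the next attribute turns one lexicographic rotation into the next, can be checked on toy triples (the data is illustrative; the actual index stores only wavelet-tree columns, not materialized sort orders):

```python
# Hedged sketch: stable re-sorting moves between rotations of triple order.
triples = [("s1", "p1", "o2"), ("s1", "p2", "o1"), ("s2", "p1", "o1")]

spo = sorted(triples)                      # lexicographic S, P, O order
# Python's sort is stable, so re-sorting by O alone preserves the S, P order
# among equal O values -- yielding exactly the O, S, P rotation.
osp = sorted(spo, key=lambda t: t[2])
```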
The Ring: Worst-Case Optimal Joins in Graph Databases using (Almost) No Extra Space. ACM Transactions on Database Systems, 2024-02-08. doi:10.1145/3644824
Sabah Currim, Richard T. Snodgrass, Young-Kyoon Suh
The query optimization phase within a database management system (DBMS) ostensibly finds the fastest query execution plan from a potentially large set of enumerated plans, all of which correctly compute the same result of the specified query. Sometimes the cost-based optimizer selects a slower plan, for a variety of reasons. Previous work has focused on increasing the performance of specific components, often a single operator, within an individual DBMS. However, that does not address the fundamental question: from where does this suboptimality arise, across DBMSes generally? In particular, the contribution of each of many possible factors to DBMS suboptimality is currently unknown. To identify the root causes of DBMS suboptimality, we first introduce the notion of empirical suboptimality of a query plan chosen by the DBMS, indicated by the existence of a query plan that performs more efficiently than the chosen plan, for the same query. A crucial aspect is that this can be measured externally to the DBMS, and thus does not require access to its source code. We then propose a novel predictive model to explain the relationship between various factors in query optimization and empirical suboptimality. Our model associates suboptimality with the factors of complexity of the schema, of the underlying data on which the query is evaluated, of the query itself, and of the DBMS optimizer. The model also characterizes concomitant interactions among these factors. This model induces a number of specific hypotheses that were tested on multiple DBMSes. We performed a series of experiments that examined the plans for thousands of queries run on four popular DBMSes. We tested the model on over a million of these query executions, using correlational analysis, regression analysis, and causal analysis, specifically Structural Equation Modeling (SEM). 
We observed that the dependent construct of empirical suboptimality prevalence correlates positively with nine specific constructs, characterizing four identified factors, which together explain much of the variance in suboptimality across two extensive benchmarks and these disparate DBMSes. This predictive model shows that it is the common aspects of these DBMSes that predict suboptimality, not the particulars embedded in the inordinate complexity of each of these DBMSes. This paper thus provides a new methodology to study mature query optimizers, identifies underlying DBMS-independent causes for the observed suboptimality, and quantifies the relative contribution of each of these causes. This work thus provides a roadmap for fundamental improvements of cost-based query optimizers.
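The key measurement — empirical suboptimality — is defined externally to the DBMS: a chosen plan is suboptimal if some alternative plan for the same query runs faster. A minimal sketch of that metric, assuming measured wall-clock times (the function name and ratio formulation are illustrative, not the paper's exact metric):

```python
# Hypothetical sketch of the empirical-suboptimality idea: compare the
# optimizer's chosen plan against alternative plans for the same query,
# using only externally measured runtimes (no DBMS source access needed).
def empirical_suboptimality(chosen_time, alternative_times):
    """Ratio of the chosen plan's runtime to the fastest observed runtime.

    Returns 1.0 when no alternative plan beats the chosen plan; values
    above 1.0 indicate empirical suboptimality.
    """
    best = min([chosen_time] + list(alternative_times))
    return chosen_time / best
```

For example, a chosen plan taking 10s while some hinted alternative takes 5s yields a ratio of 2.0, flagging the choice as empirically suboptimal.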
"Identifying the Root Causes of DBMS Suboptimality". Sabah Currim, Richard T. Snodgrass, Young-Kyoon Suh. ACM Transactions on Database Systems, published 2024-01-10. DOI: 10.1145/3636425.
This paper proposes a notion of parametric simulation to link entities across a relational database \(\mathcal{D}\) and a graph G. Taking functions and thresholds for measuring vertex closeness, path associations, and important properties as parameters, parametric simulation identifies tuples t in \(\mathcal{D}\) and vertices v in G that refer to the same real-world entity, based on both topological and semantic matching. We develop machine learning methods to learn the parameter functions and thresholds. We show that parametric simulation can be computed in quadratic time, by providing such an algorithm. Moreover, we develop an incremental algorithm for parametric simulation; we show that the incremental algorithm is bounded relative to its batch counterpart, i.e., it incurs the minimum cost for incrementalizing the batch algorithm. Putting these together, we develop HER, a parallel system to check whether (t, v) makes a match, find all vertex matches of t in G, and compute all matches across \(\mathcal{D}\) and G, all in quadratic time; moreover, HER supports incremental computation of these in response to updates to \(\mathcal{D}\) and G. Using real-life and synthetic data, we empirically verify that HER is accurate, with an F-measure of 0.94 on average, and is able to scale with database \(\mathcal{D}\) and graph G for both batch and incremental computations.
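The parametric flavor of the matching — similarity functions and thresholds supplied as parameters rather than hard-coded — can be sketched in a few lines. This is a toy illustration only (the names `vertex_matches` and `overlap` are assumptions; the actual HER system learns its parameter functions and uses topological as well as semantic signals):

```python
# Toy illustration of threshold-parameterized matching (hypothetical API,
# not the HER system): keep every vertex whose learned score against the
# tuple clears the learned threshold.
def vertex_matches(t, vertices, score, threshold):
    """All vertices v that match tuple t under the given parameter
    function `score` and cut-off `threshold`."""
    return [v for v in vertices if score(t, v) >= threshold]

# A trivial stand-in score: number of shared attribute values.
def overlap(t, v):
    return len(set(t) & set(v))
```

Swapping in a different `score` or `threshold` changes which pairs count as referring to the same real-world entity, which is the sense in which the simulation is "parametric".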
"Linking Entities across Relations and Graphs". Wenfei Fan, Ping Lu, Kehan Pang, Ruochun Jin. ACM Transactions on Database Systems, published 2024-01-03. DOI: 10.1145/3639363.
Georg Gottlob, Matthias Lanzinger, Cem Okulmus, Reinhard Pichler
Various classic reasoning problems with natural hypergraph representations are known to be tractable if a hypertree decomposition (HD) of low width exists. The resulting algorithms are attractive for practical use in fields like databases and constraint satisfaction. However, algorithmic use of HDs relies on the difficult task of first computing a decomposition of the hypergraph underlying a given problem instance, which is then used to guide the algorithm for this particular instance. The performance of purely sequential methods for computing HDs is inherently limited, yet the problem is, theoretically, amenable to parallelisation. In this paper we propose the first algorithm for computing hypertree decompositions that is well-suited for parallelisation. The newly proposed algorithm log-k-decomp requires only a logarithmic number of recursion levels and additionally allows for highly parallelised pruning of the search space by restriction to so-called balanced separators. We provide a detailed experimental evaluation over the HyperBench benchmark and demonstrate that log-k-decomp outperforms the current state of the art significantly.
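The logarithmic recursion depth claimed for log-k-decomp follows from restricting to balanced separators: each recursive call handles at most half of the remaining hypergraph, so the number of levels is logarithmic in the instance size. A minimal sketch of that counting argument (illustrative only; the real algorithm recurses on hypergraph components, not a bare edge count):

```python
# Illustrative sketch: if every balanced separator splits the remaining
# problem so each part has at most half the edges, the recursion bottoms
# out after a logarithmic number of levels.
def recursion_depth(num_edges):
    """Levels needed when each level at least halves the subproblem."""
    depth = 0
    while num_edges > 1:
        num_edges //= 2  # a balanced separator halves the subproblem
        depth += 1
    return depth
```

This is what makes the search amenable to parallelisation: the work at each of the O(log n) levels can be pruned and explored independently.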
"Fast Parallel Hypertree Decompositions in Logarithmic Recursion Depth". Georg Gottlob, Matthias Lanzinger, Cem Okulmus, Reinhard Pichler. ACM Transactions on Database Systems, published 2023-12-30. DOI: 10.1145/3638758.