H. Aamer, J. Hidders, J. Paredaens, J. V. D. Bussche
Motivated by old and new applications, we investigate Datalog as a language for sequence databases. We reconsider classical features of Datalog programs, such as negation, recursion, intermediate predicates, and relations of higher arities. We also consider new features that are useful for sequences, notably, equations between path expressions, and "packing''. Our goal is to clarify the relative expressiveness of all these different features, in the context of sequences. Towards our goal, we establish a number of redundancy and primitivity results, showing that certain features can, or cannot, be expressed in terms of other features. These results paint a complete picture of the expressiveness relationships among all possible Sequence Datalog fragments that can be formed using the six features that we consider.
{"title":"Expressiveness within Sequence Datalog","authors":"H. Aamer, J. Hidders, J. Paredaens, J. V. D. Bussche","doi":"10.1145/3452021.3458327","DOIUrl":"https://doi.org/10.1145/3452021.3458327","url":null,"abstract":"Motivated by old and new applications, we investigate Datalog as a language for sequence databases. We reconsider classical features of Datalog programs, such as negation, recursion, intermediate predicates, and relations of higher arities. We also consider new features that are useful for sequences, notably, equations between path expressions, and \"packing''. Our goal is to clarify the relative expressiveness of all these different features, in the context of sequences. Towards our goal, we establish a number of redundancy and primitivity results, showing that certain features can, or cannot, be expressed in terms of other features. These results paint a complete picture of the expressiveness relationships among all possible Sequence Datalog fragments that can be formed using the six features that we consider.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126399335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, Kuldeep S. Meel
Constraint satisfaction problems (CSP's) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSP's and computation of zeroth frequency moments F0 for data streams. Our investigations lead us to observe striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and F0 computation. We design a recipe for translation of algorithms developed for F0 estimation to that of model counting, resulting in new algorithms for model counting. We then observe that algorithms in the context of distributed streaming can be transformed to distributed algorithms for model counting. We next turn our attention to viewing streaming from the lens of counting and show that framing F0 estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works. In particular, our view yields a state-of-the art algorithm for multidimensional range efficient F0 estimation with a simpler analysis.
{"title":"Model Counting meets F0 Estimation","authors":"A. Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, Kuldeep S. Meel","doi":"10.1145/3452021.3458311","DOIUrl":"https://doi.org/10.1145/3452021.3458311","url":null,"abstract":"Constraint satisfaction problems (CSP's) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSP's and computation of zeroth frequency moments F0 for data streams. Our investigations lead us to observe striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and F0 computation. We design a recipe for translation of algorithms developed for F0 estimation to that of model counting, resulting in new algorithms for model counting. We then observe that algorithms in the context of distributed streaming can be transformed to distributed algorithms for model counting. We next turn our attention to viewing streaming from the lens of counting and show that framing F0 estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works. In particular, our view yields a state-of-the art algorithm for multidimensional range efficient F0 estimation with a simpler analysis.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125526110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the problem of evaluating regular spanners over compressed documents, i.e., we wish to solve evaluation tasks directly on the compressed data, without decompression. As compressed forms of the documents we use straight-line programs (SLPs) --- a lossless compression scheme for textual data widely used in different areas of theoretical computer science and particularly well-suited for algorithmics on compressed data. In data complexity, our results are as follows. For a regular spanner M and an SLP $mathcalS $ of size $mathbfs $ that represents a document D, we can solve the tasks of model checking and of checking non-emptiness in time $O(mathbfs )$. Computing the set $łlbracket M rrbracket(D)$ of all span-tuples extracted from D can be done in time $Ø(mathbfs |łlbracket M rrbracket(D)|)$, and enumeration of $łlbracket M rrbracket(D)$ can be done with linear preprocessing $O(mathbfs )$ and a delay of $O(depthmathcalS )$, where $depthmathcalS $ is the depth of $mathcalS $'s derivation tree. Note that $mathbfs $ can be exponentially smaller than the document's size $|D|$; and, due to known balancing results for SLPs, we can always assume that $depthmathcalS = O(log(|D|))$ independent of D's compressibility. Hence, our enumeration algorithm has a delay logarithmic in the size of the non-compressed data and a preprocessing time that is at best (i.e., in the case of highly compressible documents) also logarithmic, but at worst still linear. Therefore, in a big-data perspective, our enumeration algorithm for SLP-compressed documents may nevertheless beat the known linear preprocessing and constant delay algorithms for non-compressed documents.
我们考虑在压缩文档上评估常规生成器的问题,即,我们希望直接在压缩数据上解决评估任务,而不需要解压缩。作为文件的压缩形式,我们使用直线程序(slp)——一种文本数据的无损压缩方案,广泛应用于理论计算机科学的不同领域,特别适合压缩数据的算法。在数据复杂度方面,我们的结果如下。对于一个普通的扳手M和一个大小为$mathbfs $的SLP $mathcal $表示文档D,我们可以解决模型检查和在时间$O(mathbfs)$检查非空的任务。计算从D中提取的所有跨度元组的集合$łlbracket M rr括号(D)$可以在时间$Ø(mathbfs |łlbracket M rr括号(D)|)$中完成,而$łlbracket M rr括号(D)$的枚举可以通过线性预处理$O(mathbfs)$和延迟$O(depthmathcalS)$来完成,其中$depthmathcalS $是$mathcalS $的派生树的深度。注意$mathbfs $可以比文档的大小指数小$|D|$;并且,由于已知slp的平衡结果,我们总是可以假设$depthmathcal = O(log(|D|))$独立于D的可压缩性。因此,我们的枚举算法在未压缩数据的大小上具有对数级的延迟,而预处理时间在最好的情况下(即在高度可压缩的文档的情况下)也是对数级的,但在最坏的情况下仍然是线性的。因此,从大数据的角度来看,我们针对slp压缩文档的枚举算法可能优于针对非压缩文档的已知线性预处理和恒定延迟算法。
{"title":"Spanner Evaluation over SLP-Compressed Documents","authors":"Markus L. Schmid, Nicole Schweikardt","doi":"10.1145/3452021.3458325","DOIUrl":"https://doi.org/10.1145/3452021.3458325","url":null,"abstract":"We consider the problem of evaluating regular spanners over compressed documents, i.e., we wish to solve evaluation tasks directly on the compressed data, without decompression. As compressed forms of the documents we use straight-line programs (SLPs) --- a lossless compression scheme for textual data widely used in different areas of theoretical computer science and particularly well-suited for algorithmics on compressed data. In data complexity, our results are as follows. For a regular spanner M and an SLP $mathcalS $ of size $mathbfs $ that represents a document D, we can solve the tasks of model checking and of checking non-emptiness in time $O(mathbfs )$. Computing the set $łlbracket M rrbracket(D)$ of all span-tuples extracted from D can be done in time $Ø(mathbfs |łlbracket M rrbracket(D)|)$, and enumeration of $łlbracket M rrbracket(D)$ can be done with linear preprocessing $O(mathbfs )$ and a delay of $O(depthmathcalS )$, where $depthmathcalS $ is the depth of $mathcalS $'s derivation tree. Note that $mathbfs $ can be exponentially smaller than the document's size $|D|$; and, due to known balancing results for SLPs, we can always assume that $depthmathcalS = O(log(|D|))$ independent of D's compressibility. Hence, our enumeration algorithm has a delay logarithmic in the size of the non-compressed data and a preprocessing time that is at best (i.e., in the case of highly compressible documents) also logarithmic, but at worst still linear. Therefore, in a big-data perspective, our enumeration algorithm for SLP-compressed documents may nevertheless beat the known linear preprocessing and constant delay algorithms for non-compressed documents.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133120276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, B. Kimelfeld, Mirek Riedewald
We study the question of when we can provide logarithmic-time direct access to the k-th answer to a Conjunctive Query (CQ) with a specified ordering over the answers, following a preprocessing step that constructs a data structure in time quasilinear in the size of the database. Specifically, we embark on the challenge of identifying the tractable answer orderings that allow for ranked direct access with such complexity guarantees. We begin with lexicographic orderings and give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orderings for every CQ without self-joins. We then continue to the more general orderings by the sum of attribute weights and show for it that ranked direct access is tractable only in trivial cases. Hence, to better understand the computational challenge at hand, we consider the more modest task of providing access to only a single answer (i.e., finding the answer at a given position) - a task that we refer to as the selection problem. We indeed achieve a quasilinear-time algorithm for a subset of the class of full CQs without self-joins, by adopting a solution of Frederickson and Johnson to the classic problem of selection over sorted matrices. We further prove that none of the other queries in this class admit such an algorithm.
{"title":"Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries","authors":"Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, B. Kimelfeld, Mirek Riedewald","doi":"10.1145/3452021.3458331","DOIUrl":"https://doi.org/10.1145/3452021.3458331","url":null,"abstract":"We study the question of when we can provide logarithmic-time direct access to the k-th answer to a Conjunctive Query (CQ) with a specified ordering over the answers, following a preprocessing step that constructs a data structure in time quasilinear in the size of the database. Specifically, we embark on the challenge of identifying the tractable answer orderings that allow for ranked direct access with such complexity guarantees. We begin with lexicographic orderings and give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orderings for every CQ without self-joins. We then continue to the more general orderings by the sum of attribute weights and show for it that ranked direct access is tractable only in trivial cases. Hence, to better understand the computational challenge at hand, we consider the more modest task of providing access to only a single answer (i.e., finding the answer at a given position) - a task that we refer to as the selection problem. We indeed achieve a quasilinear-time algorithm for a subset of the class of full CQs without self-joins, by adopting a solution of Frederickson and Johnson to the classic problem of selection over sorted matrices. We further prove that none of the other queries in this class admit such an algorithm.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"365 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134426077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Since the early days of relational databases, it was realized that acyclic hypergraphs give rise to database schemas with desirable structural and algorithmic properties. In a by-now classical paper, Beeri, Fagin, Maier, and Yannakakis established several different equivalent characterizations of acyclicity; in particular, they showed that the sets of attributes of a schema form an acyclic hypergraph if and only if the local-to-global consistency property for relations over that schema holds, which means that every collection of pairwise consistent relations over the schema is globally consistent. Even though real-life databases consist of bags (multisets), there has not been a study of the interplay between local consistency and global consistency for bags. We embark on such a study here and we first show that the sets of attributes of a schema form an acyclic hypergraph if and only if the local-to-global consistency property for bags over that schema holds. After this, we explore algorithmic aspects of global consistency for bags by analyzing the computational complexity of the global consistency problem for bags: given a collection of bags, are these bags globally consistent? We show that this problem is in NP, even when the schema is part of the input. We then establish the following dichotomy theorem for fixed schemas: if the schema is acyclic, then the global consistency problem for bags is solvable in polynomial time, while if the schema is cyclic, then the global consistency problem for bags is NP-complete. The latter result contrasts sharply with the state of affairs for relations, where, for each fixed schema, the global consistency problem for relations is solvable in polynomial time.
{"title":"Structure and Complexity of Bag Consistency","authors":"Albert Atserias, Phokion G. Kolaitis","doi":"10.1145/3452021.3458329","DOIUrl":"https://doi.org/10.1145/3452021.3458329","url":null,"abstract":"Since the early days of relational databases, it was realized that acyclic hypergraphs give rise to database schemas with desirable structural and algorithmic properties. In a by-now classical paper, Beeri, Fagin, Maier, and Yannakakis established several different equivalent characterizations of acyclicity; in particular, they showed that the sets of attributes of a schema form an acyclic hypergraph if and only if the local-to-global consistency property for relations over that schema holds, which means that every collection of pairwise consistent relations over the schema is globally consistent. Even though real-life databases consist of bags (multisets), there has not been a study of the interplay between local consistency and global consistency for bags. We embark on such a study here and we first show that the sets of attributes of a schema form an acyclic hypergraph if and only if the local-to-global consistency property for bags over that schema holds. After this, we explore algorithmic aspects of global consistency for bags by analyzing the computational complexity of the global consistency problem for bags: given a collection of bags, are these bags globally consistent? We show that this problem is in NP, even when the schema is part of the input. We then establish the following dichotomy theorem for fixed schemas: if the schema is acyclic, then the global consistency problem for bags is solvable in polynomial time, while if the schema is cyclic, then the global consistency problem for bags is NP-complete. The latter result contrasts sharply with the state of affairs for relations, where, for each fixed schema, the global consistency problem for relations is solvable in polynomial time.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134050691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Geerts, Thomas Muñoz, Cristian Riveros, D. Vrgoc
Linear algebra algorithms often require some sort of iteration or recursion as is illustrated by standard algorithms for Gaussian elimination, matrix inversion, and transitive closure. A key characteristic shared by these algorithms is that they allow looping for a number of steps that is bounded by the matrix dimension. In this paper we extend the matrix query language MATLANG with this type of recursion, and show that this suffices to express classical linear algebra algorithms. We study the expressive power of this language and show that it naturally corresponds to arithmetic circuit families, which are often said to capture linear algebra. Furthermore, we analyze several sub-fragments of our language, and show that their expressive power is closely tied to logical formalisms on semiring-annotated relations.
{"title":"Expressive Power of Linear Algebra Query Languages","authors":"F. Geerts, Thomas Muñoz, Cristian Riveros, D. Vrgoc","doi":"10.1145/3452021.3458314","DOIUrl":"https://doi.org/10.1145/3452021.3458314","url":null,"abstract":"Linear algebra algorithms often require some sort of iteration or recursion as is illustrated by standard algorithms for Gaussian elimination, matrix inversion, and transitive closure. A key characteristic shared by these algorithms is that they allow looping for a number of steps that is bounded by the matrix dimension. In this paper we extend the matrix query language MATLANG with this type of recursion, and show that this suffices to express classical linear algebra algorithms. We study the expressive power of this language and show that it naturally corresponds to arithmetic circuit families, which are often said to capture linear algebra. Furthermore, we analyze several sub-fragments of our language, and show that their expressive power is closely tied to logical formalisms on semiring-annotated relations.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130514722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most of the prior work in massively parallel data processing assumes homogeneity, i.e., every computing unit has the same computational capability and can communicate with every other unit with the same latency and bandwidth. However, this strong assumption of a uniform topology rarely holds in practical settings, where computing units are connected through complex networks. To address this issue, Blanas et al. citeblanas2020topology recently proposed a topology-aware massively parallel computation model that integrates the network structure and heterogeneity in the modeling cost. The network is modeled as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints. The computation proceeds in synchronous rounds and the cost of each round is measured as the maximum cost over all the edges in the network. In this work, we take the first step into investigating three fundamental data processing tasks in this topology-aware parallel model: set intersection, cartesian product, and sorting. We focus on network topologies that are tree topologies, and present both lower bounds as well as (asymptotically) matching upper bounds. Instead of assuming a worst-case distribution as in previous results, the optimality of our algorithms is with respect to the initial data distribution among the network nodes. Apart from the theoretical optimality of our results, our protocols are simple, use a constant number of rounds, and we believe can be implemented in practical settings as well.
以前的大部分大规模并行数据处理工作都假设了同质性,即每个计算单元具有相同的计算能力,并且可以在相同的延迟和带宽下与其他每个计算单元进行通信。然而,这种统一拓扑的假设在实际环境中很少成立,因为计算单元是通过复杂的网络连接起来的。为了解决这个问题,Blanas et al. citeblanas2020topology最近提出了一种拓扑感知的大规模并行计算模型,该模型在建模成本中集成了网络结构和异构性。该网络被建模为一个有向图,其中每条边都与一个成本函数相关联,该函数取决于在两个端点之间传输的数据。计算以同步轮进行,每轮的代价以网络中所有边的最大代价来衡量。在这项工作中,我们首先研究了拓扑感知并行模型中的三个基本数据处理任务:集合交集、笛卡尔积和排序。我们关注的是树形拓扑的网络拓扑,并给出了下界和(渐近)匹配的上界。我们的算法的最优性与网络节点之间的初始数据分布有关,而不是像以前的结果那样假设最坏情况分布。除了我们的结果在理论上是最优的,我们的协议很简单,使用恒定的轮数,我们相信也可以在实际环境中实施。
{"title":"Algorithms for a Topology-aware Massively Parallel Computation Model","authors":"Xiao Hu, Paraschos Koutris, Spyros Blanas","doi":"10.1145/3452021.3458318","DOIUrl":"https://doi.org/10.1145/3452021.3458318","url":null,"abstract":"Most of the prior work in massively parallel data processing assumes homogeneity, i.e., every computing unit has the same computational capability and can communicate with every other unit with the same latency and bandwidth. However, this strong assumption of a uniform topology rarely holds in practical settings, where computing units are connected through complex networks. To address this issue, Blanas et al. citeblanas2020topology recently proposed a topology-aware massively parallel computation model that integrates the network structure and heterogeneity in the modeling cost. The network is modeled as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints. The computation proceeds in synchronous rounds and the cost of each round is measured as the maximum cost over all the edges in the network. In this work, we take the first step into investigating three fundamental data processing tasks in this topology-aware parallel model: set intersection, cartesian product, and sorting. We focus on network topologies that are tree topologies, and present both lower bounds as well as (asymptotically) matching upper bounds. Instead of assuming a worst-case distribution as in previous results, the optimality of our algorithms is with respect to the initial data distribution among the network nodes. Apart from the theoretical optimality of our results, our protocols are simple, use a constant number of rounds, and we believe can be implemented in practical settings as well.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133964868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nofar Carmeli, Martin Grohe, P. Lindner, Christoph Standke
Probabilistic databases (PDBs) are probability spaces over database instances. They provide a framework for handling uncertainty in databases, as occurs due to data integration, noisy data, data from unreliable sources or randomized processes. Most of the existing theory literature investigated finite, tuple-independent PDBs (TI-PDBs) where the occurrences of tuples are independent events. Only recently, Grohe and Lindner (PODS '19) introduced independence assumptions for PDBs beyond the finite domain assumption. In the finite, a major argument for discussing the theoretical properties of TI-PDBs is that they can be used to represent any finite PDB via views. This is no longer the case once the number of tuples is countably infinite. In this paper, we systematically study the representability of infinite PDBs in terms of TI-PDBs and the related block-independent disjoint PDBs. The central question is which infinite PDBs are representable as first-order views over tuple-independent PDBs. We give a necessary condition for the representability of PDBs and provide a sufficient criterion for representability in terms of the probability distribution of a PDB. With various examples, we explore the limits of our criteria. We show that conditioning on first order properties yields no additional power in terms of expressivity. Finally, we discuss the relation between purely logical and arithmetic reasons for (non-)representability.
{"title":"Tuple-Independent Representations of Infinite Probabilistic Databases","authors":"Nofar Carmeli, Martin Grohe, P. Lindner, Christoph Standke","doi":"10.1145/3452021.3458315","DOIUrl":"https://doi.org/10.1145/3452021.3458315","url":null,"abstract":"Probabilistic databases (PDBs) are probability spaces over database instances. They provide a framework for handling uncertainty in databases, as occurs due to data integration, noisy data, data from unreliable sources or randomized processes. Most of the existing theory literature investigated finite, tuple-independent PDBs (TI-PDBs) where the occurrences of tuples are independent events. Only recently, Grohe and Lindner (PODS '19) introduced independence assumptions for PDBs beyond the finite domain assumption. In the finite, a major argument for discussing the theoretical properties of TI-PDBs is that they can be used to represent any finite PDB via views. This is no longer the case once the number of tuples is countably infinite. In this paper, we systematically study the representability of infinite PDBs in terms of TI-PDBs and the related block-independent disjoint PDBs. The central question is which infinite PDBs are representable as first-order views over tuple-independent PDBs. We give a necessary condition for the representability of PDBs and provide a sufficient criterion for representability in terms of the probability distribution of a PDB. With various examples, we explore the limits of our criteria. We show that conditioning on first order properties yields no additional power in terms of expressivity. Finally, we discuss the relation between purely logical and arithmetic reasons for (non-)representability.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130236953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study the em generalized model counting problem, defined as follows: given a database, and a set of deterministic tuples, count the number of subsets of the database that include all deterministic tuples and satisfy the query. This problem is computationally equivalent to the evaluation of the query over a tuple-independent probabilistic database where all tuples have probabilities in $set0,frac1 2, 1 $. Previous work has established a dichotomy for Unions of Conjunctive Queries (UCQ) when the probabilities are arbitrary rational numbers, showing that, for each query, its complexity is either in polynomial time or #P-hard. The query is called em safe in the first case, and em unsafe in the second case. Here, we strengthen the hardness proof, by proving that an unsafe UCQ query remains #P-hard even if the probabilities are restricted to $set0,frac1 2, 1 $. This requires a complete redesign of the hardness proof, using new techniques. A related problem is the em model counting problem, which asks for the probability of the query when the input probabilities are restricted to $set0,frac1 2 $. While our result does not extend to model counting for all unsafe UCQs, we prove that model counting is #P-hard for a class of unsafe queries called Type-I forbidden queries.
{"title":"A Dichotomy for the Generalized Model Counting Problem for Unions of Conjunctive Queries","authors":"Batya Kenig, Dan Suciu","doi":"10.1145/3452021.3458313","DOIUrl":"https://doi.org/10.1145/3452021.3458313","url":null,"abstract":"We study the em generalized model counting problem, defined as follows: given a database, and a set of deterministic tuples, count the number of subsets of the database that include all deterministic tuples and satisfy the query. This problem is computationally equivalent to the evaluation of the query over a tuple-independent probabilistic database where all tuples have probabilities in $set0,frac1 2, 1 $. Previous work has established a dichotomy for Unions of Conjunctive Queries (UCQ) when the probabilities are arbitrary rational numbers, showing that, for each query, its complexity is either in polynomial time or #P-hard. The query is called em safe in the first case, and em unsafe in the second case. Here, we strengthen the hardness proof, by proving that an unsafe UCQ query remains #P-hard even if the probabilities are restricted to $set0,frac1 2, 1 $. This requires a complete redesign of the hardness proof, using new techniques. A related problem is the em model counting problem, which asks for the probability of the query when the input probabilities are restricted to $set0,frac1 2 $. While our result does not extend to model counting for all unsafe UCQs, we prove that model counting is #P-hard for a class of unsafe queries called Type-I forbidden queries.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130210269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Numerous fundamental database and reasoning problems are known to be NP-hard in general but tractable on instances where the underlying hypergraph structure is β-acyclic. Despite the importance of many of these problems, there has been little success in generalizing these results beyond acyclicity. In this paper, we take on this challenge and propose nest-set width, a novel generalization of hypergraph β-acyclicity. We demonstrate that nest-set width has desirable properties and algorithmic significance. In particular, evaluation of boolean conjunctive queries with negation is tractable for classes with bounded nest-set width. Furthermore, propositional satisfiability is fixed-parameter tractable when parameterized by nest-set width.
{"title":"Tractability Beyond ß-Acyclicity for Conjunctive Queries with Negation","authors":"Matthias Lanzinger","doi":"10.1145/3452021.3458308","DOIUrl":"https://doi.org/10.1145/3452021.3458308","url":null,"abstract":"Numerous fundamental database and reasoning problems are known to be NP-hard in general but tractable on instances where the underlying hypergraph structure is β-acyclic. Despite the importance of many of these problems, there has been little success in generalizing these results beyond acyclicity. In this paper, we take on this challenge and propose nest-set width, a novel generalization of hypergraph β-acyclicity. We demonstrate that nest-set width has desirable properties and algorithmic significance. In particular, evaluation of boolean conjunctive queries with negation is tractable for classes with bounded nest-set width. Furthermore, propositional satisfiability is fixed-parameter tractable when parameterized by nest-set width.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125403810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}