Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems: Latest Papers

Expressiveness within Sequence Datalog

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2021-06-20 DOI: 10.1145/3452021.3458327

H. Aamer, J. Hidders, J. Paredaens, J. V. D. Bussche

Motivated by old and new applications, we investigate Datalog as a language for sequence databases. We reconsider classical features of Datalog programs, such as negation, recursion, intermediate predicates, and relations of higher arities. We also consider new features that are useful for sequences, notably equations between path expressions and "packing". Our goal is to clarify the relative expressiveness of all these different features in the context of sequences. Towards this goal, we establish a number of redundancy and primitivity results, showing that certain features can, or cannot, be expressed in terms of other features. These results paint a complete picture of the expressiveness relationships among all possible Sequence Datalog fragments that can be formed using the six features that we consider.

Citations: 1

Model Counting meets F0 Estimation

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2021-05-03 DOI: 10.1145/3452021.3458311

A. Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, Kuldeep S. Meel

Constraint satisfaction problems (CSPs) and data stream models are two powerful abstractions that capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently, with little interaction between them. In this work, we investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSPs and computation of the zeroth frequency moment F0 for data streams. Our investigations lead us to observe a striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and F0 computation. We design a recipe for translating algorithms developed for F0 estimation into algorithms for model counting, resulting in new algorithms for model counting. We then observe that algorithms in the context of distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming through the lens of counting and show that framing F0 estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works. In particular, our view yields a state-of-the-art algorithm for multidimensional range-efficient F0 estimation with a simpler analysis.
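
The link between F0 and #DNF counting mentioned in the abstract can be illustrated with a small brute-force sketch. It is not one of the paper's algorithms (those are approximation schemes); it only shows, under assumed encodings, that the number of distinct elements in a stream whose items are described by DNF terms equals the model count of the DNF. The helper names `term_models`, `dnf_model_count`, and `f0` are hypothetical.

```python
from itertools import product

def term_models(term, n):
    """Yield every assignment over n Boolean variables that satisfies one DNF term.

    A term is a dict {variable_index: required_value}; other variables are free.
    """
    free = [i for i in range(n) if i not in term]
    for bits in product([False, True], repeat=len(free)):
        assignment = [None] * n
        for i, value in term.items():
            assignment[i] = value
        for i, bit in zip(free, bits):
            assignment[i] = bit
        yield tuple(assignment)

def dnf_model_count(terms, n):
    """Exact #DNF by brute force: number of distinct satisfying assignments."""
    return len({m for t in terms for m in term_models(t, n)})

def f0(stream):
    """Exact zeroth frequency moment: number of distinct elements in the stream."""
    return len(set(stream))

# DNF over 3 variables: x0 OR x1 OR (x0 AND x1)
terms = [{0: True}, {1: True}, {0: True, 1: True}]
n = 3

# The stream lists, term by term, every assignment each term covers (with repeats).
stream = [m for t in terms for m in term_models(t, n)]

print(len(stream), f0(stream), dnf_model_count(terms, n))
```

The stream contains repeated assignments, so its length overcounts; F0 removes the repeats and matches the model count.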

Citations: 5

Spanner Evaluation over SLP-Compressed Documents

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2021-01-25 DOI: 10.1145/3452021.3458325

Markus L. Schmid, Nicole Schweikardt

We consider the problem of evaluating regular spanners over compressed documents, i.e., we wish to solve evaluation tasks directly on the compressed data, without decompression. As compressed forms of the documents we use straight-line programs (SLPs), a lossless compression scheme for textual data widely used in different areas of theoretical computer science and particularly well-suited for algorithmics on compressed data. In data complexity, our results are as follows. For a regular spanner M and an SLP $\mathcal{S}$ of size $\mathbf{s}$ that represents a document D, we can solve the tasks of model checking and of checking non-emptiness in time $O(\mathbf{s})$. Computing the set $\llbracket M \rrbracket(D)$ of all span-tuples extracted from D can be done in time $O(\mathbf{s} \cdot |\llbracket M \rrbracket(D)|)$, and enumeration of $\llbracket M \rrbracket(D)$ can be done with linear preprocessing $O(\mathbf{s})$ and a delay of $O(\mathrm{depth}(\mathcal{S}))$, where $\mathrm{depth}(\mathcal{S})$ is the depth of $\mathcal{S}$'s derivation tree. Note that $\mathbf{s}$ can be exponentially smaller than the document's size $|D|$; and, due to known balancing results for SLPs, we can always assume that $\mathrm{depth}(\mathcal{S}) = O(\log(|D|))$ independent of D's compressibility. Hence, our enumeration algorithm has a delay logarithmic in the size of the non-compressed data and a preprocessing time that is at best (i.e., in the case of highly compressible documents) also logarithmic, but at worst still linear. Therefore, in a big-data perspective, our enumeration algorithm for SLP-compressed documents may nevertheless beat the known linear preprocessing and constant delay algorithms for non-compressed documents.
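
To make the notions of SLP size and depth concrete, here is a minimal sketch of a straight-line program in Python. The dictionary encoding (terminal rules map to a character, binary rules to a pair of nonterminals) and the helper names are assumptions made for illustration; the sketch does not implement spanner evaluation. It shows a 23-rule SLP that derives a document of about two million characters, and how lengths and single characters can be obtained without decompressing.

```python
def slp_lengths(rules, order):
    """Length of the string derived by each nonterminal, computed bottom-up."""
    length = {}
    for nt in order:
        rhs = rules[nt]
        length[nt] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    return length

def slp_depths(rules, order):
    """Depth of the derivation tree below each nonterminal."""
    depth = {}
    for nt in order:
        rhs = rules[nt]
        depth[nt] = 0 if isinstance(rhs, str) else 1 + max(depth[rhs[0]], depth[rhs[1]])
    return depth

def char_at(rules, length, nt, i):
    """Character at 0-based position i, found by walking one root-to-leaf path."""
    while not isinstance(rules[nt], str):
        left, right = rules[nt]
        if i < length[left]:
            nt = left
        else:
            i -= length[left]
            nt = right
    return rules[nt]

# SLP for the document (ab)^(2^20): A -> a, B -> b, X0 -> A B, X_k -> X_{k-1} X_{k-1}
rules = {"A": "a", "B": "b", "X0": ("A", "B")}
order = ["A", "B", "X0"]
for k in range(1, 21):
    rules[f"X{k}"] = (f"X{k-1}", f"X{k-1}")
    order.append(f"X{k}")

length = slp_lengths(rules, order)
depth = slp_depths(rules, order)
print(len(rules), length["X20"], depth["X20"])
```

Each call to `char_at` touches at most one rule per level of the derivation tree, which is why the depth, rather than the document length, governs the cost of such lookups.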

Citations: 8

Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2020-12-22 DOI: 10.1145/3452021.3458331

Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, B. Kimelfeld, Mirek Riedewald

We study the question of when we can provide logarithmic-time direct access to the k-th answer to a Conjunctive Query (CQ) with a specified ordering over the answers, following a preprocessing step that constructs a data structure in time quasilinear in the size of the database. Specifically, we embark on the challenge of identifying the tractable answer orderings that allow for ranked direct access with such complexity guarantees. We begin with lexicographic orderings and give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orderings for every CQ without self-joins. We then continue to the more general orderings by the sum of attribute weights and show that, for these orderings, ranked direct access is tractable only in trivial cases. Hence, to better understand the computational challenge at hand, we consider the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem. We indeed achieve a quasilinear-time algorithm for a subset of the class of full CQs without self-joins, by adopting a solution of Frederickson and Johnson to the classic problem of selection over sorted matrices. We further prove that none of the other queries in this class admits such an algorithm.
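
As a hedged illustration of ranked direct access, the sketch below handles one specific query, Q(x, y, z) :- R(x, y), S(y, z), under the lexicographic order (x, y, z). The class name `DirectAccess` and the data layout are assumptions for this example, not the paper's general construction. Preprocessing sorts the relations and builds prefix counts; each access is then a binary search.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import accumulate

class DirectAccess:
    """Answers of Q(x, y, z) :- R(x, y), S(y, z) in lexicographic order (x, y, z)."""

    def __init__(self, R, S):
        # Group S by the join attribute y and sort each group by z.
        by_y = defaultdict(list)
        for y, z in set(S):
            by_y[y].append(z)
        for zs in by_y.values():
            zs.sort()
        self.by_y = dict(by_y)
        # Keep only R-tuples that have a join partner, sorted by (x, y).
        self.r = sorted(t for t in set(R) if t[1] in self.by_y)
        # prefix[i] = number of answers contributed by r[0..i].
        self.prefix = list(accumulate(len(self.by_y[y]) for _, y in self.r))

    def __len__(self):
        return self.prefix[-1] if self.prefix else 0

    def __getitem__(self, k):
        if not 0 <= k < len(self):
            raise IndexError(k)
        i = bisect_right(self.prefix, k)      # first R-tuple whose prefix count exceeds k
        before = self.prefix[i - 1] if i else 0
        x, y = self.r[i]
        return (x, y, self.by_y[y][k - before])

R = [(1, "b"), (1, "a"), (2, "a"), (3, "c")]
S = [("a", 10), ("a", 20), ("b", 5)]
access = DirectAccess(R, S)
print(len(access), access[2])
```

The answers are never materialized: the structure stores only the sorted inputs and one counter per R-tuple, so its size stays linear in the database even when the number of answers is quadratic.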

Citations: 19

Structure and Complexity of Bag Consistency

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2020-12-22 DOI: 10.1145/3452021.3458329

Albert Atserias, Phokion G. Kolaitis

Since the early days of relational databases, it was realized that acyclic hypergraphs give rise to database schemas with desirable structural and algorithmic properties. In a by-now classical paper, Beeri, Fagin, Maier, and Yannakakis established several different equivalent characterizations of acyclicity; in particular, they showed that the sets of attributes of a schema form an acyclic hypergraph if and only if the local-to-global consistency property for relations over that schema holds, which means that every collection of pairwise consistent relations over the schema is globally consistent. Even though real-life databases consist of bags (multisets), there has not been a study of the interplay between local consistency and global consistency for bags. We embark on such a study here and we first show that the sets of attributes of a schema form an acyclic hypergraph if and only if the local-to-global consistency property for bags over that schema holds. After this, we explore algorithmic aspects of global consistency for bags by analyzing the computational complexity of the global consistency problem for bags: given a collection of bags, are these bags globally consistent? We show that this problem is in NP, even when the schema is part of the input. We then establish the following dichotomy theorem for fixed schemas: if the schema is acyclic, then the global consistency problem for bags is solvable in polynomial time, while if the schema is cyclic, then the global consistency problem for bags is NP-complete. The latter result contrasts sharply with the state of affairs for relations, where, for each fixed schema, the global consistency problem for relations is solvable in polynomial time.
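
The notion of consistency for bags can be made concrete with a small sketch. Bags are represented here as `collections.Counter` objects mapping tuples to multiplicities, and a witness of consistency is a bag over the union of the attributes whose marginals are exactly the given bags. The function names and the example data are assumptions for illustration; the sketch only verifies a hand-built witness and does not decide global consistency.

```python
from collections import Counter

def marginal(bag, attrs, onto):
    """Project a bag onto the attributes in `onto`, summing multiplicities."""
    idx = [attrs.index(a) for a in onto]
    out = Counter()
    for tup, mult in bag.items():
        out[tuple(tup[i] for i in idx)] += mult
    return out

def is_witness(T, t_attrs, bags):
    """Check that bag T over t_attrs has each given bag as its marginal."""
    return all(marginal(T, t_attrs, attrs) == bag for attrs, bag in bags)

# Two bags over schemas (A, B) and (B, C).
R = Counter({(1, 2): 2, (3, 2): 1})
S = Counter({(2, 5): 1, (2, 6): 2})

# Necessary condition: equal marginals on the shared attribute B.
same_on_B = marginal(R, ("A", "B"), ("B",)) == marginal(S, ("B", "C"), ("B",))

# A hand-built witness over (A, B, C).
T = Counter({(1, 2, 5): 1, (1, 2, 6): 1, (3, 2, 6): 1})
print(same_on_B, is_witness(T, ("A", "B", "C"), [(("A", "B"), R), (("B", "C"), S)]))
```

Note that the witness is not the join of the two bags: multiplicities have to be distributed so that both marginals come out exactly, which is where the bag setting departs from the relational one.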

Citations: 2

Expressive Power of Linear Algebra Query Languages

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2020-10-26 DOI: 10.1145/3452021.3458314

F. Geerts, Thomas Muñoz, Cristian Riveros, D. Vrgoc

Linear algebra algorithms often require some sort of iteration or recursion as is illustrated by standard algorithms for Gaussian elimination, matrix inversion, and transitive closure. A key characteristic shared by these algorithms is that they allow looping for a number of steps that is bounded by the matrix dimension. In this paper we extend the matrix query language MATLANG with this type of recursion, and show that this suffices to express classical linear algebra algorithms. We study the expressive power of this language and show that it naturally corresponds to arithmetic circuit families, which are often said to capture linear algebra. Furthermore, we analyze several sub-fragments of our language, and show that their expressive power is closely tied to logical formalisms on semiring-annotated relations.
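
The kind of recursion described above, a loop whose number of steps is bounded by the matrix dimension, can be sketched in plain Python for transitive closure. This is not MATLANG syntax; the function name and the list-of-lists matrix encoding are assumptions chosen to keep the example self-contained.

```python
def transitive_closure(adj):
    """Transitive closure of a 0/1 adjacency matrix using exactly n loop steps.

    Each step computes reach := reach OR (reach x adj) over the Boolean semiring,
    so after n steps every path of length at most n + 1 has been accounted for.
    """
    n = len(adj)
    reach = [row[:] for row in adj]
    for _ in range(n):  # iteration count bounded by the matrix dimension
        reach = [
            [
                int(bool(reach[i][j]) or any(reach[i][k] and adj[k][j] for k in range(n)))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return reach

# Path graph 0 -> 1 -> 2
adj = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
closure = transitive_closure(adj)
print(closure)
```

The loop never inspects the data to decide when to stop; the bound comes from the dimension alone, which is the feature the abstract singles out.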

Citations: 7

Algorithms for a Topology-aware Massively Parallel Computation Model

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2020-09-24 DOI: 10.1145/3452021.3458318

Xiao Hu, Paraschos Koutris, Spyros Blanas

Most of the prior work in massively parallel data processing assumes homogeneity, i.e., every computing unit has the same computational capability and can communicate with every other unit with the same latency and bandwidth. However, this strong assumption of a uniform topology rarely holds in practical settings, where computing units are connected through complex networks. To address this issue, Blanas et al. recently proposed a topology-aware massively parallel computation model that integrates the network structure and heterogeneity in the modeling cost. The network is modeled as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints. The computation proceeds in synchronous rounds, and the cost of each round is measured as the maximum cost over all the edges in the network. In this work, we take the first step into investigating three fundamental data processing tasks in this topology-aware parallel model: set intersection, cartesian product, and sorting. We focus on network topologies that are tree topologies, and present both lower bounds and (asymptotically) matching upper bounds. Instead of assuming a worst-case distribution as in previous results, the optimality of our algorithms is with respect to the initial data distribution among the network nodes. Apart from the theoretical optimality of our results, our protocols are simple and use a constant number of rounds, and we believe they can be implemented in practical settings as well.
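
The round cost described above (the maximum cost over all edges) can be sketched for a tree topology. The concrete cost function used here, data volume divided by a per-edge bandwidth, is an assumption for illustration, as are the function names and the parent-map encoding of the tree; the abstract only states that each edge has a cost function depending on the data transferred.

```python
from collections import defaultdict

def path_edges(parent, u, v):
    """Directed edges on the unique tree path from u to v.

    `parent` maps each node to its parent; the root maps to None.
    """
    def ancestors(x):
        chain = [x]
        while parent[chain[-1]] is not None:
            chain.append(parent[chain[-1]])
        return chain

    up, down = ancestors(u), ancestors(v)
    common = set(up) & set(down)
    lca = next(x for x in up if x in common)   # lowest common ancestor
    up_part = up[: up.index(lca) + 1]          # u ... lca
    down_part = down[: down.index(lca) + 1]    # v ... lca
    route = up_part + down_part[-2::-1]        # u ... lca ... v
    return list(zip(route, route[1:]))

def round_cost(parent, bandwidth, transfers):
    """Cost of one round: maximum over edges of (data routed over edge / bandwidth)."""
    load = defaultdict(int)
    for src, dst, size in transfers:
        for edge in path_edges(parent, src, dst):
            load[edge] += size
    return max((load[e] / bandwidth[e] for e in load), default=0.0)

# Star topology: switch "s" with machines a, b, c; the links of c are slow.
parent = {"s": None, "a": "s", "b": "s", "c": "s"}
bandwidth = {
    ("a", "s"): 10, ("s", "a"): 10,
    ("b", "s"): 10, ("s", "b"): 10,
    ("c", "s"): 1, ("s", "c"): 1,
}
transfers = [("a", "c", 5), ("b", "c", 5), ("a", "b", 20)]
print(round_cost(parent, bandwidth, transfers))
```

In this example the slow link into c dominates the round even though it carries less data than the link out of a, which is the kind of heterogeneity a uniform-topology model cannot express.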

Citations: 3

Tuple-Independent Representations of Infinite Probabilistic Databases

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2020-08-21 DOI: 10.1145/3452021.3458315

Nofar Carmeli, Martin Grohe, P. Lindner, Christoph Standke

Probabilistic databases (PDBs) are probability spaces over database instances. They provide a framework for handling uncertainty in databases, as occurs due to data integration, noisy data, data from unreliable sources or randomized processes. Most of the existing theory literature has investigated finite, tuple-independent PDBs (TI-PDBs), where the occurrences of tuples are independent events. Only recently, Grohe and Lindner (PODS '19) introduced independence assumptions for PDBs beyond the finite domain assumption. In the finite setting, a major argument for discussing the theoretical properties of TI-PDBs is that they can be used to represent any finite PDB via views. This is no longer the case once the number of tuples is countably infinite. In this paper, we systematically study the representability of infinite PDBs in terms of TI-PDBs and the related block-independent disjoint PDBs. The central question is which infinite PDBs are representable as first-order views over tuple-independent PDBs. We give a necessary condition for the representability of PDBs and provide a sufficient criterion for representability in terms of the probability distribution of a PDB. With various examples, we explore the limits of our criteria. We show that conditioning on first-order properties yields no additional power in terms of expressivity. Finally, we discuss the relation between purely logical and arithmetic reasons for (non-)representability.
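
For the finite case, the definition of a tuple-independent PDB can be written out directly: each fact has a marginal probability, and the probability of an instance is the product over all facts of either that marginal or its complement. The sketch below covers only this finite case; the fact encoding and function names are assumptions for illustration.

```python
from itertools import combinations
from math import isclose, prod

def world_probability(world, marginals):
    """Probability of one instance in a finite tuple-independent PDB."""
    return prod(p if fact in world else 1 - p for fact, p in marginals.items())

def all_worlds(marginals):
    """Enumerate every subset of the facts as a candidate instance."""
    facts = list(marginals)
    for r in range(len(facts) + 1):
        for chosen in combinations(facts, r):
            yield frozenset(chosen)

# Two facts R(1) and R(2) with independent marginal probabilities.
marginals = {("R", 1): 0.5, ("R", 2): 0.25}

p_only_r1 = world_probability(frozenset({("R", 1)}), marginals)
total = sum(world_probability(w, marginals) for w in all_worlds(marginals))
print(p_only_r1, total)
```

The enumeration of all subsets is only feasible because the set of facts is finite, which is exactly the assumption the paper moves beyond.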

Citations: 3

A Dichotomy for the Generalized Model Counting Problem for Unions of Conjunctive Queries

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2020-08-03 DOI: 10.1145/3452021.3458313

Batya Kenig, Dan Suciu

We study the generalized model counting problem, defined as follows: given a database and a set of deterministic tuples, count the number of subsets of the database that include all deterministic tuples and satisfy the query. This problem is computationally equivalent to the evaluation of the query over a tuple-independent probabilistic database where all tuples have probabilities in $\{0, \frac{1}{2}, 1\}$. Previous work has established a dichotomy for Unions of Conjunctive Queries (UCQ) when the probabilities are arbitrary rational numbers, showing that, for each query, its complexity is either in polynomial time or #P-hard. The query is called safe in the first case, and unsafe in the second case. Here, we strengthen the hardness proof by proving that an unsafe UCQ query remains #P-hard even if the probabilities are restricted to $\{0, \frac{1}{2}, 1\}$. This requires a complete redesign of the hardness proof, using new techniques. A related problem is the model counting problem, which asks for the probability of the query when the input probabilities are restricted to $\{0, \frac{1}{2}\}$. While our result does not extend to model counting for all unsafe UCQs, we prove that model counting is #P-hard for a class of unsafe queries called Type-I forbidden queries.



Citations: 7

Tractability Beyond β-Acyclicity for Conjunctive Queries with Negation

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pub Date : 2020-07-17 DOI: 10.1145/3452021.3458308

Matthias Lanzinger

Numerous fundamental database and reasoning problems are known to be NP-hard in general but tractable on instances where the underlying hypergraph structure is β-acyclic. Despite the importance of many of these problems, there has been little success in generalizing these results beyond acyclicity. In this paper, we take on this challenge and propose nest-set width, a novel generalization of hypergraph β-acyclicity. We demonstrate that nest-set width has desirable properties and algorithmic significance. In particular, evaluation of boolean conjunctive queries with negation is tractable for classes with bounded nest-set width. Furthermore, propositional satisfiability is fixed-parameter tractable when parameterized by nest-set width.
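For orientation, β-acyclicity itself can be tested with a short sketch. It assumes the standard nest-point characterization, which the abstract does not state: a vertex is a nest point if the hyperedges containing it form a chain under inclusion, and a hypergraph is β-acyclic exactly when repeatedly deleting nest points removes every vertex. The function names are hypothetical, and the sketch does not compute nest-set width.

```python
def is_nest_point(vertex, edges):
    """True if the edges containing `vertex` form a chain under inclusion."""
    incident = sorted((e for e in edges if vertex in e), key=len)
    # For edges sorted by size, checking consecutive pairs is enough
    # because set inclusion is transitive.
    return all(a <= b for a, b in zip(incident, incident[1:]))


def is_beta_acyclic(edges):
    """Test beta-acyclicity by greedy nest-point elimination."""
    edges = {frozenset(e) for e in edges if e}
    while edges:
        vertices = set().union(*edges)
        nest = next((v for v in vertices if is_nest_point(v, edges)), None)
        if nest is None:
            return False  # no nest point left: a beta-cycle exists
        # Delete the nest point from every edge and drop emptied edges.
        edges = {e - {nest} for e in edges} - {frozenset()}
    return True


path = [{"a", "b"}, {"b", "c"}]
triangle = [{"a", "b"}, {"b", "c"}, {"a", "c"}]
covered_triangle = triangle + [{"a", "b", "c"}]

print(is_beta_acyclic(path))              # True
print(is_beta_acyclic(triangle))          # False
print(is_beta_acyclic(covered_triangle))  # False
```

The last example is a useful contrast: adding the edge {a, b, c} does not make the triangle β-acyclic, because β-acyclicity constrains every subset of the edges.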


Citations: 4