In probabilistic databases the data is uncertain and is modeled by a probability distribution. The central problem in probabilistic databases is query evaluation, which requires not only traditional data processing such as joins, projections, and unions, but also probabilistic inference to compute the probability of each item in the answer. At their core, probabilistic databases are a proposal to integrate logic with probability theory. This paper accompanies a talk given as part of the Gems of PODS series, and describes several results in probabilistic databases, explaining their significance in the broader context of model counting, probabilistic inference, and Statistical Relational Models.
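To make the query-evaluation problem concrete, here is a minimal sketch that computes the probability of a Boolean query by enumerating possible worlds. It assumes a tuple-independent database, which the abstract does not specify. The relation names `R` and `S`, the probabilities, and the helper `query_probability` are all hypothetical. The enumeration is exponential in the number of tuples, so it only illustrates the semantics and is not a practical inference method.

```python
from itertools import product

def query_probability(tuples, query):
    """Sum the probabilities of all possible worlds in which `query` holds.

    `tuples` maps each tuple to its marginal probability. Tuples are assumed
    to be independent of each other (an assumption of this sketch).
    """
    items = list(tuples.items())
    total = 0.0
    for present in product([False, True], repeat=len(items)):
        # A possible world is the set of tuples that happen to be present.
        world = {t for (t, _), keep in zip(items, present) if keep}
        weight = 1.0
        for (_, p), keep in zip(items, present):
            weight *= p if keep else 1.0 - p
        if query(world):
            total += weight
    return total

# Hypothetical data: each tuple carries its own probability.
db = {("R", "a"): 0.5, ("S", "a", "b"): 0.8, ("S", "a", "c"): 0.5}

def q(world):
    # Boolean query: there exist x, y such that R(x) and S(x, y) both hold.
    r_values = {t[1] for t in world if t[0] == "R"}
    return any(t[0] == "S" and t[1] in r_values for t in world)

print(query_probability(db, q))
```

For this particular query the same value follows from independence: 0.5 · (1 − 0.2 · 0.5) = 0.45.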
{"title":"Probabilistic Databases for All","authors":"Dan Suciu","doi":"10.1145/3375395.3389129","DOIUrl":"https://doi.org/10.1145/3375395.3389129","url":null,"abstract":"In probabilistic databases the data is uncertain and is modeled by a probability distribution. The central problem in probabilistic databases is query evaluation, which requires performing not only traditional data processing such as joins, projections, unions, but also probabilistic inference in order to compute the probability of each item in the answer. At their core, probabilistic databases are a proposal to integrate logic with probability theory. This paper accompanies a talk given as part of the Gems of PODS series, and describes several results in probabilistic databases, explaining their significance in the broader context of model counting, probabilistic inference, and Statistical Relational Models.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129658953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Let P be a set of n (non-negatively) weighted points in R^d. We consider the problem of computing a subset of (at most) k diverse and high-valued points of P that lie inside a query range, a problem relevant to many areas such as search engines, recommendation systems, and online stores. The diversity and value of a set of points are measured as functions (say average or minimum) of their pairwise distances and weights, respectively. We study both bicriteria and constrained optimization problems. In the former, we wish to return a set of k points that maximize a weighted sum of their value and diversity measures, and in the latter, we wish to return a set of at most k points that maximize their value and satisfy a diversity constraint. We obtain three main types of results in this paper: (1) Near-linear time (0.5-ε)-approximation algorithms for the bicriteria optimization problem in the offline setting. (2) Near-linear size indexes for the bicriteria optimization problem that, for a query rectangle, return a (0.5-ε)-approximate solution in time O(k polylog(n)). The indexes can be constructed in O(n polylog(n)) time. (3) Near-linear size indexes for answering constrained optimization range queries. For a query rectangle, a 0.5^{O(d)}-approximate solution can be computed in O(k polylog(n)) time. If we allow some of the returned points to lie at most ε outside of the query rectangle, then a (1-ε)-approximate solution can be computed in O(k polylog(n)) time. The indexes are constructed in O(n polylog(n)) and n^{O(1/ε^d)} time, respectively.
{"title":"Efficient Indexes for Diverse Top-k Range Queries","authors":"P. Agarwal, Stavros Sintos, Alex Steiger","doi":"10.1145/3375395.3387667","DOIUrl":"https://doi.org/10.1145/3375395.3387667","url":null,"abstract":"Let P be a set of n (non-negatively) weighted points in Rd. We consider the problem of computing a subset of (at most) k diverse and high-valued points of P that lie inside a query range, a problem relevant to many areas such as search engines, recommendation systems, and online stores. The diversity and value of a set of points are measured as functions (say average or minimum) of their pairwise distances and weights, respectively. We study both bicriteria and constrained optimization problems. In the former, we wish to return a set of k points that maximize a weighted sum of their value and diversity measures, and in the latter, we wish to return a set of at most k points that maximize their value and satisfy a diversity constraint. We obtain three main types of results in this paper: Near-linear time (0.5-ε)-approximation algorithms for the bicriteria optimization problem in the offline setting. Near-linear size indexes for the bicriteria optimization problem that for a query rectangle return a (0.5-ε)-approximate solution in time O(k polylog(n)). The indexes can be constructed in O(n polylog(n)) time. Near-linear size indexes for answering constrained optimization range queries. For a query rectangle, a 0.5O(d)-approximate solution can be computed in O(k polylog(n)) time. If we allow some of the returned points to lie at most ε outside of the query rectangle then an (1-ε)-approximate solution can be computed in O(k polylog(n)) time. 
The indexes are constructed in O(n polylog(n)) and nO(1/εd) time, respectively.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"95 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134359746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","authors":"","doi":"10.1145/3375395","DOIUrl":"https://doi.org/10.1145/3375395","url":null,"abstract":"","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132506214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study consistent query answering with respect to key dependencies. Given a (possibly inconsistent) database instance and a set of key dependencies, a repair is an inclusion-maximal subinstance that satisfies all key dependencies. Consistent query answering for a Boolean query is the following problem: given a database instance as input, is the query true in every repair? In [Koutris and Wijsen, ICDT 2019], it was shown that for every self-join-free Boolean conjunctive query and set of key dependencies containing exactly one key dependency per relation name (also called the primary key), this problem is in FO, L-complete, or coNP-complete, and it is decidable which of the three cases applies. In this paper, we consider the more general case where a relation name can be associated with more than one key dependency. It is shown that in this more general setting, it remains decidable whether or not the above problem is in FO, for self-join-free Boolean conjunctive queries. Moreover, it is possible to effectively construct a first-order query that solves the problem whenever such a query exists.
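The definitions of repair and consistent answer can be illustrated with a brute-force sketch. It covers only the simpler setting of one primary key per relation, where a repair keeps exactly one row from each group of key-equal rows. It does not cover the multiple-key setting studied in the paper. The relation `emp`, the helper names, and the example queries are hypothetical, and the enumeration is exponential, unlike a first-order rewriting.

```python
from itertools import product

def repairs(relation, key_len):
    """Yield every repair of `relation` with respect to a primary key formed
    by the first `key_len` positions of each row."""
    blocks = {}
    for row in relation:
        blocks.setdefault(row[:key_len], []).append(row)
    # A repair keeps exactly one row from each block of key-equal rows.
    for choice in product(*blocks.values()):
        yield set(choice)

def certain(relation, key_len, query):
    """Return True iff the Boolean `query` holds in every repair."""
    return all(query(r) for r in repairs(relation, key_len))

# Hypothetical inconsistent instance: "ann" violates the key on the first column.
emp = [("ann", "paris"), ("ann", "rome"), ("bob", "rome")]

someone_in_rome = lambda r: any(city == "rome" for _, city in r)
ann_in_rome = lambda r: ("ann", "rome") in r

print(certain(emp, 1, someone_in_rome))  # holds in both repairs
print(certain(emp, 1, ann_in_rome))      # fails in the repair that keeps ("ann", "paris")
```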
{"title":"First-Order Rewritability in Consistent Query Answering with Respect to Multiple Keys","authors":"Paraschos Koutris, J. Wijsen","doi":"10.1145/3375395.3387654","DOIUrl":"https://doi.org/10.1145/3375395.3387654","url":null,"abstract":"We study consistent query answering with respect to key dependencies. Given a (possibly inconsistent) database instance and a set of key dependencies, a repair is an inclusion-maximal subinstance that satisfies all key dependencies. Consistent query answering for a Boolean query is the following problem: given a database instance as input, is the query true in every repair? In [Koutris and Wijsen, ICDT 2019], it was shown that for every self-join-free Boolean conjunctive query and set of key dependencies containing exactly one key dependency per relation name (also called the primary key), this problem is in FO, L-complete, or coNP-complete, and it is decidable which of the three cases applies. In this paper, we consider the more general case where a relation name can be associated with more than one key dependency. It is shown that in this more general setting, it remains decidable whether or not the above problem is in FO, for self-join-free Boolean conjunctive queries. 
Moreover, it is possible to effectively construct a first-order query that solves the problem whenever such a query exists.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130468099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we design massively parallel algorithms for sparse matrix multiplication, as well as more general join-aggregate queries, where the join hypergraph is a tree with arbitrary output attributes. For each case, we obtain asymptotic improvement over existing algorithms. In particular, our matrix multiplication algorithm is shown to be optimal in the semiring model.
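The link between sparse matrix multiplication and join-aggregate queries can be sketched on a single machine. The product C(i, k) = Σ_j A(i, j) · B(j, k) is a join of A and B on j followed by an aggregation over j, and it only needs the two semiring operations. The code below is a sequential illustration under those assumptions, not the massively parallel algorithm of the paper. The dictionary representation and the function name `sparse_matmul` are hypothetical.

```python
from collections import defaultdict
import operator

def sparse_matmul(A, B, add=operator.add, mul=operator.mul):
    """Multiply sparse matrices stored as {(row, col): value} dictionaries.

    Values are combined only through `add` and `mul`, so any semiring works.
    """
    # Index B by its row attribute j: this is the build side of the join on j.
    b_by_row = defaultdict(list)
    for (j, k), v in B.items():
        b_by_row[j].append((k, v))
    C = {}
    for (i, j), a in A.items():
        for k, b in b_by_row.get(j, ()):
            term = mul(a, b)
            # Aggregate over the join attribute j, grouping by (i, k).
            C[(i, k)] = add(C[(i, k)], term) if (i, k) in C else term
    return C

A = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
B = {(0, 0): 4, (1, 0): 5, (1, 2): 6}

print(sparse_matmul(A, B))                             # ordinary (+, *) semiring
print(sparse_matmul(A, B, add=min, mul=operator.add))  # (min, +) semiring
```

Swapping in `min` and `+` turns the same join-aggregate into a one-step shortest-path computation, which is why restricting the algorithm to semiring operations is a natural model.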
{"title":"Parallel Algorithms for Sparse Matrix Multiplication and Join-Aggregate Queries","authors":"Xiao Hu, K. Yi","doi":"10.1145/3375395.3387657","DOIUrl":"https://doi.org/10.1145/3375395.3387657","url":null,"abstract":"In this paper, we design massively parallel algorithms for sparse matrix multiplication, as well as more general join-aggregate queries, where the join hypergraph is a tree with arbitrary output attributes. For each case, we obtain asymptotic improvement over existing algorithms. In particular, our matrix multiplication algorithm is shown to be optimal in the semiring model.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115379199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The standard notion of query answering over incomplete databases is that of certain answers, guaranteeing correctness regardless of how incomplete data is interpreted. In the majority of real-life databases, relations have numerical columns and queries use arithmetic and comparisons. Even though the notion of certain answers still applies, we explain that it becomes much more problematic when missing data occurs in numerical columns. We propose a new general framework that allows us to assign a measure of certainty to query answers. We test it in the agnostic scenario where we do not have prior information about values of numerical attributes, similarly to the predominant approach in handling incomplete data, which assumes that each null can be interpreted as an arbitrary value of the domain. The key technical challenge is the lack of a uniform distribution over the entire domain of numerical attributes, such as real numbers. We overcome this by associating the measure of certainty with the asymptotic behavior of volumes of some subsets of the Euclidean space. We show that this measure is well-defined, and describe approaches to computing and approximating it. While it can be computationally hard, or result in an irrational number, even for simple constraints, we produce polynomial-time randomized approximation schemes with multiplicative guarantees for conjunctive queries, and with additive guarantees for arbitrary first-order queries. We also describe a set of experimental results to confirm the feasibility of this approach.
{"title":"Queries with Arithmetic on Incomplete Databases","authors":"Marco Console, M. Hofer, L. Libkin","doi":"10.1145/3375395.3387666","DOIUrl":"https://doi.org/10.1145/3375395.3387666","url":null,"abstract":"The standard notion of query answering over incomplete database is that of certain answers, guaranteeing correctness regardless of how incomplete data is interpreted. In majority of real-life databases, relations have numerical columns and queries use arithmetic and comparisons. Even though the notion of certain answers still applies, we explain that it becomes much more problematic in situations when missing data occurs in numerical columns. We propose a new general framework that allows us to assign a measure of certainty to query answers. We test it in the agnostic scenario where we do not have prior information about values of numerical attributes, similarly to the predominant approach in handling incomplete data which assumes that each null can be interpreted as an arbitrary value of the domain. The key technical challenge is the lack of a uniform distribution over the entire domain of numerical attributes, such as real numbers. We overcome this by associating the measure of certainty with the asymptotic behavior of volumes of some subsets of the Euclidean space. We show that this measure is well-defined, and describe approaches to computing and approximating it. While it can be computationally hard, or result in an irrational number, even for simple constraints, we produce polynomial-time randomized approximation schemes with multiplicative guarantees for conjunctive queries, and with additive guarantees for arbitrary first-order queries. 
We also describe a set of experimental results to confirm the feasibility of this approach.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123659555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Register automata have been used as a convenient model for specifying and verifying database-driven systems. An important problem in such systems is to provide views that hide or restructure certain information about the data or process, extending classical notions of database views. In this paper we carry out a formal investigation of views of register automata by considering simple views that project away some of the registers. We show that classical register automata are not able to describe such projections and introduce more powerful register automata that are able to do so. We also show useful properties of these automata such as closure under projection and decidability of verifying temporal properties of their runs.
{"title":"Projection Views of Register Automata","authors":"L. Segoufin, V. Vianu","doi":"10.1145/3375395.3387651","DOIUrl":"https://doi.org/10.1145/3375395.3387651","url":null,"abstract":"Register automata have been used as a convenient model for specifying and verifying database driven systems. An important problem in such systems is to provide views that hide or restructure certain information about the data or process, extending classical notions of database views. In this paper we carry out a formal investigation of views of register automata by considering simple views that project away some of the registers. We show that classical register automata are not able to describe such projections and introduce more powerful register automata that are able to do so. We also show useful properties of these automata such as closure under projection and decidability of verifying temporal properties of their runs.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134443011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The PODS Executive Committee has appointed us to serve as the Award Committee for 2020. The committee would like to state that PODS 2010 boasted an exceptional set of influential papers, attesting to the strength and relevance of the field. We received a significant number of nominations for different truly excellent papers. After careful consideration and having solicited external nominations and advice, we have selected the following paper as the award winner for 2020:
{"title":"2020 ACM PODS Alberto O. Mendelzon Test-of-Time Award","authors":"G. Gottlob, J. V. D. Bussche, D. V. Gucht","doi":"10.1145/3375395.3387723","DOIUrl":"https://doi.org/10.1145/3375395.3387723","url":null,"abstract":"The PODS Executive Committee has appointed us to serve as the Award Committee for 2020. The committee would like to state that PODS 2010 boasted an exceptional set of influential papers, attesting to the strength and relevance of the field. We received a significant number of nominations for different truly excellent papers. After careful consideration and having solicited external nominations and advice, we have selected the following paper as the award winner for 2020:","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125394311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marco Console, P. Guagliardo, L. Libkin, Etienne Toussaint
Handling incomplete data in a correct manner is a notoriously hard problem in databases. Theoretical approaches rely on the computationally hard notion of certain answers, while practical solutions rely on ad hoc query evaluation techniques based on three-valued logic. Can we find a middle ground, and produce correct answers efficiently? The paper surveys results of the last few years motivated by this question. We re-examine the notion of certainty itself, and show that it is much more varied than previously thought. We identify cases when certain answers can be computed efficiently and, short of that, provide deterministic and probabilistic approximation schemes for them. We look at the role of three-valued logic as used in SQL query evaluation, and discuss the correctness of the choice, as well as the necessity of such a logic for producing query answers.
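As a small illustration of the three-valued logic mentioned above, the sketch below models SQL's unknown truth value with Python's `None` and keeps a row only when the condition evaluates to true. The helper names (`and3`, `or3`, `not3`, `eq3`, `select`) are hypothetical, and the code models only equality comparisons over single-column rows.

```python
def and3(a, b):
    """Three-valued AND: false dominates, then unknown (None)."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    """Three-valued OR: true dominates, then unknown (None)."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def not3(a):
    """Three-valued NOT: unknown stays unknown."""
    return None if a is None else not a

def eq3(x, y):
    """A comparison that involves a null evaluates to unknown."""
    return None if x is None or y is None else x == y

def select(rows, cond):
    """Mimic WHERE: keep a row only if the condition is true, not unknown."""
    return [r for r in rows if cond(r) is True]

rows = [1, None, 2]
# The condition "x = 1 OR NOT (x = 1)" holds for every possible value of x,
# yet the row containing a null is dropped because the condition is unknown.
print(select(rows, lambda x: or3(eq3(x, 1), not3(eq3(x, 1)))))
```

The dropped row is the kind of mismatch between SQL evaluation and certainty that motivates the survey's question about the correctness of three-valued logic.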
{"title":"Coping with Incomplete Data: Recent Advances","authors":"Marco Console, P. Guagliardo, L. Libkin, Etienne Toussaint","doi":"10.1145/3375395.3387970","DOIUrl":"https://doi.org/10.1145/3375395.3387970","url":null,"abstract":"Handling incomplete data in a correct manner is a notoriously hard problem in databases. Theoretical approaches rely on the computationally hard notion of certain answers, while practical solutions rely on ad hoc query evaluation techniques based on three-valued logic. Can we find a middle ground, and produce correct answers efficiently? The paper surveys results of the last few years motivated by this question. We re-examine the notion of certainty itself, and show that it is much more varied than previously thought. We identify cases when certain answers can be computed efficiently and, short of that, provide deterministic and probabilistic approximation schemes for them. We look at the role of three-valued logic as used in SQL query evaluation, and discuss the correctness of the choice, as well as the necessity of such a logic for producing query answers.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132920208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bas Ketsman, Christoph Koch, F. Neven, Brecht Vandevoort
While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application correctness with the performance benefit of the lower isolation level. While the robustness problem has received considerable attention in the literature, only sufficient conditions have been obtained. The most notable exception is the seminal work by Fekete where he obtained a characterization for deciding robustness against SNAPSHOT ISOLATION. In this paper, we address the robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED, which are defined in terms of the forbidden dirty write and dirty read patterns. The first main contribution of this paper is that we characterize robustness against both isolation levels in terms of the absence of counterexample schedules of a specific form (split and multi-split schedules) and by the absence of cycles in interference graphs that satisfy various properties. A critical difference with Fekete's work is that the properties of cycles obtained in this paper have to take the relative ordering of operations within transactions into account, as READ UNCOMMITTED and READ COMMITTED do not satisfy the atomic visibility requirement. A particular consequence is that the latter renders the robustness problem against READ COMMITTED coNP-complete. The second main contribution of this paper is the coNP-hardness proof. For READ UNCOMMITTED, we obtain LOGSPACE-completeness.
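Robustness quantifies over all schedules that an isolation level allows, so it is helpful to see how a single schedule is tested first. The sketch below uses the textbook conflict-graph test for conflict serializability, which is a swapped-in standard technique and not the characterization from the paper. The schedule encoding as `(transaction, operation, object)` triples and the function name are hypothetical, and commits are not modeled.

```python
def conflict_serializable(schedule):
    """Return True iff the conflict graph of `schedule` is acyclic.

    `schedule` is a list of (txn, op, obj) triples with op in {"R", "W"}.
    """
    edges = {}
    for i, (t1, op1, x1) in enumerate(schedule):
        edges.setdefault(t1, set())
        for t2, op2, x2 in schedule[i + 1:]:
            # Two operations conflict if they belong to different transactions,
            # touch the same object, and at least one of them is a write.
            if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                edges[t1].add(t2)

    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in edges}

    def has_cycle(t):
        # Depth-first search; reaching a grey node closes a cycle.
        color[t] = GREY
        for u in edges[t]:
            if color[u] == GREY or (color[u] == WHITE and has_cycle(u)):
                return True
        color[t] = BLACK
        return False

    return not any(color[t] == WHITE and has_cycle(t) for t in edges)

# Hypothetical schedules over a single object x.
lost_update = [("T1", "R", "x"), ("T2", "R", "x"), ("T2", "W", "x"), ("T1", "W", "x")]
serial = [("T1", "R", "x"), ("T1", "W", "x"), ("T2", "R", "x"), ("T2", "W", "x")]

print(conflict_serializable(lost_update))
print(conflict_serializable(serial))
```

Deciding robustness would require reasoning about every interleaving permitted by the isolation level, which is where the split and multi-split schedules described in the abstract come in.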
{"title":"Deciding Robustness for Lower SQL Isolation Levels","authors":"Bas Ketsman, Christoph Koch, F. Neven, Brecht Vandevoort","doi":"10.1145/3375395.3387655","DOIUrl":"https://doi.org/10.1145/3375395.3387655","url":null,"abstract":"While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application correctness with the performance benefit of the lower isolation level. While the robustness problem has received considerable attention in the literature, only sufficient conditions have been obtained. The most notable exception is the seminal work by Fekete where he obtained a characterization for deciding robustness against SNAPSHOT ISOLATION. In this paper, we address the robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED which are defined in terms of the forbidden dirty write and dirty read patterns. The first main contribution of this paper is that we characterize robustness against both isolation levels in terms of the absence of counter example schedules of a specific form (split and multi-split schedules) and by the absence of cycles in interference graphs that satisfy various properties. A critical difference with Fekete's work, is that the properties of cycles obtained in this paper have to take the relative ordering of operations within transactions into account as READ UNCOMMITTED and READ COMMITTED do not satisfy the atomic visibility requirement. A particular consequence is that the latter renders the robustness problem against READ COMMITTED coNP-complete. The second main contribution of this paper is the coNP-hardness proof. 
For READ UNCOMMITTED, we obtain LOGSPACE-completeness.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116387996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}