Pub Date: 2020-01-01 | DOI: 10.4230/LIPIcs.ICDT.2020.17
Jelle Hellings, Mohammad Sadoghi
State-of-the-art fault-tolerant and federated data management systems rely on fully-replicated designs in which all participants have equivalent roles. Consequently, these systems offer only limited scalability and are ill-suited for high-performance data management. As an alternative, we propose a hierarchical design in which a Byzantine cluster manages data, while an arbitrary number of learners can reliably learn these updates and use the corresponding data. To realize our design, we propose the delayed-replication algorithm, an efficient solution to the Byzantine learner problem that is central to our design. The delayed-replication algorithm is coordination-free, scalable, and has minimal communication cost for all participants involved. In this way, the delayed-replication algorithm opens the door to new high-performance fault-tolerant and federated data management systems. To illustrate this, we show that the delayed-replication algorithm is not only useful for supporting specialized learners, but can also be used to reduce the overall communication cost of permissioned blockchains and to improve their storage scalability.
"Coordination-Free Byzantine Replication with Minimal Communication Costs". Database theory (ICDT): International Conference on Database Theory proceedings, vol. 56(1), pp. 17:1-17:20.
Pub Date: 2019-09-25 | DOI: 10.4230/LIPIcs.ICDT.2020.6
P. Barceló, N. Higuera, Jorge Pérez, Bernardo Subercaseaux
We study the expressive power of the LARA language -- a recently proposed unified model for expressing relational and linear algebra operations -- both in terms of traditional database query languages and of some analytic tasks often performed in machine learning pipelines. We start by showing that LARA is expressively complete with respect to first-order logic with aggregation. Since LARA is parameterized by a set of user-defined functions that transform values in tables, the exact expressive power of the language depends on how these functions are defined. We distinguish two main cases depending on the level of genericity the queries are required to satisfy. Under strong genericity assumptions the language cannot express matrix convolution, a very important operation in current machine learning pipelines. The language is also local, and thus cannot express operations, such as matrix inverse, that exhibit recursive behavior. To express convolution, one can relax the genericity requirement by adding an underlying linear order on the domain. This, however, destroys locality and makes the expressive power of the language much harder to understand. In particular, although under complexity-theoretic assumptions the resulting language still cannot express matrix inverse, a proof of this fact without such assumptions seems challenging to obtain.
"On the Expressiveness of LARA: A Unified Language for Linear and Relational Algebra". Database theory (ICDT): International Conference on Database Theory proceedings, vol. 34(1), pp. 6:1-6:20.
Pub Date: 2019-09-01 | DOI: 10.4230/LIPIcs.ICDT.2020.22
R. Pagh, J. Sivertsen
Motivated by the problem of filtering candidate pairs in inner product similarity joins, we study the following inner product estimation problem: given parameters $d \in \mathbb{N}$, $\alpha > \beta \geq 0$, and unit vectors $x, y \in \mathbb{R}^{d}$, consider the task of distinguishing between the cases $\langle x, y \rangle \leq \beta$ and $\langle x, y \rangle \geq \alpha$, where $\langle x, y \rangle = \sum_{i=1}^{d} x_i y_i$ is the inner product of the vectors $x$ and $y$. The goal is to distinguish these cases based on information about each vector encoded independently in a bit string of the shortest possible length. In contrast to much work on compressing vectors using randomized dimensionality reduction, we seek to solve the problem deterministically, with no probability of error. Inner product estimation can be solved in general by estimating $\langle x, y \rangle$ with an additive error bounded by $\varepsilon = \alpha - \beta$. We show that $d \log_2 \left(\tfrac{\sqrt{1-\beta}}{\varepsilon}\right) \pm \Theta(d)$ bits of information about each vector are necessary and sufficient. Our upper bound is constructive and improves a known upper bound of $d \log_2(1/\varepsilon) + O(d)$ by up to a factor of 2 when $\beta$ is close to $1$. The lower bound holds even in a stronger model where one of the vectors is known exactly, and an arbitrary estimation function is allowed.
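The general reduction in the abstract (estimate the inner product within additive error $\varepsilon$, then threshold) can be illustrated with a much weaker deterministic baseline than the paper's construction: round every coordinate of a unit vector to a fixed number of fractional bits and compare the decoded inner product against the midpoint of $[\beta, \alpha]$. This sketch is purely illustrative; the encoding and bit budget here are assumptions, not the paper's scheme.

```python
# Naive deterministic baseline for the inner product filter problem:
# each unit vector is encoded independently by rounding every coordinate
# to k fractional bits. This uses more bits than the paper's bound; it only
# illustrates the "estimate, then threshold at the midpoint" reduction.

def encode(v, k):
    """Encode a vector as signed fixed-point integers with k fractional bits."""
    return [round(x * 2**k) for x in v]

def approx_inner(cx, cy, k):
    """Inner product of the decoded (dequantized) vectors."""
    return sum(a * b for a, b in zip(cx, cy)) / 2**(2 * k)

def distinguish(cx, cy, k, alpha, beta):
    """Report 'high' if <x,y> >= alpha and 'low' if <x,y> <= beta, valid
    whenever the total quantization error stays below (alpha - beta) / 2."""
    mid = (alpha + beta) / 2
    return "high" if approx_inner(cx, cy, k) >= mid else "low"
```

For example, with `k = 8` and $d = 4$, two identical unit vectors decode to inner product 1 and are classified "high" for $\alpha = 0.8$, $\beta = 0.2$, while orthogonal ones are classified "low".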
"The space complexity of inner product filters". Database theory (ICDT): International Conference on Database Theory proceedings, vol. 36(1), pp. 22:1-22:14.
Pub Date: 2019-08-30 | DOI: 10.46298/lmcs-18(1:21)2022
J. Doleschal, B. Kimelfeld, W. Martens, L. Peterfreund
The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of VSet-automata -- a restricted class of transducers that mark the endpoints of selected spans. In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the positive RA operators via the semiring operators. Hence, the proposed spanner extension, referred to as an annotator, maps every string into an annotated relation over the spans. As a specific instantiation, we explore weighted VSet-automata that, similarly to weighted automata and transducers, attach semiring elements to transitions. We investigate key aspects of expressiveness, such as the closure under the positive RA, and key aspects of computational complexity, such as the enumeration of annotated answers and their ranked enumeration in the case of ordered semirings. For a number of these problems, fundamental properties of the underlying semiring, such as positivity, are crucial for establishing tractability.
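The propagation of semiring annotations through the positive RA operators can be sketched on ordinary annotated relations, leaving the spanner and VSet-automaton machinery aside: join combines annotations with the semiring product, union with the semiring sum. The sketch below hard-codes the counting semiring; the relation contents are invented for illustration.

```python
# Provenance-semiring annotation in the style of Green et al.: each tuple
# carries an element of a commutative semiring. Here the counting semiring
# (N, +, *, 0, 1) is hard-coded; annotations count derivations.

def annotated_union(r, s):
    """Union: annotations of a tuple derived both ways are summed."""
    out = dict(r)
    for t, a in s.items():
        out[t] = out.get(t, 0) + a  # semiring sum
    return out

def annotated_join(r, s):
    """Natural join of binary relations on the shared middle attribute:
    annotations multiply, and alternative derivations are summed."""
    out = {}
    for (a, b), w1 in r.items():
        for (b2, c), w2 in s.items():
            if b == b2:
                out[(a, b, c)] = out.get((a, b, c), 0) + w1 * w2
    return out
```

Swapping in, say, a min/max semiring for confidence scores only changes the two combining operations, which is exactly the point of the semiring abstraction.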
"Weight Annotation in Information Extraction". Database theory (ICDT): International Conference on Database Theory proceedings, vol. 42(1), pp. 8:1-8:18.
Pub Date: 2019-04-18 | DOI: 10.46298/lmcs-17(3:22)2021
Ester Livshits, L. Bertossi, B. Kimelfeld, Moshe Sebag
We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory, used in many applications of game theory to assess the contribution of a player to a coalition game. It was established in the 1950s, and is theoretically justified as the unique wealth-distribution measure that satisfies certain natural axioms. While this value has been investigated in several areas, it has received little attention in data management. We study this measure in the context of conjunctive and aggregate queries by defining corresponding coalition games. We provide algorithmic and complexity-theoretic results on the computation of Shapley-based contributions to query answers; for the hard cases we present approximation algorithms.
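The definition underlying such coalition games can be made concrete with a brute-force sketch: treat the database facts as players and a Boolean query as the wealth function, and average each fact's marginal contribution over all orders. This exponential computation is an assumption-free illustration of the definition only, and is exactly the cost that motivates complexity and approximation results.

```python
from itertools import permutations

def shapley(players, v):
    """Exact Shapley values: average each player's marginal contribution
    v(S + p) - v(S) over all orderings of the players (exponential time)."""
    total = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for p in order:
            total[p] += v(coalition | {p}) - v(coalition)
            coalition.add(p)
    return {p: x / len(orders) for p, x in total.items()}
```

For a Boolean query satisfied as soon as any one of two supporting facts is present, each fact receives value 1/2; a fact on which the query does not depend receives 0.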
"The Shapley Value of Tuples in Query Answering". Database theory (ICDT): International Conference on Database Theory proceedings, vol. 3(1), pp. 20:1-20:19.
Pub Date: 2019-04-14 | DOI: 10.46298/lmcs-18(1:34)2022
Martin Grohe, P. Lindner
Probabilistic databases (PDBs) model uncertainty in data in a quantitative way. In the established formal framework, probabilistic (relational) databases are finite probability spaces over relational database instances. This finiteness can clash with intuitive query behavior (Ceylan et al., KR 2016), and with application scenarios that are better modeled by continuous probability distributions (Dalvi et al., CACM 2009). We formally introduced infinite PDBs in (Grohe and Lindner, PODS 2019) with a primary focus on countably infinite spaces. However, an extension beyond countable probability spaces raises nontrivial foundational issues concerning the measurability of events and queries, and ultimately the question of whether queries have a well-defined semantics. We argue that finite point processes are an appropriate model from probability theory for dealing with general probabilistic databases. This allows us to construct suitable (uncountable) probability spaces of database instances in a systematic way. Our main technical results are measurability statements for relational algebra queries as well as aggregate queries and Datalog queries.
"Infinite Probabilistic Databases". Database theory (ICDT): International Conference on Database Theory proceedings, vol. 211(1), pp. 16:1-16:20.
Pub Date: 2019-03-26 | DOI: 10.4230/LIPICS.ICDT.2019.17
M. Calautti, Andreas Pieris
The chase procedure is one of the most fundamental algorithmic tools in database theory. A key algorithmic task is uniform chase termination, i.e., given a set of tuple-generating dependencies (tgds), is it the case that the chase under this set of tgds terminates, for every input database? In view of the fact that this problem is undecidable, no matter which version of the chase we consider, it is natural to ask whether well-behaved classes of tgds, introduced in different contexts such as ontological reasoning, make our problem decidable. In this work, we consider a prominent decidability paradigm for tgds, called stickiness. We show that for sticky sets of tgds, uniform chase termination is decidable if we focus on the (semi-)oblivious chase, and we pinpoint its exact complexity: PSPACE-complete in general, and NLOGSPACE-complete for predicates of bounded arity. These complexity results are obtained via graph-based syntactic characterizations of chase termination that are of independent interest.
2012 ACM Subject Classification: Theory of Computation → Database query languages (principles); database constraints theory; logic and databases.
"Oblivious Chase Termination: The Sticky Case". Database theory (ICDT): International Conference on Database Theory proceedings, vol. 26(1), pp. 17:1-17:18.
Pub Date: 2019-03-01 | DOI: 10.4230/LIPIcs.ICDT.2017.10
D. Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, W. Tan
We develop a unifying approach to declarative entity linking by introducing the notion of an entity linking framework and an accompanying notion of the certain links in such a framework. In an entity linking framework, logic-based constraints are used to express properties of the desired link relations in terms of source relations and, possibly, in terms of other link relations. The definition of the certain links in such a framework makes use of weighted repairs and consistent answers in inconsistent databases. We demonstrate the modeling capabilities of this approach by showing that numerous concrete entity linking scenarios can be cast as such entity linking frameworks for suitable choices of constraints and weights. By using the certain links as a measure of expressive power, we investigate the relative expressive power of several entity linking frameworks and obtain sharp comparisons. In particular, we show that we gain expressive power if we allow constraints that capture non-recursive collective entity resolution, where link relations may depend on other link relations (and not just on source relations). Moreover, we show that an increase in expressive power also takes place when we allow constraints that incorporate preferences as an additional mechanism for expressing "goodness" of links.
"Expressive Power of Entity-Linking Frameworks". Database theory (ICDT): International Conference on Database Theory proceedings, vol. 7(1), pp. 10:1-10:18.
Pub Date: 2019-02-07 | DOI: 10.4230/LIPIcs.ICDT.2021.5
Shaleen Deep, Paraschos Koutris
We investigate the enumeration of the top-k answers to a conjunctive query over a relational database according to a given ranking function. The task is to design data structures and algorithms that allow for efficient enumeration after a preprocessing phase. Our main contribution is a novel priority-queue-based algorithm with near-optimal delay and non-trivial space guarantees that are output-sensitive and depend on the structure of the query. In particular, we exploit certain desirable properties of ranking functions that frequently occur in practice, together with degree information in the database instance, to allow for efficient enumeration. We introduce the notions of decomposable and compatible ranking functions in conjunction with query decomposition, properties that allow for partial aggregation of tuple scores in order to efficiently enumerate the ranked output. We complement the algorithmic results with lower bounds justifying why certain assumptions about the properties of ranking functions are necessary, and we discuss popular conjectures providing evidence for the optimality of the enumeration delay guarantees. Our results extend and improve upon a long line of work that has studied ranked enumeration from both theoretical and practical perspectives.
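A toy instance of the priority-queue idea: for a join of two sorted lists ranked by a sum score (sum being the simplest decomposable ranking function), a frontier heap emits answers in rank order with logarithmic delay per answer. Everything below is an illustrative textbook pattern, not the paper's algorithm or its guarantees.

```python
import heapq

def ranked_pairs(A, B):
    """Enumerate all pairs (a, b) in nondecreasing order of a + b, assuming
    A and B are sorted ascending. A frontier heap of candidate index pairs
    gives O(log n) delay per answer after O(1) preprocessing."""
    heap = [(A[0] + B[0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        _, i, j = heapq.heappop(heap)
        yield (A[i], B[j])
        # push the two frontier neighbors of the emitted cell
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(A) and nj < len(B) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (A[ni] + B[nj], ni, nj))
```

Stopping after k pops yields exactly the top-k pairs without materializing the full join, which is the output-sensitivity the abstract refers to.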
"Ranked Enumeration of Conjunctive Query Results". Database theory (ICDT): International Conference on Database Theory proceedings, vol. 41(1), pp. 5:1-5:19.
Pub Date: 2019-01-21 | DOI: 10.4230/LIPICS.ICDT.2019.12
S. Maniu, P. Senellart, Suraj Jog
Treewidth is a parameter that measures how tree-like a relational instance is, and whether it can reasonably be decomposed into a tree. Many computational tasks are known to be tractable on databases of small treewidth, but computing the treewidth of a given instance is intractable. This article is the first large-scale experimental study of the treewidth and tree decompositions of real-world database instances (25 datasets from 8 different domains, with sizes ranging from a few thousand to a few million vertices). The goal is to determine which data, if any, can benefit from the wealth of algorithms for databases of small treewidth. For each dataset, we obtain upper and lower bound estimates of its treewidth, and study the properties of its tree decompositions. We show in particular that, even when treewidth is high, using partial tree decompositions can result in data structures that can assist algorithms.
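Since exact treewidth is intractable, upper-bound estimates in this kind of study typically come from elimination-ordering heuristics. A minimal min-degree sketch (one common heuristic, assumed here for illustration and not necessarily the one used in the article):

```python
def treewidth_upper_bound(adj):
    """Min-degree elimination heuristic: repeatedly eliminate a vertex of
    minimum degree, turning its neighborhood into a clique. The largest
    neighborhood size seen at elimination time upper-bounds the treewidth."""
    adj = {v: set(ns) for v, ns in adj.items()}  # defensive copy
    width = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        ns = adj.pop(v)
        width = max(width, len(ns))
        for a in ns:
            adj[a].discard(v)
        for a in ns:          # make the neighborhood a clique
            for b in ns:
                if a != b:
                    adj[a].add(b)
    return width
```

On a path the heuristic returns the exact treewidth 1, and on a 4-cycle the exact treewidth 2; on large real-world graphs it only gives an upper bound, to be paired with a separate lower-bound estimate.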
"An Experimental Study of the Treewidth of Real-World Graph Data (Extended Version)". Database theory (ICDT): International Conference on Database Theory proceedings, vol. 32(1), pp. 12:1-12:18.