Data Path Queries over Embedded Graph Databases
Diego Figueira, Artur Jeż, A. Lin. DOI: 10.1145/3517804.3524159

This paper initiates the study of data-path query languages (in particular, regular data path queries (RDPQ) and conjunctive RDPQ (CRDPQ)) in the classic setting of embedded finite model theory, wherein each graph is "embedded" into a background infinite structure (with a decidable FO theory or fragments thereof). Our goal is to address the current lack of support for typed attribute data (e.g., integer arithmetic) in existing data-path query languages, which is crucial in practice. We propose an extension of register automata that allows powerful constraints over the theory and the database as guards, and that has two types of registers: registers that can store values from the active domain, and read-only registers that can store arbitrary values. We prove NL data complexity for (C)RDPQ over Presburger arithmetic, the real-closed field, the existential theory of automatic structures, and word equations with regular constraints. All these results strictly extend the known NL data complexity of RDPQ with only equality comparisons and provide an answer to a recent open problem posed by Libkin et al. Among other contributions, we introduce a crucial proof technique for obtaining NL data complexity for data path queries over embedded graph databases, called "Restricted Register Collapse (RRC)", inspired by the notion of Restricted Quantifier Collapse (RQC) in embedded finite model theory.
LACE: A Logical Approach to Collective Entity Resolution
Meghyn Bienvenu, Gianluca Cima, Víctor Gutiérrez-Basulto. DOI: 10.1145/3517804.3526233

In this paper, we revisit the problem of entity resolution and propose a novel, logical framework, LACE, which mixes declarative and procedural elements to achieve a number of desirable properties. Our approach is fundamentally declarative in nature: it utilizes hard and soft rules to specify conditions under which pairs of entity references must or may be merged, together with denial constraints that enforce consistency of the resulting instance. Importantly, however, rule bodies are evaluated on the instance resulting from applying the already 'derived' merges. It is the dynamic nature of our semantics that enables us to capture collective entity resolution scenarios, where merges can trigger further merges, while at the same time ensuring that every merge can be justified. As the denial constraints restrict which merges can be performed together, we obtain a space of (maximal) solutions, from which we can naturally define notions of certain and possible merges and query answers. We explore the computational properties of our framework and determine the precise computational complexity of the relevant decision problems. Furthermore, as a first step towards implementing our approach, we demonstrate how we can encode the various reasoning tasks using answer set programming.
Query Evaluation by Circuits
Yilei Wang, K. Yi. DOI: 10.1145/3517804.3524142

In addition to its theoretical interest, computing with circuits has found applications in many other areas such as secure multi-party computation and outsourced query processing. Yet, the exact circuit complexity of query evaluation had remained an unexplored topic. In this paper, we present circuit constructions for conjunctive queries under degree constraints. These circuits have polylogarithmic depth and their sizes match the polymatroid bound up to polylogarithmic factors. We also propose a definition of output-sensitive circuit families and obtain such circuits with sizes matching their RAM counterparts.
Optimal Algorithms for Multiway Search on Partial Orders
Shangqi Lu, W. Martens, Matthias Niewerth, Yufei Tao. DOI: 10.1145/3517804.3524150

We study partial order multiway search (POMS), which is a game between an algorithm A and an oracle, played on a directed acyclic graph G known to both parties. First, the oracle picks a vertex t in G called the target. Then, A needs to figure out which vertex is t by probing reachability. Specifically, in each probe, A selects a set Q of vertices in G whose size is bounded by a (pre-agreed) limit; the oracle reveals, for each vertex q ∈ Q, whether q can reach the target in G. The objective of A is to minimize the number of probes. This problem finds use in crowdsourcing, distributed file systems, software testing, etc. We describe an algorithm to solve POMS in O(log_{1+k} n + (d/k) · log_{1+d} n) probes, where n is the number of vertices in G, k is the maximum permissible |Q|, and d is the largest out-degree of the vertices in G. We further establish the algorithm's asymptotic optimality by proving a matching lower bound. We also introduce a variant of POMS in the external memory (EM) computation model, which is the key to a black-box approach for converting a class of pointer-machine structures to their I/O-efficient counterparts. In the EM version of POMS, A is allowed to pre-compute a (disk-based) structure on G and is then required to clear its memory. The oracle (as before) picks a target t. A still needs to find t by issuing probes, except that the set Q in each probe must be read from the disk. The objective of A is now to minimize the number of I/Os. We present a structure that uses O(n/B) space and guarantees discovering the target in O(log_B n + (d/B) · log_{1+d} n) I/Os, where B is the block size, and n and d are as defined earlier. We establish the structure's asymptotic optimality by proving that any structure demands Ω(log_B n + (d/B) · log_{1+d} n) I/Os to find the target in the worst case, regardless of its space consumption.
{"title":"Optimal Algorithms for Multiway Search on Partial Orders","authors":"Shangqi Lu, W. Martens, Matthias Niewerth, Yufei Tao","doi":"10.1145/3517804.3524150","DOIUrl":"https://doi.org/10.1145/3517804.3524150","url":null,"abstract":"We study partial order multiway search (POMS), which is a game between an algorithm A and an oracle, played on a directed acyclic graph G known to both parties. First, the oracle picks a vertex t in G called the target. Then, A needs to figure out which vertex is t by probing reachability. Specifically, in each probe, A selects a set Q of vertices in G whose size is bounded by a (pre-agreed) limit; the oracle reveals, for each vertex q ∈ Q, whether q can reach the target in G. The objective of A is to minimize the number of probes. This problem finds use in crowdsourcing, distributed file systems, software testing, etc. We describe an algorithm to solve POMS in O(log1+k n + d/k log1+dn) probes, where n is the number of vertices in G, k is the maximum permissible |Q|, and d is the largest out-degree of the vertices in G. We further establish the algorithm's asymptotic optimality by proving a matching lower bound. We also introduce a variant of POMS in the external memory (EM) computation model, which is the key to a black-box approach for converting a class of pointer-machine structures to their I/O-efficient counterparts. In the EM version of POMS, A is allowed to pre-compute a (disk-based) structure on G and is then required to clear its memory. The oracle (as before) picks a target t. A still needs to find t by issuing probes, except that the set Q in each probe must be read from the disk. The objective of A is now to minimize the number of I/Os. We present a structure that uses O(n/B) space and guarantees discovering the target in O(logB n + d/B log1+dn) I/Os where B is the block size, and n and d are as defined earlier. We establish the structure's asymptotic optimality by proving that any structure demands Ω(log_B n + d/B log1+d n) I/Os to find the target in the worst case regardless of the space consumption.","PeriodicalId":230606,"journal":{"name":"Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122454583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
2022 ACM PODS Alberto O. Mendelzon Test-of-Time Award
Michael Bender, Michael Benedikt, Sudeepa Roy. DOI: 10.1145/3517804.3526070

Citation. This paper took research on a fundamental problem in database research, join query processing, in a new direction. Its motivation was the bound on join query size of Atserias, Grohe, and Marx, now known as the AGM bound (FOCS 2008). This raised the question of whether a join algorithm can achieve a worst-case running time in line with this bound. This paper presents an algorithm that achieves this bound, while showing that traditional query plans cannot achieve it. In the process, the authors connect join processing questions with geometric inequalities, a connection that has proven fruitful in subsequent work. The algorithmic contribution of the paper resonated within database applications almost immediately, when it was observed that a join algorithm recently implemented in industry, Leapfrog Triejoin, achieves a similar optimality guarantee. This led to a line of papers and implementations of join algorithms building on the ideas in the paper. The contribution of the paper to the analysis of join queries has arguably been even more profound: the connections between join query processing, geometric inequalities, and worst-case size bounds have subsequently been explored in many other contexts, including in the presence of integrity constraints. This work has already been honored with a "Gems of PODS" talk at PODS 2018; the conference paper, the journal version in JACM, and the SIGMOD Record survey article discussing later developments are all highly cited. This underlines the fact that the paper represented a major departure point for research in database theory.
Robustness Against Read Committed: A Free Transactional Lunch
Brecht Vandevoort, Bas Ketsman, Christoph E. Koch, F. Neven. DOI: 10.1145/3517804.3524162

Transaction processing is a central part of most database applications. While serializability remains the gold standard for desirable transactional semantics, many database systems offer improved transaction throughput at the expense of introducing potential anomalies through the choice of a lower isolation level. Transactions are often not arbitrary but are constrained by a set of transaction programs defined at the application level (as is the case for TPC-C, for instance), implying that not every potential anomaly can effectively be realized. The question central to this paper is the following: when, within the context of specific transaction programs, do isolation levels weaker than serializability provide the same guarantees as serializability? We refer to the latter as the robustness problem. This paper surveys recent results on robustness testing against (multiversion) read committed, focusing on complete rather than merely sufficient conditions. We show how to lift robustness testing to transaction templates as well as to programs to increase practical applicability. We discuss open questions and highlight promising directions for future research.
{"title":"Robustness Against Read Committed: A Free Transactional Lunch","authors":"Brecht Vandevoort, Bas Ketsman, Christoph E. Koch, F. Neven","doi":"10.1145/3517804.3524162","DOIUrl":"https://doi.org/10.1145/3517804.3524162","url":null,"abstract":"Transaction processing is a central part of most database applications. While serializability remains the gold standard for desirable transactional semantics, many database systems offer improved transaction throughput at the expense of introducing potential anomalies through the choice of a lower isolation level. Transactions are often not arbitrary but are constrained by a set of transaction programs defined at the application level (as is the case for TPC-C for instance), implying that not every potential anomaly can effectively be realized. The question central to this paper is the following: when - within the context of specific transaction programs - do isolation levels weaker than serializability, provide the same guarantees as serializability? We refer to the latter as the robustness problem. This paper surveys recent results on robustness testing against (multiversion) read committed focusing on complete rather than sufficient conditions. We show how to lift robustness testing to transaction templates as well as to programs to increase practical applicability. We discuss open questions and highlight promising directions for future research.","PeriodicalId":230606,"journal":{"name":"Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"452 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124490247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Gibbs-Rand Model
Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi. DOI: 10.1145/3517804.3526227

Due to its many applications, the clustering ensemble problem has been the subject of intense algorithmic study over the last two decades. The input to this problem is a set of clusterings; its goal is to output a clustering that minimizes the average distance to the input clusterings. In this paper, we propose, to the best of our knowledge, the first generative model for this problem. Our Gibbs-like model is parameterized by a center clustering and by a scale parameter; the probability of a particular clustering decays exponentially with its scaled Rand distance to the center clustering. For our new model, we give polynomial-time algorithms for sampling (when the center clustering has a constant number of clusters) and for reconstruction (when the scale parameter is small). En route, we establish several interesting properties of our model. Our work shows that the combinatorial structure of a Gibbs-like model for clusterings is more intricate and challenging than the corresponding and well-studied (Mallows) model for permutations.
The Complexity of Regular Trail and Simple Path Queries on Undirected Graphs
W. Martens, Tina Popp. DOI: 10.1145/3517804.3524149

We study the data complexity of regular trail and simple path queries on undirected graphs. Using techniques from structural graph theory, ranging from the graph minor theorem to group-labeled graphs, we are able to identify several tractable and intractable subclasses of the regular languages. In particular, we establish that trail evaluation for simple chain regular expressions, which are common in practice, is tractable, whereas simple path evaluation is tractable for a large subclass. The problem of fully classifying all regular languages is quite non-trivial, even on undirected graphs, since it subsumes an intriguing problem that has been open for 30 years.
Query Evaluation over SLP-Represented Document Databases with Complex Document Editing
Markus L. Schmid, Nicole Schweikardt. DOI: 10.1145/3517804.3524158

It is known that the query result of a regular spanner over a single document D can be enumerated after O(|D|) preprocessing and with constant delay in data complexity (Florenzano et al., ACM TODS 2020; Amarilli et al., ACM TODS 2021). It has been shown (Schmid and Schweikardt, PODS'21) that if the document is represented by a straight-line program (SLP) S, then enumeration is possible with a delay of O(log |D|), but with preprocessing that is linear in |S| (which, in the best case, is logarithmic in |D|). Hence, this compressed setting allows for spanner evaluation in sub-linear time, i.e., with logarithmic upper bounds for preprocessing and delay, if the document is highly compressible. In this work, we extend these results to the dynamic setting. We consider a document database DDB = D_1, D_2, ..., D_m that is represented by an SLP S_DDB and that supports regular spanners M_1, M_2, ..., M_k (meaning that we have data structures at our disposal that allow O(log |D_i|)-delay enumeration of the result of spanner M_j on document D_i). We can then perform an update by manipulating the existing documents of DDB through a sequence of text-editing operations commonly found in text editors (such as copying and pasting, deleting or copying factors, concatenating documents, etc.), and add the thus-constructed document to the database. Such an operation is called complex document editing and is given by an expression φ in a suitable algebra. Moreover, after this operation, the document database still supports all the regular spanners M_1, ..., M_k. The total time required for such an update is O(k |φ| log d), where d is the maximum length of any intermediate document constructed in the complex document editing described by φ. We stress the fact that the size |S_DDB| of the SLP (which upper-bounds the preprocessing in the static case) is potentially logarithmic in the data, but generally depends on the compressibility of the documents (in the worst case, it is even linear in the data). In contrast to that, we can guarantee that the dependency of our updates on the data is logarithmic regardless of the actual compression achieved by the SLP. In particular, any such update performed by complex document editing adds documents whose length may be exponentially larger than the time needed to perform the update. Our approach hinges on balancing properties of SLPs, and for our updates it is vital to manipulate the SLP that represents the database in such a way that these balancing properties are maintained.
When is the Evaluation of Extended CRPQ Tractable?
Diego Figueira, Varun Ramanathan. DOI: 10.1145/3517804.3524167

We investigate the complexity of the evaluation problem for ECRPQ: Conjunctive Regular Path Queries (CRPQ), extended with synchronous relations (aka regular or automatic). We give a characterization for the evaluation and parameterized evaluation problems of ECRPQ in terms of the underlying structure of queries. As we show, complexity can range between PSpace, NP, and polynomial time for the evaluation problem, and between XNL, W[1], and FPT for parameterized evaluation.