Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

The ACM PODS Alberto O. Mendelzon test-of-time award 2013

Pub Date : 2013-06-22 DOI: 10.1145/2463664.2494090

Michael Benedikt, T. Milo, D. V. Gucht

Mendelzon was an international leader in database theory, whose pioneering and fundamental work has inspired and influenced both database theoreticians and practitioners, and continues to be applied in a variety of advanced settings. He served the database community in many ways; in particular, he served as the General Chair of the PODS conference, and was instrumental in bringing together the PODS and SIGMOD conferences. He also was an outstanding educator, who guided the research of numerous doctoral students and postdoctoral fellows. The Award is to be awarded each year to a paper or a small number of papers published in the PODS proceedings ten years prior, that had the most impact (in terms of research, methodology, or transfer of practice) over the intervening decade. The decision was approved by SIGMOD and ACM. The funds for the Award were contributed by IBM Toronto.

Citations: 0

Sketching via hashing: from heavy hitters to compressed sensing to sparse Fourier transform

Pub Date : 2013-06-22 DOI: 10.1145/2463664.2465217

P. Indyk

Sketching via hashing is a popular and useful method for processing large data sets. Its basic idea is as follows. Suppose that we have a large multi-set of elements S = {a_1, ..., a_s} ⊂ {1, ..., n}, and we would like to identify the elements that occur “frequently” in S. The algorithm starts by selecting a hash function h that maps the elements into an array c[1 ... m]. The array entries are initialized to 0. Then, for each element a ∈ S, the algorithm increments c[h(a)]. At the end of the process, each array entry c[j] contains the count of all data elements a ∈ S mapped to j. It can be observed that if an element a occurs frequently enough in the data set S, then the value of the counter c[h(a)] must be large. That is, “frequent” elements are mapped to “heavy” buckets. By identifying the elements mapped to heavy buckets and repeating the process several times, one can efficiently recover the frequent elements, possibly together with a few extra ones (false positives).
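
The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash family in `make_hash`, the function names, and the parameters `m`, `threshold` and `repetitions` are hypothetical choices, not taken from the talk. Because every counter c[h(a)] is at least the true count of a, a truly frequent element is never lost; infrequent elements that happen to share heavy buckets in every repetition show up as false positives.

```python
import random

def make_hash(m, seed):
    """Return an illustrative hash function from integers to buckets 0..m-1."""
    rng = random.Random(seed)
    p = 2_147_483_647  # the prime 2^31 - 1, used as the modulus
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda x: ((a * x + b) % p) % m

def heavy_candidates(stream, universe, m, threshold, repetitions=3):
    """Elements of `universe` that land in a heavy bucket in every repetition."""
    candidates = set(universe)
    for r in range(repetitions):
        h = make_hash(m, seed=r)
        c = [0] * m                  # array entries are initialised to 0
        for a in stream:
            c[h(a)] += 1             # increment c[h(a)] for each element
        # keep only the elements mapped to a bucket with a large count
        candidates = {x for x in candidates if c[h(x)] >= threshold}
    return candidates

# Element 7 occurs 51 times; every other element of 1..100 occurs once.
stream = [7] * 50 + list(range(1, 101))
result = heavy_candidates(stream, range(1, 101), m=16, threshold=50)
```

Repeating with independent hash functions is what shrinks the set of false positives: an infrequent element must collide with a heavy bucket every time to survive.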

Citations: 11

The complexity of mining maximal frequent subgraphs

Pub Date : 2013-06-22 DOI: 10.1145/2463664.2465222

B. Kimelfeld, Phokion G. Kolaitis

A frequent subgraph of a given collection of graphs is a graph that is isomorphic to a subgraph of at least as many graphs in the collection as a given threshold. Frequent subgraphs generalize frequent itemsets and arise in various contexts, from bioinformatics to the Web. Since the space of frequent subgraphs is typically extremely large, research in graph mining has focused on special types of frequent subgraphs that can be orders of magnitude smaller in number, yet encapsulate the space of all frequent subgraphs. Maximal frequent subgraphs (i.e., the ones not properly contained in any frequent subgraph) constitute the most useful such type. In this paper, we embark on a comprehensive investigation of the computational complexity of mining maximal frequent subgraphs. Our study is carried out by considering the effect of three different parameters: possible restrictions on the class of graphs; a fixed bound on the threshold; and a fixed bound on the number of desired answers. We focus on specific classes of connected graphs: general graphs, planar graphs, graphs of bounded degree, and graphs of bounded tree-width (trees being a special case). Moreover, each class has two variants: the one in which the nodes are unlabeled, and the one in which they are uniquely labeled. We delineate the complexity of the enumeration problem for each of these variants by determining when it is solvable in (total or incremental) polynomial time and when it is NP-hard. Specifically, for the labeled classes, we show that bounding the threshold yields tractability but, in most cases, bounding the number of answers does not, unless P=NP; an exception is the case of labeled trees, where bounding either of these two parameters yields tractability. The state of affairs turns out to be quite different for the unlabeled classes. The main (and most challenging to prove) result concerns unlabeled trees: we show NP-hardness, even if the input consists of two trees, and both the threshold and the number of desired answers are equal to just two. In other words, we establish that the following problem is NP-complete: given two unlabeled trees, do they have more than one maximal subtree in common?
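
Since the abstract notes that frequent subgraphs generalize frequent itemsets, the notions of "frequent" and "maximal" can be illustrated on the simpler itemset case. The following brute-force sketch is not the paper's algorithm and ignores graph structure entirely; the function name and the toy collection are hypothetical, and the enumeration is exponential in the number of items.

```python
from itertools import combinations

def maximal_frequent_itemsets(collection, threshold):
    """Itemsets contained in at least `threshold` sets and in no larger frequent itemset."""
    items = sorted(set().union(*collection))
    frequent = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            # support = number of sets in the collection containing the candidate
            support = sum(1 for transaction in collection if candidate <= transaction)
            if support >= threshold:
                frequent.append(candidate)
    # maximal: not properly contained in any other frequent itemset
    return [s for s in frequent if not any(s < t for t in frequent)]

collection = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
maximal = maximal_frequent_itemsets(collection, threshold=2)
```

Here each single item is frequent, but none is maximal because every one of them is contained in a frequent pair; the full set {a, b, c} occurs only once and so is not frequent.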

Citations: 22

Verification of database-driven systems via amalgamation

Pub Date : 2013-06-22 DOI: 10.1145/2463664.2465228

M. Bojanczyk, L. Segoufin, Szymon Toruńczyk

We describe a general framework for static verification of systems that base their decisions upon queries to databases. The database is specified using constraints, typically a schema, and is not modified during a run of the system. The system is equipped with a finite number of registers for storing intermediate information from the database and the specification consists of a transition table described using quantifier-free formulas that can query either the database or the registers. Our main result concerns systems querying XML databases -- modeled as data trees -- using quantifier-free formulas with predicates such as the descendant axis or comparison of data values. In this scenario we show an ExpSpace algorithm for deciding reachability. Our technique is based on the notion of amalgamation and is quite general. For instance it also applies to relational databases (with an optimal PSpace algorithm). We also show that minor extensions of the model lead to undecidability.

Citations: 24

Flag & check: data access with monadically defined queries

Pub Date : 2013-06-22 DOI: 10.1145/2463664.2465227

S. Rudolph, M. Krötzsch

We introduce monadically defined queries (MODEQs) and nested monadically defined queries (NEMODEQs), two querying formalisms that extend conjunctive queries, conjunctive two-way regular path queries, and monadic Datalog queries. Both can be expressed as Datalog queries and in monadic second-order logic, yet they have a decidable query containment problem and favorable query answering complexities: a data complexity of P, and a combined complexity of NP (MODEQs) and PSpace (NEMODEQs). We show that (NE)MODEQ answering remains decidable in the presence of a well-known generic class of tuple-generating dependencies. In addition, techniques to rewrite queries under dependencies into (NE)MODEQs are introduced. Rewriting can be applied partially, and (NE)MODEQ answering is still decidable if the non-rewritable part of the TGDs permits decidable (NE)MODEQ answering on other grounds.

Citations: 46

A dichotomy in the intensional expressive power of nested relational calculi augmented with aggregate functions and a powerset operator

Pub Date : 2013-06-22 DOI: 10.1145/2463664.2463670

L. Wong

The extensional aspect of expressive power---i.e., what queries can or cannot be expressed---has been the subject of many studies of query languages. Paradoxically, although efficiency is of primary concern in computer science, the intensional aspect of expressive power---i.e., what queries can or cannot be implemented efficiently---has been much neglected. Here, we discuss the intensional expressive power of NRC(Q, +, ·, −, ÷, Σ, powerset), a nested relational calculus augmented with aggregate functions and a powerset operation. We show that queries on structures such as long chains, deep trees, etc. have a dichotomous behaviour: Either they are already expressible in the calculus without using the powerset operation or they require at least exponential space. This result generalizes in three significant ways several old dichotomy-like results, such as that of Suciu and Paredaens that the complex object algebra of Abiteboul and Beeri needs exponential space to implement the transitive closure of a long chain. Firstly, a more expressive query language---in particular, one that captures SQL---is considered here. Secondly, queries on a more general class of structures than a long chain are considered here. Lastly, our proof is more general and holds for all query languages exhibiting a certain normal form and possessing a locality property.

Citations: 3

Collaborative data-driven workflows: think global, act local

Pub Date : 2013-06-22 DOI: 10.1145/2463664.2463672

S. Abiteboul, V. Vianu

We introduce and study a model of collaborative data-driven workflows. In a local-as-view style, each peer has a partial view of a global instance that remains purely virtual. Local updates have side effects on other peers' data, defined via the global instance. We also assume that the peers provide (an abstraction of) their specifications, so that each peer can actually see and reason on the specification of the entire system. We study the ability of a peer to carry out runtime reasoning about the global run of the system, and in particular about actions of other peers, based on its own local observations. A main contribution is to show that, under a reasonable restriction (namely, key-visibility), one can construct a finite symbolic representation of the infinite set of global runs consistent with given local observations. Using the symbolic representation, we show that we can evaluate in PSPACE a large class of properties over global runs, expressed in an extension of first-order logic with past linear-time temporal operators, PLTL-FO. We also provide a variant of the algorithm allowing to incrementally monitor a statically defined property, and then develop an extension allowing to monitor an infinite class of properties sharing the same temporal structure, defined dynamically as the run unfolds. Finally, we consider an extension of the language, that permits workflow control with PLTL-FO formulas. We prove that this does not increase the power of the workflow specification language, thereby showing that the language is closed under such introspective reasoning.

Citations: 15

Spanners: a formal framework for information extraction

Pub Date : 2013-06-22 DOI: 10.1145/2463664.2463665

Ronald Fagin, B. Kimelfeld, Frederick Reiss, Stijn Vansummeren

An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this paper, we develop a foundational framework where the central construct is what we call a spanner. A spanner maps an input string into relations over the spans (intervals specified by bounding indices) of the string. The focus of this paper is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a primitive representation extract relations directly from the input string; those defined in an algebra apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables. We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the regular spanners---the closure of the first kind under standard relational operators. The core spanners extend the regular ones by string-equality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature.
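
To make the idea of a spanner concrete, here is a minimal sketch of a one-variable primitive spanner together with a string-equality selection. It is an illustration only: the function names are hypothetical, spans are written as 0-based half-open Python slice bounds rather than in the paper's notation, and matching every substring with `re.fullmatch` is a brute-force stand-in for the regular expressions with capture variables discussed above.

```python
import re

def spans_matching(pattern, text):
    """All spans (i, j) such that text[i:j] matches `pattern` in full."""
    compiled = re.compile(pattern)
    return {(i, j)
            for i in range(len(text) + 1)
            for j in range(i, len(text) + 1)
            if compiled.fullmatch(text[i:j])}

def string_equality_selection(text, pairs):
    """Keep the pairs of spans that cover equal substrings of `text`."""
    return {(x, y) for (x, y) in pairs
            if text[x[0]:x[1]] == text[y[0]:y[1]]}

text = "ab ab"
words = spans_matching(r"[a-z]+", text)               # a unary relation over spans
pairs = {(x, y) for x in words for y in words if x < y}
repeats = string_equality_selection(text, pairs)      # spans with the same content
```

The first step plays the role of a primitive representation (a relation extracted directly from the string); the product and selection that follow play the role of the algebra applied on top of it.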

Citations: 24

When is naive evaluation possible?

Pub Date : 2013-06-22 DOI: 10.1145/2463664.2463674

Amélie Gheerbrant, L. Libkin, C. Sirangelo

The term naive evaluation refers to evaluating queries over incomplete databases as if nulls were usual data values, i.e., to using the standard database query evaluation engine. Since the semantics of query answering over incomplete databases is that of certain answers, we would like to know when naive evaluation computes them: i.e., when certain answers can be found without inventing new specialized algorithms. For relational databases it is well known that unions of conjunctive queries possess this desirable property, and results on preservation of formulae under homomorphisms tell us that within relational calculus, this class cannot be extended under the open-world assumption. Our goal here is twofold. First, we develop a general framework that allows us to determine, for a given semantics of incompleteness, classes of queries for which naive evaluation computes certain answers. Second, we apply this approach to a variety of semantics, showing that for many classes of queries beyond unions of conjunctive queries, naive evaluation makes perfect sense under assumptions different from open-world. Our key observations are: (1) naive evaluation is equivalent to monotonicity of queries with respect to a semantics-induced ordering, and (2) for most reasonable semantics, such monotonicity is captured by preservation under various types of homomorphisms. Using these results we find classes of queries for which naive evaluation works, e.g., positive first-order formulae for the closed-world semantics. Even more, we introduce a general relation-based framework for defining semantics of incompleteness, show how it can be used to capture many known semantics and to introduce new ones, and describe classes of first-order queries for which naive evaluation works under such semantics.
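
The gap between naive evaluation and certain answers can be seen on a toy database with one null. The sketch below is an assumption-laden illustration, not the paper's framework: the `Null` class, the function names, and the choice to interpret nulls over a small fixed pool of constants (a closed-world style reading) are all hypothetical. Certain answers are computed by brute force as the intersection of the query results over every valuation of the nulls.

```python
from itertools import product

class Null:
    """A marked null; each object is equal only to itself."""
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"null_{self.name}"

def naive_answers(query, db):
    """Evaluate treating nulls as ordinary values, then keep null-free tuples."""
    return {t for t in query(db) if not any(isinstance(v, Null) for v in t)}

def certain_answers(query, db, constants):
    """Intersect the answers over all valuations of nulls into `constants`."""
    nulls = sorted({v for rel in db.values() for t in rel for v in t
                    if isinstance(v, Null)}, key=lambda n: n.name)
    result = None
    for values in product(constants, repeat=len(nulls)):
        valuation = dict(zip(nulls, values))
        complete = {name: {tuple(valuation.get(v, v) for v in t) for t in rel}
                    for name, rel in db.items()}
        answers = query(complete)
        result = answers if result is None else result & answers
    return result

n = Null("1")
db = {"R": {(1, n)}, "S": {(1, 2)}}
project_r = lambda d: {(x,) for (x, y) in d["R"]}              # a conjunctive query
project_diff = lambda d: {(x,) for (x, y) in d["R"] - d["S"]}  # uses difference
```

For the conjunctive query the two methods agree. For the query with difference, naive evaluation returns 1, but the valuation sending the null to 2 makes R and S equal, so 1 is not a certain answer.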

Citations: 11

TriAL for RDF: adapting graph query languages for RDF data

Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

Pub Date : 2013-06-22 DOI: 10.1145/2463664.2465226

L. Libkin, Juan L. Reutter, D. Vrgoc

Querying RDF data is viewed as one of the main applications of graph query languages, and yet the standard model of graph databases -- essentially labeled graphs -- is different from the triples-based model of RDF. While encodings of RDF databases into graph data exist, we show that even the most natural ones are bound to lose some functionality when used in conjunction with graph query languages. The solution is to work directly with triples, but then many properties taken for granted in the graph database context (e.g., reachability) lose their natural meaning. Our goal is to introduce languages that work directly over triples and are closed, i.e., they produce sets of triples, rather than graphs. Our basic language is called TriAL, or Triple Algebra: it guarantees closure properties by replacing the product with a family of join operations. We extend TriAL with recursion, and explain why such an extension is more intricate for triples than for graphs. We present a declarative language, namely a fragment of datalog, capturing the recursive algebra. For both languages, the combined complexity of query evaluation is given by low-degree polynomials. We compare our languages with relational languages, such as finite-variable logics, and previously studied graph query languages such as adaptations of XPath, regular path queries, and nested regular expressions; many of these languages are subsumed by the recursive triple algebra. We also provide examples of the usefulness of TriAL in querying graph and RDF data.
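The closure property described in the abstract (queries over triples that again return triples) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the authors' formal definition: the names `triple_join` and `right_closure`, the 0-to-5 index convention for the six joined positions, and the sample data are all hypothetical. The join keeps three of the six positions of a matching pair of triples, and the recursive operator repeats that join until no new triples appear.

```python
from typing import Callable, Set, Tuple

Triple = Tuple[str, str, str]
# Hypothetical convention: positions 0-2 index the left triple, 3-5 the right one.
Out = Tuple[int, int, int]
Cond = Callable[[Triple, Triple], bool]


def triple_join(left: Set[Triple], right: Set[Triple], out: Out, cond: Cond) -> Set[Triple]:
    """Join two sets of triples and keep three of the six joined positions.

    Because exactly three positions are kept, the result is again a set of
    triples, so the operation is closed.
    """
    result: Set[Triple] = set()
    for l in left:
        for r in right:
            if cond(l, r):
                six = l + r  # concatenated 6-tuple (s, p, o, s', p', o')
                result.add((six[out[0]], six[out[1]], six[out[2]]))
    return result


def right_closure(base: Set[Triple], out: Out, cond: Cond) -> Set[Triple]:
    """Repeat the join with `base` on the right until a fixpoint is reached."""
    result = set(base)
    frontier = set(base)
    while frontier:
        # Only newly derived triples need to be joined again.
        new = triple_join(frontier, base, out, cond) - result
        result |= new
        frontier = new
    return result


# Sample data (assumed): a chain of "knows" edges stored as triples.
edges: Set[Triple] = {("a", "knows", "b"), ("b", "knows", "c"), ("c", "knows", "d")}

# Reachability along the same predicate: match left object with right subject,
# require equal predicates, and output (left subject, left predicate, right object).
reach = right_closure(
    edges,
    out=(0, 1, 5),
    cond=lambda l, r: l[2] == r[0] and l[1] == r[1],
)
```

With this sample data, `reach` contains the three original triples plus the derived ones such as `("a", "knows", "d")`. Choosing a different `out` or `cond` gives other joins over the same data, and the output always remains a set of triples.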



Citations: 53