Barzan Mozafari, Kai Zeng, Loris D'Antoni, C. Zaniolo
While Complex Event Processing (CEP) constitutes a considerable portion of the so-called Big Data analytics, current CEP systems can only process data having a simple structure, and are otherwise limited in their ability to efficiently support complex continuous queries on structured or semistructured information. However, XML-like streams represent a very popular form of data exchange, comprising large portions of social network and RSS feeds, financial feeds, configuration files, and similar applications requiring advanced CEP queries. In this article, we present the XSeq language and system that support CEP on XML streams, via an extension of XPath that is both powerful and amenable to an efficient implementation. Specifically, the XSeq language extends XPath with natural operators to express sequential and Kleene-* patterns over XML streams, while remaining highly amenable to efficient execution. In fact, XSeq is designed to take full advantage of the recently proposed Visibly Pushdown Automata (VPA), where higher expressive power can be achieved without compromising the computationally attractive properties of finite state automata. Besides the efficiency and expressivity benefits, the choice of VPA as the underlying model also enables XSeq to go beyond XML streams and be easily applicable to any data with both sequential and hierarchical structures, including JSON messages, RNA sequences, and software traces. Therefore, we illustrate XSeq's power for CEP applications through examples from different domains and provide formal results on its expressiveness and complexity. Finally, we present several optimization techniques for XSeq queries. Our extensive experiments indicate that XSeq brings outstanding performance to CEP applications: an improvement of two orders of magnitude is obtained over the same queries executed in general-purpose XML engines.
{"title":"High-performance complex event processing over hierarchical data","authors":"Barzan Mozafari, Kai Zeng, Loris D'antoni, C. Zaniolo","doi":"10.1145/2536779","DOIUrl":"https://doi.org/10.1145/2536779","url":null,"abstract":"While Complex Event Processing (CEP) constitutes a considerable portion of the so-called Big Data analytics, current CEP systems can only process data having a simple structure, and are otherwise limited in their ability to efficiently support complex continuous queries on structured or semistructured information. However, XML-like streams represent a very popular form of data exchange, comprising large portions of social network and RSS feeds, financial feeds, configuration files, and similar applications requiring advanced CEP queries. In this article, we present the XSeq language and system that support CEP on XML streams, via an extension of XPath that is both powerful and amenable to an efficient implementation. Specifically, the XSeq language extends XPath with natural operators to express sequential and Kleene-* patterns over XML streams, while remaining highly amenable to efficient execution. In fact, XSeq is designed to take full advantage of the recently proposed Visibly Pushdown Automata (VPA), where higher expressive power can be achieved without compromising the computationally attractive properties of finite state automata. Besides the efficiency and expressivity benefits, the choice of VPA as the underlying model also enables XSeq to go beyond XML streams and be easily applicable to any data with both sequential and hierarchical structures, including JSON messages, RNA sequences, and software traces. Therefore, we illustrate the XSeq's power for CEP applications through examples from different domains and provide formal results on its expressiveness and complexity. Finally, we present several optimization techniques for XSeq queries. Our extensive experiments indicate that XSeq brings outstanding performance to CEP applications: two orders of magnitude improvement is obtained over the same queries executed in general-purpose XML engines.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"11 1","pages":"21"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75550230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrés Letelier, Jorge Pérez, R. Pichler, Sebastian Skritek
Static analysis is a fundamental task in query optimization. In this article we study static analysis and optimization techniques for SPARQL, which is the standard language for querying Semantic Web data. Of particular interest for us is the optionality feature in SPARQL. It is crucial in Semantic Web data management, where data sources are inherently incomplete and the user is usually interested in partial answers to queries. This feature is one of the most complicated constructors in SPARQL and also the one that makes this language depart from classical query languages such as relational conjunctive queries. We focus on the class of well-designed SPARQL queries, which has been proposed in the literature as a fragment of the language with good properties regarding query evaluation. We first propose a tree representation for SPARQL queries, called pattern trees, which captures the class of well-designed SPARQL graph patterns. Among other results, we propose several rules that can be used to transform pattern trees into a simple normal form, and study equivalence and containment. We also study the evaluation and enumeration problems for this class of queries.
{"title":"Static analysis and optimization of semantic web queries","authors":"Andrés Letelier, Jorge Pérez, R. Pichler, Sebastian Skritek","doi":"10.1145/2500130","DOIUrl":"https://doi.org/10.1145/2500130","url":null,"abstract":"Static analysis is a fundamental task in query optimization. In this article we study static analysis and optimization techniques for SPARQL, which is the standard language for querying Semantic Web data. Of particular interest for us is the optionality feature in SPARQL. It is crucial in Semantic Web data management, where data sources are inherently incomplete and the user is usually interested in partial answers to queries. This feature is one of the most complicated constructors in SPARQL and also the one that makes this language depart from classical query languages such as relational conjunctive queries. We focus on the class of well-designed SPARQL queries, which has been proposed in the literature as a fragment of the language with good properties regarding query evaluation. We first propose a tree representation for SPARQL queries, called pattern trees, which captures the class of well-designed SPARQL graph patterns. Among other results, we propose several rules that can be used to transform pattern trees into a simple normal form, and study equivalence and containment. We also study the evaluation and enumeration problems for this class of queries.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"58 1","pages":"25"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78473638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Sarma, Hongrae Lee, Hector Gonzalez, J. Madhavan, A. Halevy
Large-scale map visualization systems play an increasingly important role in presenting geographic datasets to end-users. Since these datasets can be extremely large, a map rendering system often needs to select a small fraction of the data to visualize them in a limited space. This article addresses the fundamental challenge of thinning: determining appropriate samples of data to be shown on specific geographical regions and zoom levels. Other than the sheer scale of the data, the thinning problem is challenging because of a number of other reasons: (1) data can consist of complex geographical shapes, (2) rendering of data needs to satisfy certain constraints, such as data being preserved across zoom levels and adjacent regions, and (3) after satisfying the constraints, an optimal solution needs to be chosen based on objectives such as maximality, fairness, and importance of data. This article formally defines and presents a complete solution to the thinning problem. First, we express the problem as an integer programming formulation that efficiently solves thinning for desired objectives. Second, we present more efficient solutions for maximality, based on DFS traversal of a spatial tree. Third, we consider the common special case of point datasets, and present an even more efficient randomized algorithm. Fourth, we show that contiguous regions are tractable for a general version of maximality for which arbitrary regions are intractable. Fifth, we examine the structure of our integer programming formulation and show that for point datasets, our program is integral. Finally, we have implemented all techniques from this article in Google Maps [Google 2005] visualizations of fusion tables [Gonzalez et al. 2010], and we describe a set of experiments that demonstrate the trade-offs among the algorithms.
{"title":"Consistent thinning of large geographical data for map visualization","authors":"A. Sarma, Hongrae Lee, Hector Gonzalez, J. Madhavan, A. Halevy","doi":"10.1145/2539032.2539034","DOIUrl":"https://doi.org/10.1145/2539032.2539034","url":null,"abstract":"Large-scale map visualization systems play an increasingly important role in presenting geographic datasets to end-users. Since these datasets can be extremely large, a map rendering system often needs to select a small fraction of the data to visualize them in a limited space. This article addresses the fundamental challenge of thinning: determining appropriate samples of data to be shown on specific geographical regions and zoom levels. Other than the sheer scale of the data, the thinning problem is challenging because of a number of other reasons: (1) data can consist of complex geographical shapes, (2) rendering of data needs to satisfy certain constraints, such as data being preserved across zoom levels and adjacent regions, and (3) after satisfying the constraints, an optimal solution needs to be chosen based on objectives such as maximality, fairness, and importance of data.\u0000 This article formally defines and presents a complete solution to the thinning problem. First, we express the problem as an integer programming formulation that efficiently solves thinning for desired objectives. Second, we present more efficient solutions for maximality, based on DFS traversal of a spatial tree. Third, we consider the common special case of point datasets, and present an even more efficient randomized algorithm. Fourth, we show that contiguous regions are tractable for a general version of maximality for which arbitrary regions are intractable. Fifth, we examine the structure of our integer programming formulation and show that for point datasets, our program is integral. Finally, we have implemented all techniques from this article in Google Maps [Google 2005] visualizations of fusion tables [Gonzalez et al. 2010], and we describe a set of experiments that demonstrate the trade-offs among the algorithms.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"38 1","pages":"22"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78969822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Konrad, F. Magniez
We study the problem of validating XML documents of size N against general DTDs in the context of streaming algorithms. The starting point of this work is a well-known space lower bound. There are XML documents and DTDs for which p-pass streaming algorithms require Ω(N/p) space. We show that when allowing access to external memory, there is a deterministic streaming algorithm that solves this problem with memory space O(log^2 N), a constant number of auxiliary read/write streams, and O(log N) total number of passes on the XML document and auxiliary streams. An important intermediate step of this algorithm is the computation of the First-Child-Next-Sibling (FCNS) encoding of the initial XML document in a streaming fashion. We study this problem independently, and we also provide memory-efficient streaming algorithms for decoding an XML document given in its FCNS encoding. Furthermore, validating XML documents encoding binary trees against any DTD in the usual streaming model without external memory can be done with sublinear memory. There is a one-pass algorithm using O(√N log N) space, and a bidirectional two-pass algorithm using O(log^2 N) space which perform this task.
{"title":"Validating XML documents in the streaming model with external memory","authors":"C. Konrad, F. Magniez","doi":"10.1145/2504590","DOIUrl":"https://doi.org/10.1145/2504590","url":null,"abstract":"We study the problem of validating XML documents of size N against general DTDs in the context of streaming algorithms. The starting point of this work is a well-known space lower bound. There are XML documents and DTDs for which p-pass streaming algorithms require Ω(N/p) space. We show that when allowing access to external memory, there is a deterministic streaming algorithm that solves this problem with memory space &Order;(log2 N), a constant number of auxiliary read/write streams, and &Order;(log N) total number of passes on the XML document and auxiliary streams. An important intermediate step of this algorithm is the computation of the First-Child-Next-Sibling (FCNS) encoding of the initial XML document in a streaming fashion. We study this problem independently, and we also provide memory-efficient streaming algorithms for decoding an XML document given in its FCNS encoding. Furthermore, validating XML documents encoding binary trees against any DTD in the usual streaming model without external memory can be done with sublinear memory. There is a one-pass algorithm using &Order;(√N log N) space, and a bidirectional two-pass algorithm using &Order;(log2 N) space which perform this task.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"29 1","pages":"27"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89108634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Agarwal, Graham Cormode, Zengfeng Huang, J. M. Phillips, Zhewei Wei, K. Yi
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two datasets, there is a way to merge the two summaries into a single summary on the two datasets combined together, while preserving the error and size guarantees. This property means that the summaries can be merged in a way akin to other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the datasets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this article, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ϵ-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ϵ); for ϵ-approximate quantiles, there is a deterministic summary of size O((1/ϵ) log(ϵn)) that has a restricted form of mergeability, and a randomized one of size O((1/ϵ) log^{3/2}(1/ϵ)) with full mergeability. We also extend our results to geometric summaries such as ϵ-approximations which permit approximate multidimensional range counting queries. While most of the results in this article are theoretical in nature, some of the algorithms are actually very simple and even perform better than the previously best known algorithms, which we demonstrate through experiments in a simulated sensor network. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ϵ-approximate quantiles that depends only on ϵ, of size O((1/ϵ) log^{3/2}(1/ϵ)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
{"title":"Mergeable summaries","authors":"P. Agarwal, Graham Cormode, Zengfeng Huang, J. M. Phillips, Zhewei Wei, K. Yi","doi":"10.1145/2500128","DOIUrl":"https://doi.org/10.1145/2500128","url":null,"abstract":"We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two datasets, there is a way to merge the two summaries into a single summary on the two datasets combined together, while preserving the error and size guarantees. This property means that the summaries can be merged in a way akin to other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the datasets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this article, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ϵ-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ϵ); for ϵ-approximate quantiles, there is a deterministic summary of size O((1/ϵ) log(ϵ n)) that has a restricted form of mergeability, and a randomized one of size O((1/ϵ) log3/2(1/ϵ)) with full mergeability. We also extend our results to geometric summaries such as ϵ-approximations which permit approximate multidimensional range counting queries. While most of the results in this article are theoretical in nature, some of the algorithms are actually very simple and even perform better than the previously best known algorithms, which we demonstrate through experiments in a simulated sensor network.\u0000 We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ϵ-approximate quantiles that depends only on ϵ, of size O((1/ϵ) log3/2(1/ϵ)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"9 1","pages":"26"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87896232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Bonifati, M. Goodfellow, I. Manolescu, Domenica Sileo
Materialized views can bring important performance benefits when querying XML documents. In the presence of XML document changes, materialized views need to be updated to faithfully reflect the changed document. In this work, we present an algebraic approach for propagating source updates to XML materialized views expressed in a powerful XML tree pattern formalism. Our approach differs from the state-of-the-art in the area in two important ways. First, it relies on set-oriented, algebraic operations, to be contrasted with node-based previous approaches. Second, it exploits state-of-the-art features of XML stores and XML query evaluation engines, notably XML structural identifiers and associated structural join algorithms. We present algorithms for determining how updates should be propagated to views, and highlight the benefits of our approach over existing algorithms through a series of experiments.
{"title":"Algebraic incremental maintenance of XML views","authors":"A. Bonifati, M. Goodfellow, I. Manolescu, Domenica Sileo","doi":"10.1145/2508020.2508021","DOIUrl":"https://doi.org/10.1145/2508020.2508021","url":null,"abstract":"Materialized views can bring important performance benefits when querying XML documents. In the presence of XML document changes, materialized views need to be updated to faithfully reflect the changed document. In this work, we present an algebraic approach for propagating source updates to XML materialized views expressed in a powerful XML tree pattern formalism. Our approach differs from the state-of-the-art in the area in two important ways. First, it relies on set-oriented, algebraic operations, to be contrasted with node-based previous approaches. Second, it exploits state-of-the-art features of XML stores and XML query evaluation engines, notably XML structural identifiers and associated structural join algorithms. We present algorithms for determining how updates should be propagated to views, and highlight the benefits of our approach over existing algorithms through a series of experiments.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"11 1","pages":"14"},"PeriodicalIF":1.8,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76420974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenfei Fan, Xin Wang, Yinghui Wu
Graph pattern matching is commonly used in a variety of emerging applications such as social network analysis. These applications highlight the need for studying the following two issues. First, graph pattern matching is traditionally defined in terms of subgraph isomorphism or graph simulation. These notions, however, often impose too strong a topological constraint on graphs to identify meaningful matches. Second, in practice a graph is typically large, and is frequently updated with small changes. It is often prohibitively expensive to recompute matches starting from scratch via batch algorithms when the graph is updated. This article studies these two issues. (1) We propose to define graph pattern matching based on a notion of bounded simulation, which extends graph simulation by specifying the connectivity of nodes in a graph within a predefined number of hops. We show that bounded simulation is able to find sensible matches that the traditional matching notions fail to catch. We also show that matching via bounded simulation is in cubic time, by giving such an algorithm. (2) We provide an account of results on incremental graph pattern matching, for matching defined with graph simulation, bounded simulation, and subgraph isomorphism. We show that the incremental matching problem is unbounded, that is, its cost is not determined alone by the size of the changes in the input and output, for all these matching notions. Nonetheless, when matching is defined in terms of simulation or bounded simulation, incremental matching is semibounded, that is, its worst-time complexity is bounded by a polynomial in the size of the changes in the input, output, and auxiliary information that is necessarily maintained to reuse previous computation, and the size of graph patterns. We also develop incremental matching algorithms for graph simulation and bounded simulation, by minimizing unnecessary recomputation. In contrast, matching based on subgraph isomorphism is neither bounded nor semibounded. (3) We experimentally verify the effectiveness and efficiency of these algorithms, and show that: (a) the revised notion of graph pattern matching allows us to identify communities commonly found in real-life networks, and (b) the incremental algorithms substantially outperform their batch counterparts in response to small changes. These suggest a promising framework for real-life graph pattern matching.
{"title":"Incremental graph pattern matching","authors":"Wenfei Fan, Xin Wang, Yinghui Wu","doi":"10.1145/2489791","DOIUrl":"https://doi.org/10.1145/2489791","url":null,"abstract":"Graph pattern matching is commonly used in a variety of emerging applications such as social network analysis. These applications highlight the need for studying the following two issues. First, graph pattern matching is traditionally defined in terms of subgraph isomorphism or graph simulation. These notions, however, often impose too strong a topological constraint on graphs to identify meaningful matches. Second, in practice a graph is typically large, and is frequently updated with small changes. It is often prohibitively expensive to recompute matches starting from scratch via batch algorithms when the graph is updated.\u0000 This article studies these two issues. (1) We propose to define graph pattern matching based on a notion of bounded simulation, which extends graph simulation by specifying the connectivity of nodes in a graph within a predefined number of hops. We show that bounded simulation is able to find sensible matches that the traditional matching notions fail to catch. We also show that matching via bounded simulation is in cubic time, by giving such an algorithm. (2) We provide an account of results on incremental graph pattern matching, for matching defined with graph simulation, bounded simulation, and subgraph isomorphism. We show that the incremental matching problem is unbounded, that is, its cost is not determined alone by the size of the changes in the input and output, for all these matching notions. Nonetheless, when matching is defined in terms of simulation or bounded simulation, incremental matching is semibounded, that is, its worst-time complexity is bounded by a polynomial in the size of the changes in the input, output, and auxiliary information that is necessarily maintained to reuse previous computation, and the size of graph patterns. We also develop incremental matching algorithms for graph simulation and bounded simulation, by minimizing unnecessary recomputation. In contrast, matching based on subgraph isomorphism is neither bounded nor semibounded. (3) We experimentally verify the effectiveness and efficiency of these algorithms, and show that: (a) the revised notion of graph pattern matching allows us to identify communities commonly found in real-life networks, and (b) the incremental algorithms substantially outperform their batch counterparts in response to small changes. These suggest a promising framework for real-life graph pattern matching.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"8 1","pages":"18"},"PeriodicalIF":1.8,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85079316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Justin J. Levandoski, A. Eldawy, M. Mokbel, Mohamed E. Khalefa
Personalized database systems give users answers tailored to their personal preferences. While numerous preference evaluation methods for databases have been proposed (e.g., skyline, top-k, k-dominance, k-frequency), the implementation of these methods at the core of a database system is a double-edged sword. Core implementation provides efficient query processing for arbitrary database queries, however, this approach is not practical since each existing (and future) preference method requires implementation within the database engine. To solve this problem, this article introduces FlexPref, a framework for extensible preference evaluation in database systems. FlexPref, implemented in the query processor, aims to support a wide array of preference evaluation methods in a single extensible code base. Integration with FlexPref is simple, involving the registration of only three functions that capture the essence of the preference method. Once integrated, the preference method “lives” at the core of the database, enabling the efficient execution of preference queries involving common database operations. This article also provides a query optimization framework for FlexPref, as well as a theoretical framework that defines the properties a preference method must exhibit to be implemented in FlexPref. To demonstrate the extensibility of FlexPref, this article also provides case studies detailing the implementation of seven state-of-the-art preference evaluation methods within FlexPref. We also experimentally study the strengths and weaknesses of an implementation of FlexPref in PostgreSQL over a range of single-table and multitable preference queries.
{"title":"Flexible and extensible preference evaluation in database systems","authors":"Justin J. Levandoski, A. Eldawy, M. Mokbel, Mohamed E. Khalefa","doi":"10.1145/2493268","DOIUrl":"https://doi.org/10.1145/2493268","url":null,"abstract":"Personalized database systems give users answers tailored to their personal preferences. While numerous preference evaluation methods for databases have been proposed (e.g., skyline, top-k, k-dominance, k-frequency), the implementation of these methods at the core of a database system is a double-edged sword. Core implementation provides efficient query processing for arbitrary database queries, however, this approach is not practical since each existing (and future) preference method requires implementation within the database engine. To solve this problem, this article introduces FlexPref, a framework for extensible preference evaluation in database systems. FlexPref, implemented in the query processor, aims to support a wide array of preference evaluation methods in a single extensible code base. Integration with FlexPref is simple, involving the registration of only three functions that capture the essence of the preference method. Once integrated, the preference method “lives” at the core of the database, enabling the efficient execution of preference queries involving common database operations. This article also provides a query optimization framework for FlexPref, as well as a theoretical framework that defines the properties a preference method must exhibit to be implemented in FlexPref. To demonstrate the extensibility of FlexPref, this article also provides case studies detailing the implementation of seven state-of-the-art preference evaluation methods within FlexPref. We also experimentally study the strengths and weaknesses of an implementation of FlexPref in PostgreSQL over a range of single-table and multitable preference queries.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"40 1","pages":"17"},"PeriodicalIF":1.8,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89286713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Karvounarakis, Todd J. Green, Z. Ives, V. Tannen
Recent work [Ives et al. 2005] proposed a new class of systems for supporting data sharing among scientific and other collaborations: this new collaborative data sharing system connects heterogeneous logical peers using a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to incorporate related data from other peers as well. To achieve this, every peer's data and updates propagate along the mappings to the other peers. However, this operation, termed update exchange, is filtered by trust conditions—expressing what data and sources a peer judges to be authoritative—which may cause a peer to reject another's updates. In order to support such filtering, updates carry provenance information. This article develops methods for realizing such systems: we build upon techniques from data integration, data exchange, incremental view maintenance, and view update to propagate updates along mappings, both to derived and optionally to source instances. We incorporate a novel model for tracking data provenance, such that curators may filter updates based on trust conditions over this provenance. We implement our techniques in a layer above an off-the-shelf RDBMS, and we experimentally demonstrate the viability of these techniques in the Orchestra prototype system.
{"title":"Collaborative data sharing via update exchange and provenance","authors":"G. Karvounarakis, Todd J. Green, Z. Ives, V. Tannen","doi":"10.1145/2500127","DOIUrl":"https://doi.org/10.1145/2500127","url":null,"abstract":"Recent work [Ives et al. 2005] proposed a new class of systems for supporting data sharing among scientific and other collaborations: this new collaborative data sharing system connects heterogeneous logical peers using a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to incorporate related data from other peers as well. To achieve this, every peer's data and updates propagate along the mappings to the other peers. However, this operation, termed update exchange, is filtered by trust conditions—expressing what data and sources a peer judges to be authoritative—which may cause a peer to reject another's updates. In order to support such filtering, updates carry provenance information.\u0000 This article develops methods for realizing such systems: we build upon techniques from data integration, data exchange, incremental view maintenance, and view update to propagate updates along mappings, both to derived and optionally to source instances. We incorporate a novel model for tracking data provenance, such that curators may filter updates based on trust conditions over this provenance. We implement our techniques in a layer above an off-the-shelf RDBMS, and we experimentally demonstrate the viability of these techniques in the Orchestra prototype system.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"146 1","pages":"19"},"PeriodicalIF":1.8,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79948417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dario Colazzo, G. Ghelli, Luca Pardini, C. Sartiani
Type inclusion is a fundamental operation in every type-checking compiler, but it is quite expensive for XML manipulation languages. A polynomial inclusion checking algorithm for an expressive family of XML type languages is known, but it runs in quadratic time both in the best and in the worst cases. We present here an algorithm that has a linear-time backbone, and resorts to the quadratic approach for some specific parts of the compared types. Our experiments show that the new algorithm is much faster than the quadratic one, and that it typically runs in linear time, hence it can be used as a building block for a practical type-checking compiler.
{"title":"Almost-linear inclusion for XML regular expression types","authors":"Dario Colazzo, G. Ghelli, Luca Pardini, C. Sartiani","doi":"10.1145/2508020.2508022","DOIUrl":"https://doi.org/10.1145/2508020.2508022","url":null,"abstract":"Type inclusion is a fundamental operation in every type-checking compiler, but it is quite expensive for XML manipulation languages. A polynomial inclusion checking algorithm for an expressive family of XML type languages is known, but it runs in quadratic time both in the best and in the worst cases. We present here an algorithm that has a linear-time backbone, and resorts to the quadratic approach for some specific parts of the compared types. Our experiments show that the new algorithm is much faster than the quadratic one, and that it typically runs in linear time, hence it can be used as a building block for a practical type-checking compiler.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"33 1","pages":"15"},"PeriodicalIF":1.8,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89201595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}