Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems: latest articles, page 6

Theory of data stream computing: where to go

Pub Date : 2011-06-13 DOI: 10.1145/1989284.1989314

S. Muthukrishnan

Computing power has been growing steadily, as have communication rates and memory sizes. Simultaneously, our ability to create data has been growing phenomenally, and with it the need to analyze that data. We now have examples of massive data streams that are created at a far higher rate than we can capture and store in memory economically, gathered in far greater quantity than can be transported to central databases without overwhelming the communication infrastructure, and that arrive far faster than we can compute with them in a sophisticated way. This phenomenon has challenged how we store, communicate and compute with data. Theories developed over the past 50 years have relied on full capture, storage and communication of data. Instead, what we need for managing modern massive data streams are new methods built around working with less. The past 10 years have seen new theories emerge in computing (data stream algorithms), communication (compressed sensing), databases (data stream management systems) and other areas to address the challenges of massive data streams. Still, a lot remains open, and new applications of massive data streams have emerged recently. We present an overview of these challenges.

Pages : 317-319

Citations: 11

The complexity of text-preserving XML transformations

Pub Date : 2011-06-13 DOI: 10.1145/1989284.1989316

Timos Antonopoulos, W. Martens, F. Neven

While XML is nowadays adopted as the de facto standard for data exchange, historically, its predecessor SGML was invented for describing electronic documents, i.e., marked up text. Actually, today there are still large volumes of such XML texts. We consider simple transformations which can change the internal structure of documents, that is, the mark-up, and can filter out parts of the text but do not disrupt the ordering of the words. Specifically, we focus on XML transformations where the transformed document is a subsequence of the input document when ignoring mark-up. We call the latter text-preserving XML transformations. We characterize such transformations as copy- and rearrange-free transductions. Furthermore, we study the problem of deciding whether a given XML transducer is text-preserving over a given tree language. We consider top-down transducers as well as the abstraction of XSLT called DTL. We show that deciding whether a transformation is text-preserving over an unranked regular tree language is in PTime for top-down transducers, EXPTime-complete for DTL with XPath, and decidable for DTL with MSO patterns. Finally, we obtain that for every transducer in one of the above mentioned classes, the maximal subset of the input schema can be computed on which the transformation is text-preserving.
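The text-preservation condition described above can be illustrated for a single input/output pair. The following is a minimal sketch, not the decision procedure from the paper: it only checks one concrete pair of documents, whereas the paper decides the property for a transducer over a whole tree language. The function names (`text_words`, `is_subsequence`, `is_text_preserving_pair`) and the whitespace-based word splitting are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def text_words(xml_string):
    """Return the words of a document in order, ignoring all mark-up."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text content in document order.
    return " ".join(root.itertext()).split()


def is_subsequence(candidate, reference):
    """True if candidate can be obtained from reference by deleting items."""
    remaining = iter(reference)
    # `word in remaining` advances the iterator until it finds the word,
    # so the order of words in `reference` is respected.
    return all(word in remaining for word in candidate)


def is_text_preserving_pair(input_xml, output_xml):
    """Check the subsequence condition for one (input, output) pair only."""
    return is_subsequence(text_words(output_xml), text_words(input_xml))


if __name__ == "__main__":
    source = "<doc><title>Hello world</title><p>this is <b>marked up</b> text</p></doc>"
    filtered = "<note><line>Hello <i>this</i> text</line></note>"
    reordered = "<note>text Hello</note>"
    print(is_text_preserving_pair(source, filtered))
    print(is_text_preserving_pair(source, reordered))
```

In this sketch, changing the mark-up and dropping words keeps the pair text-preserving, while reordering words does not.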

Pages : 247-258

Citations: 2

Parallel evaluation of conjunctive queries

Pub Date : 2011-06-13 DOI: 10.1145/1989284.1989310

Paraschos Koutris, Dan Suciu

The availability of large data centers with tens of thousands of servers has led to the popular adoption of massive parallelism for data analysis on large datasets. Several query languages exist for running queries on massively parallel architectures, some based on the MapReduce infrastructure, others using proprietary implementations. Motivated by this trend, this paper analyzes the parallel complexity of conjunctive queries. We propose a very simple model of parallel computation that captures these architectures, in which the complexity parameter is the number of parallel steps requiring synchronization of all servers. We study the complexity of conjunctive queries and give a complete characterization of the queries which can be computed in one parallel step. These form a strict subset of hierarchical queries, and include flat queries like R(x,y), S(x,z), T(x,v), U(x,w), tall queries like R(x), S(x,y), T(x,y,z), U(x,y,z,w), and combinations thereof, which we call tall-flat queries. We describe an algorithm for computing in parallel any tall-flat query, and prove that any query that is not tall-flat cannot be computed in one step in this model. Finally, we present extensions of our results to queries that are not tall-flat.
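To make the one-step idea concrete, here is a minimal single-process simulation for a two-atom flat query of the kind listed above, Q(x, y, z) :- R(x, y), S(x, z). It is a sketch under stated assumptions, not the algorithm from the paper: servers are simulated as Python lists, the helper names (`partition_by_x`, `local_join`, `one_step_join`) are hypothetical, and tuples are routed by hashing the shared variable x so that each server can join its fragments locally after one round of communication.

```python
def partition_by_x(relation, num_servers):
    """Route every tuple to the server chosen by hashing its first attribute x."""
    servers = [[] for _ in range(num_servers)]
    for tup in relation:
        servers[hash(tup[0]) % num_servers].append(tup)
    return servers


def local_join(r_part, s_part):
    """Join R(x, y) and S(x, z) fragments that live on the same server."""
    z_by_x = {}
    for x, z in s_part:
        z_by_x.setdefault(x, []).append(z)
    return [(x, y, z) for x, y in r_part for z in z_by_x.get(x, ())]


def one_step_join(r, s, num_servers):
    """One communication round (the partitioning), then purely local work."""
    r_parts = partition_by_x(r, num_servers)
    s_parts = partition_by_x(s, num_servers)
    result = []
    for r_part, s_part in zip(r_parts, s_parts):
        result.extend(local_join(r_part, s_part))
    return sorted(result)


if __name__ == "__main__":
    R = [(1, "a"), (2, "b"), (3, "c")]
    S = [(1, "u"), (1, "v"), (3, "w")]
    print(one_step_join(R, S, num_servers=4))
```

Because all tuples with the same x value land on the same simulated server, the answer does not depend on the number of servers; only the distribution of work does.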

Pages : 223-234

Citations: 99

Pan-private algorithms via statistics on sketches

Pub Date : 2011-06-13 DOI: 10.1145/1989284.1989290

Darakhshan J. Mir, S. Muthukrishnan, Aleksandar Nikolov, R. Wright

Consider fully dynamic data, where we track data as it gets inserted and deleted. There are well-developed notions of private data analyses with dynamic data, for example, using differential privacy. We want to go beyond privacy, and consider privacy together with security, formulated recently as pan-privacy by Dwork et al. (ICS 2010). Informally, pan-privacy preserves differential privacy while computing desired statistics on the data, even if the internal memory of the algorithm is compromised (say, by a malicious break-in or insider curiosity or by fiat by the government or law). We study pan-private algorithms for basic analyses, like estimating distinct count, moments, and heavy hitter count, with fully dynamic data. We present the first known pan-private algorithms for these problems in the fully dynamic model. Our algorithms rely on sketching techniques popular in streaming: in some cases, we add suitable noise to a previously known sketch, using a novel approach of calibrating noise to the underlying problem structure and the projection matrix of the sketch; in other cases, we maintain certain statistics on sketches; in yet others, we define novel sketches. We also present the first known lower bounds explicitly for pan-privacy, showing our results to be nearly optimal for these problems. Our lower bounds are stronger than those implied by differential privacy or dynamic data streaming alone and hold even if unbounded memory and/or unbounded processing time are allowed. The lower bounds use a noisy decoding argument and exploit a connection between pan-private algorithms and data sanitization.

Pages : 37-48

Citations: 86

Data exchange beyond complete data

Pub Date : 2011-06-13 DOI: 10.1145/1989284.1989293

M. Arenas, Jorge Pérez, Juan L. Reutter

In the traditional data exchange setting, source instances are restricted to be complete in the sense that every fact is either true or false in these instances. Although natural for a typical database translation scenario, this restriction is gradually becoming an impediment to the development of a wide range of applications that need to exchange objects that admit several interpretations. In particular, we are motivated by two specific applications that go beyond the usual data exchange scenario: exchanging incomplete information and exchanging knowledge bases. In this paper, we propose a general framework for data exchange that can deal with these two applications. More specifically, we address the problem of exchanging information given by representation systems, which are essentially finite descriptions of (possibly infinite) sets of complete instances. We make use of the classical semantics of mappings specified by sets of logical sentences to give a meaningful semantics to the notion of exchanging representatives, from which the standard notions of solution, space of solutions, and universal solution naturally arise. We also introduce the notion of strong representation system for a class of mappings, that resembles the concept of strong representation system for a query language. We show the robustness of our proposal by applying it to the two applications mentioned above: exchanging incomplete information and exchanging knowledge bases, which are both instantiations of the exchanging problem for representation systems. We study these two applications in detail, presenting results regarding expressiveness, query answering and complexity of computing solutions, and also algorithms to materialize solutions.

Pages : 83-94

Citations: 53

Beyond simple aggregates: indexing for summary queries

Pub Date : 2011-06-13 DOI: 10.1145/1989284.1989299

Zhewei Wei, K. Yi

Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieves a collection of records from the database that match the query's conditions, while the latter returns an aggregate, such as count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications where users are interested not in the actual records, but some statistics of them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index. However, reporting and aggregation queries provide only two extremes for exploring the data. Data analysts often need more insight into the data distribution than what those simple aggregates provide, and yet certainly do not want the sheer volume of data returned by reporting queries. In this paper, we design indexing techniques that allow for extracting a statistical summary of all the records in the query. The summaries we support include frequent items, quantiles, various sketches, and wavelets, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with the optimal or near-optimal query cost.
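The remark about embedding precomputed aggregates into an index can be illustrated with the simplest case, prefix sums over sorted keys. This sketch shows only the simple-aggregate baseline that the abstract contrasts against, not the summary indexes the paper designs; the class name `PrefixSumIndex` and its methods are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


class PrefixSumIndex:
    """Answers range count and sum queries from precomputed prefix sums."""

    def __init__(self, records):
        # records: iterable of (key, value) pairs, in any order.
        ordered = sorted(records)
        self.keys = [key for key, _ in ordered]
        # prefix[i] holds the sum of the values of the first i records.
        self.prefix = [0] + list(accumulate(value for _, value in ordered))

    def _bounds(self, low, high):
        # Positions of the first key >= low and the first key > high.
        return bisect_left(self.keys, low), bisect_right(self.keys, high)

    def range_count(self, low, high):
        """Number of records with low <= key <= high."""
        start, end = self._bounds(low, high)
        return end - start

    def range_sum(self, low, high):
        """Sum of values of records with low <= key <= high."""
        start, end = self._bounds(low, high)
        return self.prefix[end] - self.prefix[start]


if __name__ == "__main__":
    index = PrefixSumIndex([(5, 10), (1, 3), (9, 7), (5, 2)])
    print(index.range_count(2, 8), index.range_sum(2, 8))
```

Each query costs two binary searches and never touches the matching records themselves, which is why such aggregation queries can be much cheaper than reporting queries.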

Pages : 117-128

Citations: 15

On the complexity of privacy-preserving complex event processing

Pub Date : 2011-06-13 DOI: 10.1145/1989284.1989304

Yeye He, Siddharth Barman, Di Wang, J. Naughton

Complex Event Processing (CEP) systems are stream processing systems that monitor incoming event streams in search of user-specified event patterns. While CEP systems have been adopted in a variety of applications, the privacy implications of event pattern reporting mechanisms have yet to be studied, in stark contrast to the significant amount of attention that has been devoted to privacy for relational systems. In this paper we present a privacy problem that arises when the system must support desired patterns (those that should be reported if detected) and private patterns (those that should not be revealed). We formalize this problem, which we term privacy-preserving, utility-maximizing CEP (PP-CEP), and analyze its complexity under various assumptions. Our results show that this is a rich problem to study and shed some light on the difficulty of developing algorithms that preserve utility without compromising privacy.

Citations: 31

Determining relevance of accesses at runtime

Pub Date : 2011-06-13 DOI: 10.1145/1989284.1989309

Michael Benedikt, G. Gottlob, P. Senellart

Consider the situation where a query is to be answered using Web sources that restrict the accesses that can be made on backend relational data by requiring some attributes to be given as input of the service. The accesses provide lookups on the collection of attribute values that match the binding. They can differ in whether or not they require arguments to be generated from prior accesses. Prior work has focused on the question of whether a query can be answered using a set of data sources, and on developing static access plans (e.g., Datalog programs) that implement query answering. We are interested in dynamic aspects of the query answering problem: given partial information about the data, which accesses could provide relevant data for answering a given query? We consider immediate and long-term notions of "relevant accesses", and ascertain the complexity of query relevance, for both conjunctive queries and arbitrary positive queries. In the process, we relate dynamic relevance of an access to query containment under access limitations and characterize the complexity of this problem; we produce several complexity results about containment that are of interest in their own right.

Pages : 211-222

Citations: 28

Querying graph patterns

Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

Pub Date : 2011-06-13 DOI: 10.1145/1989284.1989307

P. Barceló, L. Libkin, Juan L. Reutter

Graph data appears in a variety of application domains, and many uses of it, such as querying, matching, and transforming data, naturally result in incompletely specified graph data, i.e., graph patterns. While queries need to be posed against such data, techniques for querying patterns are generally lacking, and properties of such queries are not well understood. Our goal is to study the basics of querying graph patterns. We first identify key features of patterns, such as node and label variables and edges specified by regular expressions, and define a classification of patterns based on them. We then study standard graph queries on graph patterns, and give precise characterizations of both data and combined complexity for each class of patterns. If complexity is high, we do further analysis of features that lead to intractability, as well as restrictions that lower the complexity. We introduce a new automata model for query answering with two modes of acceptance: one captures queries returning nodes, and the other captures queries returning paths. We study properties of such automata, and the key computational tasks associated with them. Finally, we provide additional restrictions for tractability, and show that some intractable cases can be naturally cast as instances of the constraint satisfaction problem.
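As background for edges specified by regular expressions, here is a sketch of the standard product construction for evaluating a regular path query on a fully specified edge-labeled graph. It is not the automata model introduced in the paper and does not handle node or label variables. The function name `rpq_reachable` and the dictionary encoding of the automaton are assumptions made for illustration.

```python
from collections import deque


def rpq_reachable(edges, nfa, start_node):
    """Return nodes reachable from start_node by a path whose labels the NFA accepts.

    edges: iterable of (source, label, target) triples.
    nfa: dict with "start" (a state), "accept" (a set of states), and
         "delta" mapping (state, label) to a set of successor states.
    """
    adjacency = {}
    for source, label, target in edges:
        adjacency.setdefault(source, []).append((label, target))

    # Breadth-first search over (graph node, automaton state) pairs.
    initial = (start_node, nfa["start"])
    seen = {initial}
    queue = deque([initial])
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in nfa["accept"]:
            answers.add(node)
        for label, target in adjacency.get(node, []):
            for next_state in nfa["delta"].get((state, label), ()):
                pair = (target, next_state)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return answers


# Hand-built automaton for the regular expression a b*.
nfa = {"start": 0, "accept": {1}, "delta": {(0, "a"): {1}, (1, "b"): {1}}}
edges = [("u", "a", "v"), ("v", "b", "w"), ("w", "b", "x"), ("u", "b", "y")]
matches = rpq_reachable(edges, nfa, "u")
```

The search visits each (node, state) pair at most once, so the cost is polynomial in the size of the graph and the automaton. Patterns with variables are harder precisely because one pattern stands for many such concrete graphs.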

Pages: 199-210

Citations: 72

FIFO indexes for decomposable problems

Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

Pub Date : 2011-06-13 DOI: 10.1145/1989284.1989291

Cheng Sheng, Yufei Tao

This paper studies first-in-first-out (FIFO) indexes, each of which manages a dataset where objects are deleted in the same order as they were inserted. We give a technique that converts a static data structure to a FIFO index for all decomposable problems, provided that the static structure can be constructed efficiently. We present FIFO access methods to solve several problems including half-plane search, nearest neighbor search, and extreme-point search. All of our structures consume linear space, and have optimal or near-optimal query cost.
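A problem is decomposable when the answer over a union of datasets can be combined from the answers over each part. The sketch below illustrates only that property with a naive chunking scheme for 1-D nearest neighbor search; it is not the conversion technique from the paper and makes no claim to its space or query bounds. The class name `ChunkedFifoIndex`, the chunk size, and the use of a sorted list as the "static structure" are all assumptions.

```python
import bisect
from collections import deque


class ChunkedFifoIndex:
    """FIFO container answering 1-D nearest-neighbor queries (illustrative only)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.head = deque()    # oldest objects, scanned directly
        self.sealed = deque()  # static chunks: (insertion-ordered, sorted copy)
        self.tail = []         # newest objects, scanned directly

    def insert(self, x):
        self.tail.append(x)
        if len(self.tail) == self.chunk_size:
            # "Build the static structure" for a full chunk: a sorted copy.
            self.sealed.append((self.tail, sorted(self.tail)))
            self.tail = []

    def delete_oldest(self):
        if not self.head:
            if self.sealed:
                # Unseal the oldest static chunk and discard its sorted copy.
                items, _ = self.sealed.popleft()
                self.head = deque(items)
            elif self.tail:
                self.head = deque(self.tail)
                self.tail = []
            else:
                raise IndexError("delete from empty index")
        return self.head.popleft()

    def nearest(self, q):
        # Decomposability: collect a local answer per part, then combine.
        candidates = list(self.head) + self.tail
        for _, sorted_items in self.sealed:
            i = bisect.bisect_left(sorted_items, q)
            candidates.extend(sorted_items[max(i - 1, 0):i + 1])
        if not candidates:
            return None
        return min(candidates, key=lambda x: abs(x - q))


index = ChunkedFifoIndex(chunk_size=2)
for value in [10, 1, 7, 3, 20]:
    index.insert(value)
before = index.nearest(8)
removed = [index.delete_oldest(), index.delete_oldest()]
after = index.nearest(0)
```

Because deletions always hit the oldest objects, only the oldest chunk is ever taken apart, while every other static chunk stays untouched until its turn comes. That is the structural advantage FIFO order offers over arbitrary deletions.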

Citations: 2