AFFINITY: Efficiently querying statistical measures on time-series data
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544879
Saket K. Sathe, K. Aberer
Computing statistical measures for large databases of time series is a fundamental primitive for querying and mining time-series data [1]-[6]. This primitive is gaining importance with the increasing number and rapid growth of time-series databases. In this paper, we introduce a framework for the efficient computation of statistical measures by exploiting the concept of affine relationships. Affine relationships can be used to infer statistical measures for a time series from other, related time series instead of computing them directly, thus reducing the overall computational cost significantly. The resulting methods exhibit at least one order of magnitude improvement over the best known methods. To the best of our knowledge, this is the first work that presents a unified approach for computing and querying several statistical measures at once. Our approach exploits affine relationships using three key components. First, the AFCLST algorithm clusters the time-series data such that high-quality affine relationships can be easily found. Second, the SYMEX algorithm uses the clustered time series and efficiently computes the desired affine relationships. Third, the SCAPE index structure produces a many-fold improvement in the performance of processing several statistical queries by seamlessly indexing the affine relationships. Finally, we establish the effectiveness of our approaches through a comprehensive experimental evaluation on real datasets.
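A minimal sketch of the affine idea (our own Python example, not the paper's AFCLST/SYMEX code): if a series y is an affine transform of another series x, i.e., y = a*x + b, then y's statistical measures follow from x's in constant time, which is what makes inferring measures cheaper than recomputing them.

```python
import numpy as np

# Minimal sketch of the affine idea (our example, not AFCLST/SYMEX): if
# y = a*x + b, y's measures follow from x's without scanning y's raw data.
a, b = 2.5, -1.0
x = np.random.randn(10_000)
y = a * x + b                      # y exists, but below we only use (a, b)

mean_y = a * x.mean() + b          # E[y]   = a*E[x] + b
var_y  = a**2 * x.var()            # Var[y] = a^2 * Var[x]
std_y  = abs(a) * x.std()          # Std[y] = |a| * Std[x]

# verification only: the inferred measures match direct computation on y
assert np.isclose(mean_y, y.mean())
assert np.isclose(var_y, y.var())
assert np.isclose(std_y, y.std())
```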
{"title":"AFFINITY: Efficiently querying statistical measures on time-series data","authors":"Saket K. Sathe, K. Aberer","doi":"10.1109/ICDE.2013.6544879","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544879","url":null,"abstract":"Computing statistical measures for large databases of time series is a fundamental primitive for querying and mining time-series data [1]-[6]. This primitive is gaining importance with the increasing number and rapid growth of time series databases. In this paper, we introduce a framework for efficient computation of statistical measures by exploiting the concept of affine relationships. Affine relationships can be used to infer statistical measures for time series, from other related time series, instead of computing them directly; thus, reducing the overall computational cost significantly. The resulting methods exhibit at least one order of magnitude improvement over the best known methods. To the best of our knowledge, this is the first work that presents an unified approach for computing and querying several statistical measures at once. Our approach exploits affine relationships using three key components. First, the AFCLST algorithm clusters the time-series data, such that high-quality affine relationships could be easily found. Second, the SYMEX algorithm uses the clustered time series and efficiently computes the desired affine relationships. Third, the SCAPE index structure produces a many-fold improvement in the performance of processing several statistical queries by seamlessly indexing the affine relationships. Finally, we establish the effectiveness of our approaches by performing comprehensive experimental evaluation on real datasets.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127512839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Holistic data cleaning: Putting violations into context
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544847
Xu Chu, I. Ilyas, Paolo Papotti
Data cleaning is an important problem, and data quality rules are the most promising way to address it declaratively. Previous work has focused on specific formalisms, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and matching dependencies (MDs), and these have always been studied in isolation. Moreover, such techniques are usually applied in a pipeline or interleaved. In this work we tackle the problem in a novel, unified framework. First, we let users specify quality rules using denial constraints with ad-hoc predicates. This language subsumes existing formalisms and can express rules involving numerical values, with predicates such as “greater than” and “less than”. More importantly, we exploit the interaction of the heterogeneous constraints by encoding them in a conflict hypergraph. Such a holistic view of the conflicts is the starting point for a novel definition of repair context, which allows us to automatically compute repairs of better quality than previous approaches in the literature. Experimental results on real datasets show that the holistic approach outperforms previous algorithms in terms of the quality and efficiency of the repair.
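To make the denial-constraint and conflict-hypergraph machinery concrete, here is a hedged toy sketch (the relation, constraint, and names below are ours, not the paper's): every violating combination of tuples contributes a hyperedge over the cells that jointly cause the violation, and any repair must change at least one cell per edge.

```python
from itertools import combinations

# Toy denial constraint over (state, salary, rate):
# not(t1.state = t2.state and t1.salary > t2.salary and t1.rate < t2.rate)
rows = [
    {"id": 1, "state": "NY", "salary": 90, "rate": 7},
    {"id": 2, "state": "NY", "salary": 80, "rate": 9},   # conflicts with row 1
    {"id": 3, "state": "CA", "salary": 70, "rate": 5},
]

def dc_violated(t1, t2):
    return (t1["state"] == t2["state"]
            and t1["salary"] > t2["salary"]
            and t1["rate"] < t2["rate"])

hyperedges = []
for t1, t2 in combinations(rows, 2):
    for a, b in ((t1, t2), (t2, t1)):     # the constraint is order-sensitive
        if dc_violated(a, b):
            # one hyperedge = the set of cells involved in this violation
            hyperedges.append({(a["id"], "state"), (b["id"], "state"),
                               (a["id"], "salary"), (b["id"], "salary"),
                               (a["id"], "rate"), (b["id"], "rate")})

print(hyperedges)  # a repair must change at least one cell in every edge
```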
{"title":"Holistic data cleaning: Putting violations into context","authors":"Xu Chu, I. Ilyas, Paolo Papotti","doi":"10.1109/ICDE.2013.6544847","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544847","url":null,"abstract":"Data cleaning is an important problem and data quality rules are the most promising way to face it with a declarative approach. Previous work has focused on specific formalisms, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and matching dependencies (MDs), and those have always been studied in isolation. Moreover, such techniques are usually applied in a pipeline or interleaved. In this work we tackle the problem in a novel, unified framework. First, we let users specify quality rules using denial constraints with ad-hoc predicates. This language subsumes existing formalisms and can express rules involving numerical values, with predicates such as “greater than” and “less than”. More importantly, we exploit the interaction of the heterogeneous constraints by encoding them in a conflict hypergraph. Such holistic view of the conflicts is the starting point for a novel definition of repair context which allows us to compute automatically repairs of better quality w.r.t. previous approaches in the literature. Experimental results on real datasets show that the holistic approach outperforms previous algorithms in terms of quality and efficiency of the repair.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132006237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust distributed stream processing
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544877
Chuan Lei, Elke A. Rundensteiner, J. Guttman
Distributed stream processing systems must function efficiently for data streams that fluctuate in their arrival rates and data distributions. Yet repeated and prohibitively expensive load re-allocation across machines can make these systems ineffective, potentially resulting in data loss or even system failure. To overcome this problem, we instead propose a robust load distribution (RLD) strategy that is resilient to data fluctuations. RLD provides ϵ-optimal query performance under load fluctuations without suffering the performance penalty caused by load migration. RLD is based on three key strategies. First, we model robust distributed stream processing as a parametric query optimization problem. The notions of robust logical and robust physical plans are then defined as overlays of this parameter space. Second, our Early-terminated Robust Partitioning (ERP) finds a set of robust logical plans covering the parameter space while minimizing the number of prohibitively expensive optimizer calls, with a probabilistic bound on the space coverage. Third, our OptPrune algorithm maps the space-covering logical solution to a single robust physical plan, tolerant to deviations in data statistics, that maximizes the parameter space coverage at runtime. Our experimental study using stock market and sensor network streams demonstrates that our RLD methodology consistently outperforms state-of-the-art solutions in terms of efficiency and effectiveness in highly fluctuating data stream environments.
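A toy illustration of the robust-plan notion (our simplified cost model, not ERP or OptPrune): a plan is ϵ-robust over a parameter region if its cost at every point stays within (1+ϵ) of the per-point optimum.

```python
import itertools

# Toy model of the eps-robust plan notion (our cost functions, not the
# paper's): a plan is robust over a parameter region if its cost at every
# point stays within (1 + eps) of the best plan's cost at that point.
rates = [100, 1_000, 10_000]            # stream arrival rate (tuples/sec)
selectivities = [0.01, 0.1, 0.5]        # operator selectivity

def cost(plan, rate, sel):
    # plan A is cheap at low load; plan B pays a fixed setup but scales better
    return {"A": 2.0 * rate * sel, "B": 500 + 0.5 * rate * sel}[plan]

eps = 0.5
for plan in ("A", "B"):
    robust = all(
        cost(plan, r, s) <= (1 + eps) * min(cost(p, r, s) for p in ("A", "B"))
        for r, s in itertools.product(rates, selectivities)
    )
    print(plan, "is eps-robust over the whole region:", robust)
```

In this toy model neither plan alone is robust over the whole region, which illustrates why a set of robust plans covering the space is found first and only then mapped to a single physical plan.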
{"title":"Robust distributed stream processing","authors":"Chuan Lei, Elke A. Rundensteiner, J. Guttman","doi":"10.1109/ICDE.2013.6544877","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544877","url":null,"abstract":"Distributed stream processing systems must function efficiently for data streams that fluctuate in their arrival rates and data distributions. Yet repeated and prohibitively expensive load re-allocation across machines may make these systems ineffective, potentially resulting in data loss or even system failure. To overcome this problem, we instead propose a load distribution (RLD) strategy that is robust to data fluctuations. RLD provides ϵ-optimal query performance under load fluctuations without suffering from the performance penalty caused by load migration. RLD is based on three key strategies. First, we model robust distributed stream processing as a parametric query optimization problem. The notions of robust logical and robust physical plans then are overlays of this parameter space. Second, our Early-terminated Robust Partitioning (ERP) finds a set of robust logical plans, covering the parameter space, while minimizing the number of prohibitively expensive optimizer calls with a probabilistic bound on the space coverage. Third, our OptPrune algorithm maps the space-covering logical solution to a single robust physical plan tolerant to deviations in data statistics that maximizes the parameter space coverage at runtime. Our experimental study using stock market and sensor networks streams demonstrates that our RLD methodology consistently outperforms state-of-the-art solutions in terms of efficiency and effectiveness in highly fluctuating data stream environments.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130692892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HFMS: Managing the lifecycle and complexity of hybrid analytic data flows
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544907
A. Simitsis, K. Wilkinson, U. Dayal, M. Hsu
To remain competitive, enterprises are evolving their business intelligence systems to provide dynamic, near real-time views of business activities. To enable this, they deploy complex workflows of analytic data flows that access multiple storage repositories and execution engines and that span the enterprise and even reach beyond it. We call these multi-engine flows hybrid flows. Designing and optimizing hybrid flows is a challenging task. Managing a workload of hybrid flows is even more challenging, since their execution engines are likely under different administrative domains and there is no single point of control. To address these needs, we present a Hybrid Flow Management System (HFMS), an independent software layer over a number of independent execution engines and storage repositories. It simplifies the design of analytic data flows and includes optimizer and executor modules that produce optimized executable flows able to run across multiple execution engines. HFMS dispatches flows for execution and monitors their progress. To meet service-level objectives for a workload, it may dynamically change a flow's execution plan to avoid processing bottlenecks in the computing infrastructure. We present the architecture of HFMS and describe its components. To demonstrate its potential benefit, we report performance results for running sample batch workloads with and without HFMS. The ability to monitor multiple execution engines and to dynamically adjust plans enables HFMS to provide better service guarantees and better system utilization.
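The dispatch-and-replan loop the abstract describes can be pictured with a small sketch (all classes and interfaces below are our illustrative assumptions; the abstract does not expose HFMS's actual APIs):

```python
# Illustrative dispatch-and-replan loop (names and interfaces are assumptions
# made for this sketch, not HFMS's real components): each flow fragment is
# dispatched to an engine; an overloaded engine triggers a plan change.
class Engine:
    def __init__(self, name, capacity):
        self.name, self.load, self.capacity = name, 0, capacity
    def overloaded(self):
        return self.load >= self.capacity
    def run(self, fragment):
        self.load += 1
        print(f"{self.name} runs {fragment}")

def dispatch(fragments, plan, engines):
    for frag in fragments:
        engine = engines[plan[frag]]
        if engine.overloaded():            # SLO at risk: re-route the fragment
            engine = min(engines.values(), key=lambda e: e.load)
            plan[frag] = engine.name
        engine.run(frag)
    return plan

engines = {"hadoop": Engine("hadoop", 1), "dbms": Engine("dbms", 2)}
plan = {"extract": "hadoop", "aggregate": "hadoop", "load": "dbms"}
print(dispatch(["extract", "aggregate", "load"], plan, engines))
```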
{"title":"HFMS: Managing the lifecycle and complexity of hybrid analytic data flows","authors":"A. Simitsis, K. Wilkinson, U. Dayal, M. Hsu","doi":"10.1109/ICDE.2013.6544907","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544907","url":null,"abstract":"To remain competitive, enterprises are evolving their business intelligence systems to provide dynamic, near realtime views of business activities. To enable this, they deploy complex workflows of analytic data flows that access multiple storage repositories and execution engines and that span the enterprise and even outside the enterprise. We call these multi-engine flows hybrid flows. Designing and optimizing hybrid flows is a challenging task. Managing a workload of hybrid flows is even more challenging since their execution engines are likely under different administrative domains and there is no single point of control. To address these needs, we present a Hybrid Flow Management System (HFMS). It is an independent software layer over a number of independent execution engines and storage repositories. It simplifies the design of analytic data flows and includes optimization and executor modules to produce optimized executable flows that can run across multiple execution engines. HFMS dispatches flows for execution and monitors their progress. To meet service level objectives for a workload, it may dynamically change a flow's execution plan to avoid processing bottlenecks in the computing infrastructure. We present the architecture of HFMS and describe its components. To demonstrate its potential benefit, we describe performance results for running sample batch workloads with and without HFMS. The ability to monitor multiple execution engines and to dynamically adjust plans enables HFMS to provide better service guarantees and better system utilization.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130746538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inferring data currency and consistency for conflict resolution
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544848
W. Fan, Floris Geerts, N. Tang, Wenyuan Yu
This paper introduces a new approach to conflict resolution: given a set of tuples pertaining to the same entity, the goal is to identify a single tuple in which each attribute has the latest, consistent value in the set. This problem is important in data integration, data cleaning, and query answering. It is, however, challenging since, in practice, reliable timestamps are often absent, among other complications. We propose a model for conflict resolution by specifying data currency in terms of partial currency orders and currency constraints, and by enforcing data consistency with constant conditional functional dependencies. We show that identifying data currency orders helps us repair inconsistent data, and vice versa. We investigate a number of fundamental problems associated with conflict resolution and establish their complexity. In addition, we introduce a framework and develop algorithms for conflict resolution, integrating data currency and consistency inferences into a single process and interacting with users. We experimentally verify the accuracy and efficiency of our methods using real-life and synthetic data.
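A minimal sketch of the currency idea on a marital-status example (the encoding below is our toy construction, not the paper's model): even without timestamps, currency constraints induce a partial order on an attribute's values, and the resolved tuple takes a value that no other observed value dominates.

```python
# Toy encoding of currency constraints (our construction, not the paper's):
# a partial order on values stands in for missing timestamps.
tuples = [
    {"name": "Mary", "status": "single",  "city": "NYC"},
    {"name": "Mary", "status": "married", "city": "SF"},
]

# currency constraint: status can change from single to married but not back,
# so "married" is more current than "single"
more_current = {("married", "single")}   # (newer, older) pairs

def latest(values, order):
    dominated = {old for (_, old) in order}
    maximal = [v for v in set(values) if v not in dominated]
    return maximal[0] if len(maximal) == 1 else None  # None: currency undecided

print(latest([t["status"] for t in tuples], more_current))  # -> married
```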
{"title":"Inferring data currency and consistency for conflict resolution","authors":"W. Fan, Floris Geerts, N. Tang, Wenyuan Yu","doi":"10.1109/ICDE.2013.6544848","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544848","url":null,"abstract":"This paper introduces a new approach for conflict resolution: given a set of tuples pertaining to the same entity, it is to identify a single tuple in which each attribute has the latest and consistent value in the set. This problem is important in data integration, data cleaning and query answering. It is, however, challenging since in practice, reliable timestamps are often absent, among other things. We propose a model for conflict resolution, by specifying data currency in terms of partial currency orders and currency constraints, and by enforcing data consistency with constant conditional functional dependencies. We show that identifying data currency orders helps us repair inconsistent data, and vice versa. We investigate a number of fundamental problems associated with conflict resolution, and establish their complexity. In addition, we introduce a framework and develop algorithms for conflict resolution, by integrating data currency and consistency inferences into a single process, and by interacting with users. We experimentally verify the accuracy and efficiency of our methods using real-life and synthetic data.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125894156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The adaptive radix tree: ARTful indexing for main-memory databases
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544812
Viktor Leis, A. Kemper, Thomas Neumann
Main memory capacities have grown to the point where most databases fit into RAM. For main-memory database systems, index structure performance is a critical bottleneck. Traditional in-memory data structures like balanced binary search trees are not efficient on modern hardware because they do not optimally utilize on-CPU caches. Hash tables, also often used for main-memory indexes, are fast but support only point queries. To overcome these shortcomings, we present ART, an adaptive radix tree (trie) for efficient indexing in main memory. Its lookup performance surpasses highly tuned, read-only search trees, while also supporting very efficient insertions and deletions. At the same time, ART is very space efficient and solves the problem of excessive worst-case space consumption, which plagues most radix trees, by adaptively choosing compact and efficient data structures for internal nodes. Even though ART's performance is comparable to hash tables, it maintains the data in sorted order, which enables additional operations like range scans and prefix lookups.
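A simplified sketch of ART's adaptive-node idea (Python stand-in for brevity; the paper's actual nodes are Node4/16/48/256 in C++ with SIMD search, path compression, and lazy expansion): a node starts in the smallest layout and is swapped for the next larger one when it fills up, keeping space proportional to actual fan-out.

```python
# Simplified sketch of adaptive nodes: only the Node4 -> Node16 growth step
# is shown; real ART continues to Node48 and Node256.
class Node4:
    """Up to 4 children: parallel arrays of key bytes and child pointers."""
    CAP = 4
    def __init__(self):
        self.keys, self.children = [], []
    def find(self, byte):
        for k, c in zip(self.keys, self.children):
            if k == byte:
                return c
        return None
    def insert(self, byte, child):
        if len(self.keys) == self.CAP:     # full: grow to the next node type
            return self.grow().insert(byte, child)
        self.keys.append(byte)
        self.children.append(child)
        return self                        # caller re-links the returned node
    def grow(self):
        bigger = Node16()
        bigger.keys, bigger.children = list(self.keys), list(self.children)
        return bigger

class Node16(Node4):
    """Same layout with capacity 16; real ART searches the keys with SIMD."""
    CAP = 16
    def grow(self):
        raise NotImplementedError("would grow to Node48, then Node256")

node = Node4()
for b in range(5):                         # the fifth insert triggers growth
    node = node.insert(b, f"child{b}")
print(type(node).__name__, node.keys)      # Node16 [0, 1, 2, 3, 4]
```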
{"title":"The adaptive radix tree: ARTful indexing for main-memory databases","authors":"Viktor Leis, A. Kemper, Thomas Neumann","doi":"10.1109/ICDE.2013.6544812","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544812","url":null,"abstract":"Main memory capacities have grown up to a point where most databases fit into RAM. For main-memory database systems, index structure performance is a critical bottleneck. Traditional in-memory data structures like balanced binary search trees are not efficient on modern hardware, because they do not optimally utilize on-CPU caches. Hash tables, also often used for main-memory indexes, are fast but only support point queries. To overcome these shortcomings, we present ART, an adaptive radix tree (trie) for efficient indexing in main memory. Its lookup performance surpasses highly tuned, read-only search trees, while supporting very efficient insertions and deletions as well. At the same time, ART is very space efficient and solves the problem of excessive worst-case space consumption, which plagues most radix trees, by adaptively choosing compact and efficient data structures for internal nodes. Even though ART's performance is comparable to hash tables, it maintains the data in sorted order, which enables additional operations like range scan and prefix lookup.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114721273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CPU and cache efficient management of memory-resident databases
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544810
H. Pirk, Florian Funke, M. Grund, Thomas Neumann, U. Leser, S. Manegold, A. Kemper, M. Kersten
Memory-Resident Database Management Systems (MRDBMS) have to be optimized for two resources: CPU cycles and memory bandwidth. To optimize for bandwidth in mixed OLTP/OLAP scenarios, the hybrid or Partially Decomposed Storage Model (PDSM) has been proposed. However, in current implementations, the bandwidth savings achieved by partial decomposition come at the price of increased CPU costs. To achieve the desired bandwidth savings without sacrificing CPU efficiency, we combine partially decomposed storage with Just-in-Time (JiT) compilation of queries, thus eliminating CPU-inefficient function calls. Since existing cost-based optimization components are not designed for JiT-compiled query execution, we also develop a novel approach to cost modeling and subsequent storage layout optimization. Our evaluation shows that the JiT-based processor maintains the bandwidth savings of previously presented hybrid query processors but outperforms them by two orders of magnitude due to increased CPU efficiency.
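A toy picture of partial decomposition (our example; the paper's cost model and JiT query compiler are not reproduced here): columns that the workload co-accesses are stored together while the rest are split apart, so a mixed OLTP/OLAP workload touches fewer cache lines than with a pure row or column layout.

```python
# Toy illustration of the Partially Decomposed Storage Model (our example):
# column groups are chosen from the workload rather than fixed a priori.
rows = [(1, "alice", 100.0, "2013-04-08"),
        (2, "bob",   250.0, "2013-04-09")]

# NSM (row store):    whole tuples together  -> good for OLTP point access
# DSM (column store): one array per column   -> good for OLAP scans
# PDSM: workload-derived partitions, e.g. {id, name} and {amount, date}
pdsm = {
    ("id", "name"):     [(r[0], r[1]) for r in rows],
    ("amount", "date"): [(r[2], r[3]) for r in rows],
}
print(pdsm[("amount", "date")])  # an analytical scan reads only this partition
```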
{"title":"CPU and cache efficient management of memory-resident databases","authors":"H. Pirk, Florian Funke, M. Grund, Thomas Neumann, U. Leser, S. Manegold, A. Kemper, M. Kersten","doi":"10.1109/ICDE.2013.6544810","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544810","url":null,"abstract":"Memory-Resident Database Management Systems (MRDBMS) have to be optimized for two resources: CPU cycles and memory bandwidth. To optimize for bandwidth in mixed OLTP/OLAP scenarios, the hybrid or Partially Decomposed Storage Model (PDSM) has been proposed. However, in current implementations, bandwidth savings achieved by partial decomposition come at increased CPU costs. To achieve the aspired bandwidth savings without sacrificing CPU efficiency, we combine partially decomposed storage with Just-in-Time (JiT) compilation of queries, thus eliminating CPU inefficient function calls. Since existing cost based optimization components are not designed for JiT-compiled query execution, we also develop a novel approach to cost modeling and subsequent storage layout optimization. Our evaluation shows that the JiT-based processor maintains the bandwidth savings of previously presented hybrid query processors but outperforms them by two orders of magnitude due to increased CPU efficiency.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115138232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Breaking the top-k barrier of hidden web databases?
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544896
Saravanan Thirumuruganathan, Nan Zhang, Gautam Das
A large number of web databases are only accessible through proprietary form-like interfaces that require users to query the system by entering desired values for a few attributes. A key restriction enforced by such an interface is the top-k output constraint, i.e., when there are a large number of matching tuples, only a few (top-k) of them are preferentially selected and returned by the website, often according to a proprietary ranking function. Since most web database owners set k to be a small value, the top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. In this paper we consider the novel problem of “digging deeper” into such web databases. Our main contribution is the meta-algorithm GetNext, which can retrieve the next ranked tuple from the hidden web database using only its restrictive interface, without any prior knowledge of its ranking function. This algorithm can then be called iteratively to retrieve as many top ranked tuples as necessary. We develop principled and efficient algorithms that are based on generating and executing multiple reformulated queries and inferring the next ranked tuple from their returned results. We provide theoretical analysis of our algorithms, as well as extensive experimental results over synthetic and real-world databases that illustrate the effectiveness of our techniques.
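A hedged sketch of the "digging deeper" idea against a mock top-k interface (the interface and recursion below are our toy reconstruction, not the GetNext algorithm itself): when a query overflows the top-k limit, it is reformulated with additional predicates until each narrower query returns all of its matches.

```python
# Toy reconstruction of query reformulation behind a top-k interface (our
# mock-up, not GetNext): split overflowing queries on a free attribute.
K = 2
DB = [{"make": "honda", "year": y} for y in (2010, 2011, 2012, 2013)]

def topk(pred):                    # the only access path the website allows
    matches = [t for t in DB if all(t[a] == v for a, v in pred.items())]
    return matches[:K], len(matches) > K    # results, overflow flag

def crawl(pred, free_attr, domain):
    results, overflow = topk(pred)
    if not overflow:
        return results             # nothing is hidden behind the top-k cutoff
    out = []                       # overflow: recurse on narrower queries
    # (a real crawler must pick a fresh attribute each level to terminate)
    for v in domain:
        out += crawl({**pred, free_attr: v}, free_attr, domain)
    return out

print(crawl({"make": "honda"}, "year", [2010, 2011, 2012, 2013]))
```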
{"title":"Breaking the top-k barrier of hidden web databases?","authors":"Saravanan Thirumuruganathan, Nan Zhang, Gautam Das","doi":"10.1109/ICDE.2013.6544896","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544896","url":null,"abstract":"A large number of web databases are only accessible through proprietary form-like interfaces which require users to query the system by entering desired values for a few attributes. A key restriction enforced by such an interface is the top-k output constraint - i.e., when there are a large number of matching tuples, only a few (top-k) of them are preferentially selected and returned by the website, often according to a proprietary ranking function. Since most web database owners set k to be a small value, the top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. In this paper we consider the novel problem of “digging deeper” into such web databases. Our main contribution is the meta-algorithm GetNext that can retrieve the next ranked tuple from the hidden web database using only the restrictive interface of a web database without any prior knowledge of its ranking function. This algorithm can then be called iteratively to retrieve as many top ranked tuples as necessary. We develop principled and efficient algorithms that are based on generating and executing multiple reformulated queries and inferring the next ranked tuple from their returned results. We provide theoretical analysis of our algorithms, as well as extensive experimental results over synthetic and real-world databases that illustrate the effectiveness of our techniques.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"538 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123369314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Crowdsourced enumeration queries
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544865
Beth Trushkowsky, Tim Kraska, M. Franklin, Purnamrita Sarkar
Hybrid human/computer database systems promise to greatly expand the usefulness of query processing by incorporating the crowd for data gathering and other tasks. Such systems raise many implementation questions. Perhaps the most fundamental issue is that the closed-world assumption underlying relational query semantics does not hold in such systems. As a consequence, the meaning of even simple queries can be called into question. Furthermore, query progress monitoring becomes difficult due to non-uniformities in the arrival of crowdsourced data and peculiarities of how people work in crowdsourcing systems. To address these issues, we develop statistical tools that enable users and systems developers to reason about query completeness. These tools can also help drive query execution and crowdsourcing strategies. We evaluate our techniques using experiments on a popular crowdsourcing platform.
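One classical statistical tool in this spirit is species estimation (our choice for illustration; the paper develops its own estimators): how often crowd answers repeat predicts how many distinct answers remain unseen. A sketch using the bias-corrected Chao1 estimator:

```python
from collections import Counter

# Chao1 species-richness estimator applied to crowd answers (illustrative;
# not the paper's estimator): many singletons suggest many unseen answers,
# few singletons suggest the answer list is nearly complete.
answers = ["NY", "CA", "NY", "TX", "CA", "WA", "NY", "FL"]  # crowd responses

counts = Counter(answers)
s_obs = len(counts)                              # distinct answers seen
f1 = sum(1 for c in counts.values() if c == 1)   # seen exactly once
f2 = sum(1 for c in counts.values() if c == 2)   # seen exactly twice

s_est = s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))   # bias-corrected Chao1
print(f"observed {s_obs}, estimated total distinct answers: {s_est:.1f}")
```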
{"title":"Crowdsourced enumeration queries","authors":"Beth Trushkowsky, Tim Kraska, M. Franklin, Purnamrita Sarkar","doi":"10.1109/ICDE.2013.6544865","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544865","url":null,"abstract":"Hybrid human/computer database systems promise to greatly expand the usefulness of query processing by incorporating the crowd for data gathering and other tasks. Such systems raise many implementation questions. Perhaps the most fundamental question is that the closed world assumption underlying relational query semantics does not hold in such systems. As a consequence the meaning of even simple queries can be called into question. Furthermore, query progress monitoring becomes difficult due to non-uniformities in the arrival of crowdsourced data and peculiarities of how people work in crowdsourcing systems. To address these issues, we develop statistical tools that enable users and systems developers to reason about query completeness. These tools can also help drive query execution and crowdsourcing strategies. We evaluate our techniques using experiments on a popular crowdsourcing platform.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124851880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interval indexing and querying on key-value cloud stores
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544876
G. Sfakianakis, I. Patlakas, Nikos Ntarmos, P. Triantafillou
Cloud key-value stores are becoming increasingly important. Challenging applications requiring efficient and scalable access to massive data arise every day. We focus on supporting interval queries, which are prevalent in several data-intensive applications (such as temporal querying for temporal analytics) and for which an efficient solution is lacking. We contribute a compound interval index structure comprising two tiers: (i) the MRSegmentTree (MRST), a key-value representation of the Segment Tree, and (ii) the Endpoints Index (EPI), a column-family index that stores information about interval endpoints. Beyond the index structure itself, our contributions include: (i) algorithms for efficiently constructing and populating our indices using MapReduce jobs, (ii) techniques for efficient and scalable index maintenance, and (iii) algorithms for processing interval queries. We have implemented all algorithms using HBase and Hadoop and conducted a detailed performance evaluation. We quantify the costs associated with constructing the indices and evaluate our query processing algorithms using queries on real datasets. We compare the performance of our approach to two alternatives: the native support for interval queries provided by HBase, and the execution of such queries using the Hive query execution tool. Our results show a significant speedup, far outperforming the state of the art.
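A toy sketch of the endpoints-index idea (our simplification; the MRST tier and the HBase layout are not reproduced): with interval start points kept in sorted order, as a key-value store's range scan would keep them, a stabbing query q retrieves intervals with start <= q and filters them by end >= q.

```python
import bisect

# Toy endpoints index (our simplification of EPI): sorted starts support a
# range scan up to q; the end map filters candidates that close before q.
intervals = [("a", 1, 5), ("b", 3, 9), ("c", 7, 8)]   # (name, start, end)

starts = sorted((s, name) for name, s, _ in intervals)
ends = {name: e for name, _, e in intervals}

def stab(q):
    # candidates: all intervals whose start is <= q (a KV-store range scan)
    i = bisect.bisect_right(starts, (q, chr(0x10FFFF)))
    return [name for _, name in starts[:i] if ends[name] >= q]

print(stab(4))  # -> ['a', 'b']
```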
{"title":"Interval indexing and querying on key-value cloud stores","authors":"G. Sfakianakis, I. Patlakas, Nikos Ntarmos, P. Triantafillou","doi":"10.1109/ICDE.2013.6544876","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544876","url":null,"abstract":"Cloud key-value stores are becoming increasingly more important. Challenging applications, requiring efficient and scalable access to massive data, arise every day. We focus on supporting interval queries (which are prevalent in several data intensive applications, such as temporal querying for temporal analytics), an efficient solution for which is lacking. We contribute a compound interval index structure, comprised of two tiers: (i) the MRSegmentTree (MRST), a key-value representation of the Segment Tree, and (ii) the Endpoints Index (EPI), a column family index that stores information for interval endpoints. In addition to the above, our contributions include: (i) algorithms for efficiently constructing and populating our indices using MapReduce jobs, (ii) techniques for efficient and scalable index maintenance, and (iii) algorithms for processing interval queries. We have implemented all algorithms using HBase and Hadoop, and conducted a detailed performance evaluation. We quantify the costs associated with the construction of the indices, and evaluate our query processing algorithms using queries on real data sets. We compare the performance of our approach to two alternatives: the native support for interval queries provided in HBase, and the execution of such queries using the Hive query execution tool. Our results show a significant speedup, far outperforming the state of the art.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125237387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}