OLAP over probabilistic data cubes I: Aggregating, materializing, and querying

2016 IEEE 32nd International Conference on Data Engineering (ICDE). Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498291

Xike Xie, Xingjun Hao, T. Pedersen, Peiquan Jin, Jinchuan Chen

Pages: 799-810

Citations: 15

Abstract

On-Line Analytical Processing (OLAP) enables powerful analytics by quickly computing aggregate values of numerical measures over multiple hierarchical dimensions for massive datasets. However, many types of source data, e.g., from GPS, sensors, and other measurement devices, are intrinsically inaccurate (imprecise and/or uncertain), and thus OLAP cannot be readily applied. In this paper, we address the resulting data veracity problem in OLAP by proposing the concept of probabilistic data cubes. Such a cube comprises a set of probabilistic cuboids, which summarize the aggregated values in the form of probability mass functions (pmfs for short) and thus offer insights into the underlying data quality and enable confidence-aware query evaluation and analysis. However, the probabilistic nature of the data poses computational challenges, as even simple operations are #P-hard under the possible-world semantics. Even worse, it is hard to share computations among different cuboids, as aggregation functions that are distributive for traditional data cubes, e.g., SUM and COUNT, become holistic in probabilistic settings. In this paper, we propose a complete set of techniques for probabilistic data cubes, from cuboid aggregation, through cube materialization, to query evaluation. For aggregation, we focus on how to maximize the sharing of computation among cells and cuboids. We present two aggregation methods: convolution-based and sketch-based. The two methods reduce the time complexity of building a probabilistic cuboid to polynomial and linear, respectively. Each of the two supports both full and partial data cube materialization. Then, we devise a cost model that guides how the aggregation methods are deployed and combined during cube materialization. We further provide algorithms for probabilistic slicing and dicing queries on the data cube. Extensive experiments over real and synthetic datasets show that the techniques are effective and scalable.
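The abstract names convolution as one way to aggregate pmfs without enumerating possible worlds, but does not give the algorithm. The following is only a minimal sketch of the general idea, not the paper's method: it assumes the uncertain measures in a cell are independent, integer-valued random variables, each given as a small pmf. All function names and the sample readings are hypothetical. It computes the pmf of SUM by repeated pairwise convolution, checks it against brute-force possible-world enumeration, and answers one confidence-aware question on the result.

```python
from collections import defaultdict
from itertools import product


def convolve_pmfs(p, q):
    """Return the pmf of X + Y for independent X ~ p and Y ~ q.

    Each pmf is a dict mapping a value to its probability.
    """
    out = defaultdict(float)
    for x, px in p.items():
        for y, py in q.items():
            out[x + y] += px * py
    return dict(out)


def sum_pmf_by_convolution(pmfs):
    """pmf of the SUM of independent measures, built one convolution at a time."""
    result = {0: 1.0}  # pmf of the empty sum
    for pmf in pmfs:
        result = convolve_pmfs(result, pmf)
    return result


def sum_pmf_by_possible_worlds(pmfs):
    """Same pmf by enumerating every possible world (exponential in len(pmfs))."""
    out = defaultdict(float)
    for world in product(*(pmf.items() for pmf in pmfs)):
        total = sum(value for value, _ in world)
        prob = 1.0
        for _, p in world:
            prob *= p
        out[total] += prob
    return dict(out)


def prob_at_least(pmf, threshold):
    """Confidence that the aggregate is >= threshold."""
    return sum(p for value, p in pmf.items() if value >= threshold)


# Hypothetical uncertain readings that fall into one cell of a cuboid.
readings = [
    {1: 0.5, 2: 0.5},
    {0: 0.2, 3: 0.8},
    {2: 1.0},
]

cell_pmf = sum_pmf_by_convolution(readings)
worlds_pmf = sum_pmf_by_possible_worlds(readings)
confidence = prob_at_least(cell_pmf, 6)
```

Under the independence assumption, the convolved pmf of a cell can itself be convolved with the pmfs of sibling cells to obtain a coarser cell, which is one way to picture sharing computation between cuboids. Correlated measures would need a different treatment.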


