
Latest publications: 2014 IEEE 30th International Conference on Data Engineering

Distributed and interactive cube exploration
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816674
N. Kamat, Prasanth Jayachandran, Karthik Tunga, Arnab Nandi
Interactive ad-hoc analytics over large datasets has become an increasingly popular use case. We detail the challenges encountered when building a distributed system that allows the interactive exploration of a data cube. We introduce DICE, a distributed system that uses a novel session-oriented model for data cube exploration, designed to provide the user with interactive sub-second latencies for specified accuracy levels. A novel framework is provided that combines three concepts: faceted exploration of data cubes, speculative execution of queries and query execution over subsets of data. We discuss design considerations, implementation details and optimizations of our system. Experiments demonstrate that DICE provides a sub-second interactive cube exploration experience at the billion-tuple scale that is at least 33% faster than current approaches.
Citations: 139
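As a rough illustration of two of the three concepts the abstract combines, speculative execution of likely follow-up queries and query execution over data subsets, here is a minimal Python sketch over an in-memory table. The helper names and the drill-down-by-one-dimension session model are assumptions for illustration, not DICE's distributed implementation.

```python
# A minimal sketch of session-oriented, approximate cube exploration,
# assuming an in-memory list of dicts stands in for the distributed store.
import random
from collections import defaultdict

def aggregate(rows, dims, measure):
    """Group rows by `dims` and sum `measure` (a toy cube query)."""
    out = defaultdict(float)
    for r in rows:
        out[tuple(r[d] for d in dims)] += r[measure]
    return dict(out)

def run_on_sample(rows, dims, measure, fraction):
    """Answer approximately on a data subset, scaling sums back up."""
    sample = random.sample(rows, max(1, int(len(rows) * fraction)))
    approx = aggregate(sample, dims, measure)
    return {k: v / fraction for k, v in approx.items()}

def speculate_followups(current_dims, all_dims):
    """Facet-style session model: likely next queries drill down by one dimension."""
    return [current_dims + (d,) for d in all_dims if d not in current_dims]

# Usage: answer the current query on a 1% sample, then warm a cache for the
# speculated follow-ups before the user actually issues them.
rows = [{"region": random.choice("NSEW"), "year": random.choice([2013, 2014]),
         "product": random.choice("AB"), "sales": random.random()}
        for _ in range(100_000)]
current = ("region",)
print(run_on_sample(rows, current, "sales", fraction=0.01))
cache = {q: run_on_sample(rows, q, "sales", 0.01)
         for q in speculate_followups(current, ("region", "year", "product"))}
```

The design point the sketch mirrors is the accuracy/latency trade-off: answering on a small sample and scaling up keeps responses interactive while speculated follow-ups are computed in the background.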
We can learn your #hashtags: Connecting tweets to explicit topics
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816706
W. Feng, Jianyong Wang
In Twitter, users can annotate tweets with hashtags to indicate the ongoing topics. Hashtags provide users a convenient way to categorize tweets. From the system's perspective, hashtags play an important role in tweet retrieval, event detection, topic tracking, advertising, and more. Annotating tweets with the right hashtags can lead to a better user experience. However, two problems remain unsolved during annotation: (1) Before the user decides to create a new hashtag, is there any way to help her/him find out whether some related hashtags have already been created and widely used? (2) Different users may have different preferences for categorizing tweets, yet little work has been done to study the personalization issue in hashtag recommendation. To address the above problems, we propose a statistical model for personalized hashtag recommendation in this paper. With millions of <tweet, hashtag> pairs being published every day, we are able to learn the complex mappings from tweets to hashtags with the wisdom of the crowd. Two questions are answered in the model: (1) Unlike traditional item recommendation data, users and tweets in Twitter have rich auxiliary information such as URLs, mentions, locations, and social relations. How can we incorporate these features for hashtag recommendation? (2) Different hashtags have different temporal characteristics. Hashtags related to breaking events in the physical world show a strong rise-and-fall temporal pattern, while other hashtags remain stable in the system. How can we incorporate hashtag-related features to serve hashtag recommendation? With all the above factors considered, we show that our model successfully outperforms existing methods on real datasets crawled from Twitter.
Citations: 29
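To make the learning setup concrete, here is a minimal naive Bayes-style sketch that scores hashtags by tweet content plus a per-user preference prior, trained from <tweet, hashtag> pairs. The class, its crude add-one smoothing, and the toy data are assumptions for illustration; the paper's statistical model also incorporates the auxiliary features and temporal characteristics the abstract describes.

```python
# A minimal sketch of personalized hashtag scoring from <tweet, hashtag> pairs.
import math
from collections import Counter, defaultdict

class HashtagRecommender:
    def __init__(self):
        self.tag_word = defaultdict(Counter)   # word counts per hashtag
        self.user_tag = defaultdict(Counter)   # personalization: tag counts per user
        self.tag_count = Counter()             # hashtag popularity

    def train(self, user, words, tag):
        self.tag_word[tag].update(words)
        self.user_tag[user][tag] += 1
        self.tag_count[tag] += 1

    def recommend(self, user, words, k=3):
        scores = {}
        for tag, n in self.tag_count.items():
            total = sum(self.tag_word[tag].values())
            s = math.log(n)                     # global popularity prior
            for w in words:                     # content likelihood, add-one smoothed
                s += math.log((self.tag_word[tag][w] + 1) / (total + 1))
            s += math.log(self.user_tag[user][tag] + 1)   # user preference prior
            scores[tag] = s
        return sorted(scores, key=scores.get, reverse=True)[:k]

rec = HashtagRecommender()
rec.train("alice", ["world", "cup", "goal"], "#worldcup")
rec.train("alice", ["new", "phone", "launch"], "#tech")
rec.train("bob", ["goal", "match"], "#football")
print(rec.recommend("alice", ["goal", "match"]))
```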
On masking topical intent in keyword search
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816656
Peng Wang, C. Ravishankar
Text-based search queries reveal user intent to the search engine, compromising privacy. Topical Intent Obfuscation (TIO) is a promising new approach to preserving user privacy. TIO masks topical intent by mixing real user queries with dummy queries matching a variety of topics. Dummy queries are generated using a Dummy Query Generation Algorithm (DGA). We demonstrate various shortcomings in current TIO schemes, and show how to correct them. Current schemes assume that DGA details are unknown to the adversary. We argue that this is a flawed assumption, and show how DGA details can be used to construct efficient attacks on TIO schemes, using an iterative DGA as an example. Our extensive experiments on real data sets show that our attacks can flag up to 80% of dummy queries. We also propose HDGA, a new DGA that we prove to be immune to the attacks based on DGA semantics that we describe.
Citations: 12
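A minimal sketch of the TIO mechanism, assuming fixed topic pools and a deliberately trivial dummy generator: each real query is submitted inside a shuffled batch of dummy queries drawn from other topics, so the engine cannot isolate the topical intent. The paper's DGAs, and the attacks that exploit knowledge of them, are far more sophisticated than this stand-in.

```python
# A minimal sketch of topical intent obfuscation via dummy queries.
import random

TOPIC_TERMS = {
    "sports":  ["playoff schedule", "transfer rumors", "marathon training"],
    "finance": ["bond yields", "index funds", "mortgage rates"],
    "travel":  ["visa requirements", "cheap flights", "packing list"],
}

def dummy_queries(real_topic, n):
    """Sample n dummy queries from topics other than the real one."""
    other = [t for t in TOPIC_TERMS if t != real_topic]
    return [random.choice(TOPIC_TERMS[random.choice(other)]) for _ in range(n)]

def obfuscated_batch(real_query, real_topic, n_dummies=3):
    """Mix the real query into dummies; the engine sees a random order."""
    batch = dummy_queries(real_topic, n_dummies) + [real_query]
    random.shuffle(batch)
    return batch

print(obfuscated_batch("mortgage refinancing options", "finance"))
```

The attack surface the paper studies follows directly from this design: if the adversary knows how the generator picks its terms, statistical regularities let it flag the dummies and recover the real query.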
The Vertica Query Optimizer: The case for specialized query optimizers
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816727
Nga Tran, Andrew Lamb, Lakshmikant Shrinivas, Sreenath Bodagala, J. Dave
The Vertica SQL Query Optimizer was written from the ground up for the Vertica Analytic Database. Its design and the tradeoffs we encountered during its implementation argue that the full power of novel database systems can only be realized with a carefully crafted custom Query Optimizer written specifically for the system in which it operates.
Citations: 7
GLog: A high level graph analysis system using MapReduce
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816680
Jun Gao, Jiashuai Zhou, Chang Zhou, J. Yu
With the rapid growth of graphs in different applications, it is inevitable to leverage existing distributed data processing frameworks in managing large graphs. Although these frameworks ease development costs, it is still cumbersome and error-prone for developers to implement complex graph analysis tasks in distributed environments. Additionally, developers have to learn the details of these frameworks quite well, which is key to improving the performance of distributed jobs. This paper introduces a high level query language called GLog and proposes its evaluation method to overcome these limitations. Specifically, we first design an RG (Relational-Graph) data model to mix relational data and graph data, and extend Datalog to GLog on RG tables to support various graph analysis tasks. Second, we define operations on RG tables, and show translation templates that convert a GLog query into a sequence of MapReduce jobs. Third, we propose two strategies, namely rule merging and iteration rewriting, to optimize the translated jobs. The final experiments show that GLog not only expresses various graph analysis tasks more succinctly, but also achieves better performance on most of them than Pig, another high level dataflow system.
Citations: 17
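The flavor of translating a rule into a MapReduce job can be sketched with a single Datalog-style rule, twohop(X, Z) :- edge(X, Y), edge(Y, Z), evaluated as one map/shuffle/reduce round keyed on the join variable Y. The tiny in-memory map_reduce driver and the list-of-tuples stand-in for an RG table are assumptions for illustration, not GLog's actual translation templates.

```python
# A minimal sketch of evaluating one Datalog-style rule as a MapReduce round.
from itertools import groupby

def map_reduce(records, mapper, reducer):
    """Toy MapReduce driver: map, sort by key (shuffle), reduce per group."""
    pairs = [kv for rec in records for kv in mapper(rec)]
    pairs.sort(key=lambda kv: kv[0])
    return [out for k, grp in groupby(pairs, key=lambda kv: kv[0])
            for out in reducer(k, [v for _, v in grp])]

# Rule: twohop(X, Z) :- edge(X, Y), edge(Y, Z).
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]

def mapper(edge):
    x, y = edge
    yield (y, ("left", x))    # edge serves as edge(X, Y), keyed by Y
    yield (x, ("right", y))   # edge serves as edge(Y, Z), keyed by Y

def reducer(y, tagged):
    lefts = [v for side, v in tagged if side == "left"]
    rights = [v for side, v in tagged if side == "right"]
    for x in lefts:           # join the two sides that met at key Y
        for z in rights:
            yield (x, z)

print(map_reduce(edges, mapper, reducer))   # [('a', 'c'), ('a', 'd'), ('b', 'd')]
```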
Automatic generation of question answer pairs from noisy case logs
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816671
J. Ajmera, Sachindra Joshi, Ashish Verma, Amol Mittal
In a customer support scenario, a lot of valuable information is recorded in the form of 'case logs'. Case logs are primarily written for future reference or manual inspection, and are therefore written in a hasty manner and are very noisy. In this paper, we propose techniques that exploit these case logs to mine real customer concerns or problems and then map them to well-written knowledge articles for that enterprise. This mapping results in the generation of question-answer (QA) pairs. These QA pairs can be used for a variety of applications, such as dynamically updating the frequently-asked-questions (FAQs), updating the knowledge repository, etc. In this paper we show the utility of these discovered QA pairs as training data for a question-answering system. Our approach for mining the case logs is based on a composite model consisting of two generative models, viz. a hidden Markov model (HMM) and a latent Dirichlet allocation (LDA) model. The LDA model explains the long-range dependencies across words due to their semantic similarity, and the HMM models the sequential patterns present in these case logs. Such processing results in crisp 'problem statement' segments which are indicative of the real customer concerns. Our experiments show that this approach finds crisp problem statements in 56% of the cases and outperforms other segmentation methods such as HMM, LDA and conditional random fields (CRF). After finding these crisp problem statements, appropriate answers are looked up from an existing knowledge repository index, forming candidate QA pairs. We show that considering only the problem-statement segments for which answers can be found further improves the segmentation performance to 82%. Finally, we show that when these QA pairs are used as training data, the performance of a question-answering system can be improved significantly.
Citations: 6
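The segmentation idea can be illustrated with a two-state HMM decoded by Viterbi that labels tokens of a noisy log as problem or noise. The hand-set transition and emission probabilities and the keyword list are assumptions for illustration; the paper learns a composite HMM+LDA model rather than using fixed scores.

```python
# A minimal sketch of segmenting a noisy case log with a two-state HMM.
import math

STATES = ("problem", "noise")
TRANS = {"problem": {"problem": 0.7, "noise": 0.3},   # states are "sticky",
         "noise":   {"problem": 0.2, "noise": 0.8}}   # encouraging contiguous segments
PROBLEM_WORDS = {"error", "fail", "cannot", "crash", "issue"}

def emission(state, token):
    """Fixed keyword-based emission scores (the paper learns these instead)."""
    hit = token.lower().strip(".,") in PROBLEM_WORDS
    if state == "problem":
        return 0.5 if hit else 0.1
    return 0.05 if hit else 0.3

def viterbi(tokens):
    v = [{s: math.log(0.5) + math.log(emission(s, tokens[0])) for s in STATES}]
    back = []
    for t in tokens[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            row[s] = v[-1][prev] + math.log(TRANS[prev][s]) + math.log(emission(s, t))
            ptr[s] = prev
        v.append(row); back.append(ptr)
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):          # follow back-pointers to recover the path
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

log = "customer called again app crash on login cannot reset password ticket closed".split()
print(list(zip(log, viterbi(log))))     # contiguous 'problem' span around the real issue
```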
SILVERBACK: Scalable association mining for temporal data in columnar probabilistic databases
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816724
Yusheng Xie, Diana Palsetia, Goce Trajcevski, Ankit Agrawal, A. Choudhary
We address the problem of large scale probabilistic association rule mining and consider the trade-offs between the accuracy of the mining results and the quest for scalability on modest hardware infrastructure. We demonstrate how extensions and adaptations of research findings can be integrated into an industrial application, and we present the commercially deployed SILVERBACK framework, developed at Voxsup Inc. SILVERBACK tackles the storage efficiency problem by proposing a probabilistic columnar infrastructure and using Bloom filters and reservoir sampling techniques. In addition, a probabilistic pruning technique based on Apriori has been introduced for mining frequent item-sets. The proposed target-driven technique yields a significant reduction in the size of the frequent item-set candidates. We present extensive experimental evaluations which demonstrate the benefits of a context-aware incorporation of infrastructure limitations into corresponding research techniques. The experiments indicate that, compared to the traditional Hadoop-based approach of improving scalability by adding more hosts, SILVERBACK - commercially deployed and developed at Voxsup Inc. since May 2011 - has much better run-time performance with negligible sacrifices in accuracy.
Citations: 9
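The two storage-efficiency building blocks the abstract names, Bloom filters and reservoir sampling, are standard and can be sketched compactly; the sizes, hash construction, and API below are illustrative choices, not SILVERBACK's parameters.

```python
# Minimal sketches of a Bloom filter (probabilistic membership) and
# reservoir sampling (bounded-memory uniform samples of a stream).
import hashlib
import random

class BloomFilter:
    def __init__(self, m_bits=1 << 16, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests (illustrative choice).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May return false positives, never false negatives.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def reservoir_sample(stream, k):
    """Keep a uniform k-sample of an unbounded stream in O(k) memory."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = random.randint(0, i)   # replace an element with probability k/(i+1)
            if j < k:
                sample[j] = x
    return sample

bf = BloomFilter()
bf.add("user:42")
print("user:42" in bf, "user:99" in bf)
print(reservoir_sample(range(1_000_000), 5))
```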
Data quality: The other face of Big Data
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816764
B. Saha, D. Srivastava
In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth 'V' of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three 'V's, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the "data speak for itself" in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
Citations: 217
Locality-sensitive operators for parallel main-memory database clusters
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816684
Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner, Angelika Reiser, A. Kemper, Thomas Neumann
The growth in compute speed has outpaced the growth in network bandwidth over the last decades. This has led to an increasing performance gap between local and distributed processing. A parallel database cluster thus has to maximize the locality of query processing. A common technique to this end is to co-partition relations to avoid expensive data shuffling across the network. However, co-partitioning is limited to one attribute per relation and is expensive to maintain in the face of updates. Other attributes often exhibit a fuzzy co-location due to correlations with the distribution key, but current approaches do not leverage this. In this paper, we introduce locality-sensitive data shuffling, which can dramatically reduce the amount of network communication for distributed operators such as join and aggregation. We present four novel techniques: (i) optimal partition assignment exploits locality to reduce the network phase duration; (ii) communication scheduling avoids bandwidth underutilization due to cross traffic; (iii) adaptive radix partitioning retains locality during data repartitioning and handles value skew gracefully; and (iv) selective broadcast reduces network communication in the presence of extreme value skew or large numbers of duplicates. We present comprehensive experimental results, which show that our techniques can improve performance by up to a factor of 5 for fuzzy co-location and a factor of 3 for inputs with value skew.
Citations: 61
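Technique (i) can be sketched as follows, assuming a per-partition histogram of where tuples currently reside: assign each partition to a node that already holds many of its tuples, under a simple per-node quota, so that few tuples must be shipped. The greedy heuristic here is an illustrative stand-in for the paper's optimal assignment.

```python
# A minimal sketch of locality-aware partition assignment.

def assign_partitions(hist, n_nodes):
    """hist[p][n] = number of tuples of partition p currently on node n."""
    n_parts = len(hist)
    quota = -(-n_parts // n_nodes)          # ceil: balance partitions per node
    load = [0] * n_nodes
    assignment = {}
    # Place the partitions with the largest locality gain first.
    order = sorted(range(n_parts), key=lambda p: max(hist[p]), reverse=True)
    for p in order:
        nodes = sorted(range(n_nodes), key=lambda n: hist[p][n], reverse=True)
        target = next(n for n in nodes if load[n] < quota)
        assignment[p] = target
        load[target] += 1
    return assignment

# Fuzzy co-location: partition 0 lives mostly on node 1, partition 1 on node 0, ...
hist = [[10, 90], [80, 20], [55, 45], [30, 70]]
assignment = assign_partitions(hist, n_nodes=2)
shipped = sum(sum(hist[p]) - hist[p][n] for p, n in assignment.items())
print(assignment, "tuples shipped:", shipped)
```

Compared with hashing partitions to nodes blindly, exploiting the histogram keeps most tuples where they already are, which is exactly the network saving fuzzy co-location makes possible.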
History-aware query optimization with materialized intermediate views
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816678
L. Perez, C. Jermaine
The use of materialized views derived from the intermediate results of frequently executed queries is a popular strategy for improving performance in query workloads. Optimizers capable of matching such views with inbound queries can generate alternative execution plans that read the materialized contents directly instead of re-computing the corresponding subqueries, which tends to reduce query execution times. In this paper, we introduce an architecture called Hawc that extends a cost-based logical optimizer with the capability to use history information to identify query plans whose intermediate result sets, if materialized as views, have the potential to reduce the execution time of future queries. We present techniques for using knowledge of past queries to help the optimizer match, generate and select useful materialized views. Experimental results indicate that these techniques provide substantial improvements in workload execution time.
Citations: 39
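The reuse mechanism can be sketched as a cache of materialized results keyed by a canonicalized subplan, with a rewriter that substitutes a view scan for any matching subtree of an inbound plan. The plan representation and canonicalization below are assumptions for illustration, not Hawc's actual matcher.

```python
# A minimal sketch of subplan matching against materialized intermediate views.
import json

def canonical_key(plan):
    """Deterministic, order-insensitive key for an (op, args, children) plan."""
    op, args, children = plan
    return json.dumps([op, sorted(args), sorted(canonical_key(c) for c in children)])

class ViewCache:
    def __init__(self):
        self.views = {}          # canonical subplan key -> materialized view name

    def record(self, plan, view_name):
        """Remember that this subplan's result was materialized."""
        self.views[canonical_key(plan)] = view_name

    def rewrite(self, plan):
        """Replace any cached subtree with a direct scan of its view."""
        key = canonical_key(plan)
        if key in self.views:
            return ("scan_view", [self.views[key]], [])
        op, args, children = plan
        return (op, args, [self.rewrite(c) for c in children])

scan = ("scan", ["orders"], [])
filt = ("filter", ["year=2014"], [scan])
cache = ViewCache()
cache.record(filt, "mv_orders_2014")            # materialized by a past query
query = ("group_by", ["region"], [filt])
print(cache.rewrite(query))   # the filter subtree is replaced by a view scan
```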