Query optimization using column statistics in Hive

Proceedings. International Database Engineering and Applications Symposium. Pub Date: 2011-09-21. DOI: 10.1145/2076623.2076636

Anja Gruenheid, E. Omiecinski, L. Mark

Pages: 97-105. URL: https://doi.org/10.1145/2076623.2076636

Citation count: 34

Abstract

Hive is a data warehousing solution built on top of the Hadoop MapReduce framework. It is designed to handle large amounts of data and to store them in tables, like a relational database management system or a conventional data warehouse, while using the parallelization and batch processing capabilities of Hadoop MapReduce to speed up query execution. Data inserted into Hive is stored in the Hadoop Distributed File System (HDFS), which is part of the Hadoop framework. To make the data accessible to the user, Hive provides a query language similar to SQL, called HiveQL. When a query is issued in HiveQL, a parser translates it into a query execution plan, which is optimized and then turned into a series of map and reduce iterations. These iterations are executed on the data stored in HDFS, and the output is written to a file.

The goal of this work is to develop an approach for improving the performance of HiveQL queries executed in the Hive framework. For that purpose, we introduce an extension to the Hive MetaStore that stores metadata extracted at the column level of the user database. These column-level statistics are then used, for example, in combination with join ordering algorithms adapted to the specific needs of the Hadoop MapReduce environment, to improve the overall performance of HiveQL query execution.
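To make the role of column-level statistics in join ordering concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes that each table carries a row count and a per-column number of distinct values, that joins are natural joins on shared column names, and that the textbook estimate |R ⋈ S| ≈ |R|·|S| / max(V(R,a), V(S,a)) is acceptable. The names `Relation`, `join`, and `best_left_deep_order` are hypothetical, and the cost (sum of estimated intermediate result sizes) ignores MapReduce-specific factors such as the number of jobs or shuffle volume.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class Relation:
    """Hypothetical statistics record; not the paper's MetaStore schema."""
    name: str
    rows: float
    ndv: dict  # column name -> estimated number of distinct values


def join(left, right):
    """Estimate the natural join of two relations on their shared columns.

    Returns None when no column is shared (a cross product, which is skipped).
    """
    shared = set(left.ndv) & set(right.ndv)
    if not shared:
        return None
    rows = left.rows * right.rows
    for col in shared:
        # Textbook estimate: divide by the larger distinct-value count.
        rows /= max(left.ndv[col], right.ndv[col])
    ndv = {**left.ndv, **right.ndv}
    for col in shared:
        ndv[col] = min(left.ndv[col], right.ndv[col])
    # A column cannot have more distinct values than the result has rows.
    ndv = {c: min(v, max(rows, 1.0)) for c, v in ndv.items()}
    return Relation(f"({left.name} JOIN {right.name})", rows, ndv)


def best_left_deep_order(relations):
    """Exhaustively search left-deep orders; cost = sum of intermediate sizes."""
    best_cost, best_order = float("inf"), None
    for order in permutations(relations):
        current, cost = order[0], 0.0
        for nxt in order[1:]:
            current = join(current, nxt)
            if current is None:
                break  # this order would need a cross product
            cost += current.rows
        else:
            if cost < best_cost:
                best_cost, best_order = cost, [r.name for r in order]
    return best_order, best_cost


# Made-up statistics for illustration only.
orders = Relation("orders", 1_000_000, {"customer_id": 50_000, "product_id": 1_000})
customers = Relation("customers", 500, {"customer_id": 500})  # e.g. after a filter
products = Relation("products", 1_000, {"product_id": 1_000})

order, cost = best_left_deep_order([orders, customers, products])
print(order, cost)
```

With these made-up numbers, joining `orders` with the small `customers` input first keeps the intermediate result at an estimated 10,000 rows, whereas joining `orders` with `products` first produces an estimated 1,000,000 rows. Exhaustive enumeration is only practical for a handful of tables; a real optimizer would use dynamic programming or a greedy heuristic.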

