Revisiting aggregation techniques for big data

International Workshop on Data Warehousing and OLAP. Pub Date: 2013-10-28. DOI: 10.1145/2513190.2517827

V. Tsotras

Citations: 1

Abstract

In this talk we first present an introduction to AsterixDB [1], a parallel, semistructured platform to ingest, store, index, query, analyze, and publish "big data" (http://asterixdb.ics.uci.edu), and the various challenges we addressed while building it. AsterixDB combines ideas from semistructured data management, parallel database systems, and first-generation data-intensive computing platforms (MapReduce and Hadoop). The full AsterixDB software stack provides support for big data applications from the storage and processing engine (Hyracks [2], available at http://hyracks.googlecode.com), to the flexible query optimization layer (Algebricks), to the interfaces for user-level interaction (AQL, HiveQL, Pregelix, etc.). Hyracks is a partitioned-parallel engine for data-intensive computing jobs in the form of DAGs. Algebricks is a model-agnostic, algebraic layer for compiling and optimizing parallel queries to be processed by Hyracks. Queries for AsterixDB can be expressed either in popular higher-level data analysis languages like Pig, Hive, or Jaql, or in its native query language (AQL) and data model (ADM), with support for semistructured information and fuzzy data.

Fundamental data processing operations, like joins and aggregations, are natively supported in AsterixDB. The second part of the talk focuses on our experiences while designing efficient local (per-node) aggregation algorithms for AsterixDB. In particular, there are two challenges for local aggregations in a big data system: first, if the aggregation is group-based (like the "group-by" in SQL), the aggregation result may not fit in main memory; second, to allow multiple operations to be processed simultaneously, an aggregation operation should work within a strict memory budget provided by the platform. Despite their importance and challenges, the design and evaluation of local aggregation algorithms have not received the same level of attention in the literature as other basic operators, such as joins. Facing a lack of "off the shelf" local aggregation algorithms for big data, we present low-level implementation details for engineering the aggregation operator, utilizing (i) sort-based, (ii) hash-based, and (iii) sort-hash-hybrid approaches. We present six algorithms, all of which work within a strictly bounded memory budget and can easily adapt between in-memory and external processing. Among them, two are novel and four are based on extending existing join algorithms.

We deployed all algorithms as operators in the Hyracks platform and evaluated their performance through extensive experimentation. Our experiments cover many different performance factors, including input cardinality, memory, data distribution, and hash table structure. Our study guided our selection of the local aggregation algorithms supported in the recent release of AsterixDB, namely: the hybrid-hash Pre-Partitioning algorithm, for its tolerance of errors in the estimated cardinality of the input grouping keys; the Hash-Sort algorithm, for its good performance when aggregating skewed data; and the Sort-Based algorithm, for input data that is already sorted. This local aggregation work is the first part of a two-part big data aggregation study, as it addresses the "map" phase. Our findings provide the foundation for the global aggregation strategy we are currently investigating for the "reduce" phase. We hope our experience can help developers of other big data platforms build a solid local aggregation operator.
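To make the two challenges above concrete (a group-by result that may not fit in memory, and a strict per-operator memory budget), the following is a minimal Python sketch of a hash-based aggregation that spills to disk once its budget is reached. It is an illustration only and is not one of the six algorithms from the talk: the function name `bounded_hash_sum`, the `max_groups` parameter (a stand-in for a byte-level memory budget), and the use of `pickle` with temporary files are all assumptions made for this example.

```python
import pickle
import tempfile


def _read_spill(f):
    """Yield the (key, value) records previously pickled into spill file f."""
    f.seek(0)
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def bounded_hash_sum(records, max_groups, num_partitions=4, depth=0):
    """Yield (key, sum) pairs, keeping at most max_groups keys in memory.

    records:        iterable of (key, value) pairs with hashable keys
    max_groups:     hypothetical stand-in for the platform's memory budget
    num_partitions: number of on-disk partitions used for overflow records
    """
    if max_groups < 1:
        raise ValueError("max_groups must be at least 1")

    table = {}                            # resident groups and their partial sums
    spills = [None] * num_partitions      # spill files, created on demand

    for key, value in records:
        if key in table:
            table[key] += value           # resident group: aggregate in place
        elif len(table) < max_groups:
            table[key] = value            # budget not yet reached: admit group
        else:
            # Budget reached: route the raw record to a disk partition.
            # Mixing in depth changes the partitioning at each recursion level.
            p = hash((key, depth)) % num_partitions
            if spills[p] is None:
                spills[p] = tempfile.TemporaryFile()
            pickle.dump((key, value), spills[p])

    # Resident groups are complete: once the table is full no new key is
    # admitted, so a resident key is never also present in a spill file.
    yield from table.items()
    table.clear()

    # Aggregate each spilled partition recursively under the same budget.
    for f in spills:
        if f is not None:
            with f:
                yield from bounded_hash_sum(
                    _read_spill(f), max_groups, num_partitions, depth + 1
                )
```

For example, `dict(bounded_hash_sum([("a", 1), ("b", 2), ("a", 3)], max_groups=1))` gives `{"a": 4, "b": 2}`: key `"a"` stays resident and is aggregated in memory, while the record for `"b"` is spilled and aggregated in a second pass. A production operator would budget bytes rather than group counts and would write spill data in pages rather than one pickled record at a time.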
