Peeking into the optimization of data flow programs with MapReduce-style UDFs

2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544927

Fabian Hueske, Mathias Peters, Aljoscha Krettek, M. Ringwald, K. Tzoumas, V. Markl, J. Freytag

Citations: 32

Abstract

Data flows are a popular abstraction to define data-intensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMSs, MapReduce-style UDFs have less strict templates. These templates alone do not provide all the information needed to decide whether the UDFs can be reordered with relational operators and other UDFs. However, it is well known that reordering operators such as filters, joins, and aggregations can yield runtime improvements by orders of magnitude. We demonstrate an optimizer for data flows that is able to reorder operators with MapReduce-style UDFs written in an imperative language. Our approach leverages static code analysis to extract information from UDFs that is used to reason about the reorderability of UDF operators. This information is sufficient to enumerate a large fraction of the search space covered by conventional RDBMS optimizers, including filter and aggregation push-down, bushy join orders, and choice of physical execution strategies based on interesting properties. We demonstrate our optimizer and a job submission client that allows users to peek step by step into each phase of the optimization process: the static code analysis of UDFs, the enumeration of reordered candidate data flows, the generation of physical execution plans, and their parallel execution. For the demonstration, we provide a selection of relational and non-relational data flow programs that highlight the salient features of our approach.
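To make the idea of reasoning about reorderability more concrete, the following is a minimal sketch and not the optimizer described in the abstract. It assumes that static code analysis yields, for each UDF, the set of fields it reads and the set of fields it writes, and it applies a deliberately conservative rule: two adjacent record-at-a-time operators may be swapped only if neither writes a field that the other reads or writes. The names `UdfProperties`, `can_reorder`, and the example UDFs are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UdfProperties:
    """Hypothetical summary of a UDF, as a static analysis might produce it."""
    name: str
    read_set: frozenset   # fields the UDF reads
    write_set: frozenset  # fields the UDF modifies or creates


def can_reorder(a: UdfProperties, b: UdfProperties) -> bool:
    """Conservative check (an assumption of this sketch): the two operators
    may be swapped only if there is no read-write or write-write conflict."""
    return (
        a.write_set.isdisjoint(b.read_set)
        and b.write_set.isdisjoint(a.read_set)
        and a.write_set.isdisjoint(b.write_set)
    )


# A filter that only inspects "price" and writes nothing.
price_filter = UdfProperties("price_filter", frozenset({"price"}), frozenset())
# A map that reads "title" and adds a new "category" field.
tag_category = UdfProperties(
    "tag_category", frozenset({"title"}), frozenset({"category"})
)
# A map that reads and overwrites "price".
apply_discount = UdfProperties(
    "apply_discount", frozenset({"price"}), frozenset({"price"})
)

print(can_reorder(price_filter, tag_category))    # True: disjoint fields
print(can_reorder(price_filter, apply_discount))  # False: conflict on "price"
```

Under this rule the filter can be pushed below `tag_category`, which is the kind of filter push-down the abstract mentions, while it must stay in place relative to `apply_discount` because the filter's outcome depends on the field that map rewrites.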


