Peeking into the optimization of data flow programs with MapReduce-style UDFs

Fabian Hueske, Mathias Peters, Aljoscha Krettek, M. Ringwald, K. Tzoumas, V. Markl, J. Freytag
Published in: 2013 IEEE 29th International Conference on Data Engineering (ICDE)
Publication date: 2013-04-08
DOI: 10.1109/ICDE.2013.6544927 (https://doi.org/10.1109/ICDE.2013.6544927)
Citations: 32

Abstract

Data flows are a popular abstraction for defining data-intensive processing tasks. To support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMSs, MapReduce-style UDFs have less strict templates. These templates alone do not provide all the information needed to decide whether a UDF can be reordered with relational operators and other UDFs. However, it is well known that reordering operators such as filters, joins, and aggregations can yield runtime improvements of orders of magnitude. We demonstrate an optimizer for data flows that is able to reorder operators with MapReduce-style UDFs written in an imperative language. Our approach leverages static code analysis to extract information from UDFs, which is then used to reason about the reorderability of UDF operators. This information is sufficient to enumerate a large fraction of the search space covered by conventional RDBMS optimizers, including filter and aggregation push-down, bushy join orders, and the choice of physical execution strategies based on interesting properties. We demonstrate our optimizer and a job submission client that allows users to peek step by step into each phase of the optimization process: the static code analysis of UDFs, the enumeration of reordered candidate data flows, the generation of physical execution plans, and their parallel execution. For the demonstration, we provide a selection of relational and non-relational data flow programs that highlight the salient features of our approach.
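
The abstract does not spell out how the extracted information is used to decide reorderability. As a rough illustration only, the sketch below assumes that static analysis yields, for each UDF, the set of record fields it reads and the set it writes, and that two record-at-a-time UDFs may be swapped when those sets do not conflict. The names `UdfProperties` and `can_reorder`, the field names, and the conflict test itself are hypothetical and not taken from the paper; keyed operators such as Reduce would need further conditions that are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UdfProperties:
    """Hypothetical summary that a static code analysis might extract from a UDF."""
    name: str
    read_set: frozenset   # fields the UDF reads
    write_set: frozenset  # fields the UDF modifies or adds


def can_reorder(a: UdfProperties, b: UdfProperties) -> bool:
    """Conservative test (an assumption, not the paper's stated rule):
    two record-at-a-time UDFs may be swapped if neither writes a field
    that the other one reads or writes."""
    return (
        a.write_set.isdisjoint(b.read_set)
        and b.write_set.isdisjoint(a.read_set)
        and a.write_set.isdisjoint(b.write_set)
    )


# Illustrative UDF summaries; all names and fields are made up.
filter_by_price = UdfProperties(
    "filter_by_price", frozenset({"price"}), frozenset()
)
add_tax = UdfProperties(
    "add_tax", frozenset({"price"}), frozenset({"price_with_tax"})
)
normalize_price = UdfProperties(
    "normalize_price", frozenset({"price"}), frozenset({"price"})
)

# The filter only reads "price" and add_tax writes a different field,
# so the filter could be pushed below add_tax.
print(can_reorder(filter_by_price, add_tax))          # True

# normalize_price overwrites the field the filter reads, so swapping
# the two could change which records pass the filter.
print(can_reorder(filter_by_price, normalize_price))  # False
```

A test of this kind is deliberately conservative: it may reject reorderings that are in fact safe, but under the stated assumptions it does not accept one in which a UDF would observe a field value changed by the other.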