截止日期驱动的清管器项目性能建模与优化

IF 2.2 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE ACM Transactions on Autonomous and Adaptive Systems Pub Date : 2013-09-01 DOI:10.1145/2518017.2518019

Zhuoyao Zhang, L. Cherkasova, Abhishek Verma, B. T. Loo

{"title":"截止日期驱动的清管器项目性能建模与优化","authors":"Zhuoyao Zhang, L. Cherkasova, Abhishek Verma, B. T. Loo","doi":"10.1145/2518017.2518019","DOIUrl":null,"url":null,"abstract":"Many applications associated with live business intelligence are written as complex data analysis programs defined by directed acyclic graphs of MapReduce jobs, for example, using Pig, Hive, or Scope frameworks. An increasing number of these applications have additional requirements for completion time guarantees. In this article, we consider the popular Pig framework that provides a high-level SQL-like abstraction on top of MapReduce engine for processing large data sets. There is a lack of performance models and analysis tools for automated performance management of such MapReduce jobs. We offer a performance modeling environment for Pig programs that automatically profiles jobs from the past runs and aims to solve the following inter-related problems: (i) estimating the completion time of a Pig program as a function of allocated resources; (ii) estimating the amount of resources (a number of map and reduce slots) required for completing a Pig program with a given (soft) deadline. First, we design a basic performance model that accurately predicts completion time and required resource allocation for a Pig program that is defined as a sequence of MapReduce jobs: predicted completion times are within 10% of the measured ones. Second, we optimize a Pig program execution by enforcing the optimal schedule of its concurrent jobs. For DAGs with concurrent jobs, this optimization helps reducing the program completion time: 10%--27% in our experiments. Moreover, it eliminates possible nondeterminism of concurrent jobs’ execution in the Pig program, and therefore, enables a more accurate performance model for Pig programs. Third, based on these optimizations, we propose a refined performance model for Pig programs with concurrent jobs. The proposed approach leads to significant resource savings (20%--60% in our experiments) compared with the original, unoptimized solution. We validate our solution using a 66-node Hadoop cluster and a diverse set of workloads: PigMix benchmark, TPC-H queries, and customized queries mining a collection of HP Labs’ web proxy logs.","PeriodicalId":50919,"journal":{"name":"ACM Transactions on Autonomous and Adaptive Systems","volume":"71 1","pages":"14:1-14:28"},"PeriodicalIF":2.2000,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":"{\"title\":\"Performance Modeling and Optimization of Deadline-Driven Pig Programs\",\"authors\":\"Zhuoyao Zhang, L. Cherkasova, Abhishek Verma, B. T. Loo\",\"doi\":\"10.1145/2518017.2518019\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many applications associated with live business intelligence are written as complex data analysis programs defined by directed acyclic graphs of MapReduce jobs, for example, using Pig, Hive, or Scope frameworks. An increasing number of these applications have additional requirements for completion time guarantees. In this article, we consider the popular Pig framework that provides a high-level SQL-like abstraction on top of MapReduce engine for processing large data sets. There is a lack of performance models and analysis tools for automated performance management of such MapReduce jobs. We offer a performance modeling environment for Pig programs that automatically profiles jobs from the past runs and aims to solve the following inter-related problems: (i) estimating the completion time of a Pig program as a function of allocated resources; (ii) estimating the amount of resources (a number of map and reduce slots) required for completing a Pig program with a given (soft) deadline. First, we design a basic performance model that accurately predicts completion time and required resource allocation for a Pig program that is defined as a sequence of MapReduce jobs: predicted completion times are within 10% of the measured ones. Second, we optimize a Pig program execution by enforcing the optimal schedule of its concurrent jobs. For DAGs with concurrent jobs, this optimization helps reducing the program completion time: 10%--27% in our experiments. Moreover, it eliminates possible nondeterminism of concurrent jobs’ execution in the Pig program, and therefore, enables a more accurate performance model for Pig programs. Third, based on these optimizations, we propose a refined performance model for Pig programs with concurrent jobs. The proposed approach leads to significant resource savings (20%--60% in our experiments) compared with the original, unoptimized solution. We validate our solution using a 66-node Hadoop cluster and a diverse set of workloads: PigMix benchmark, TPC-H queries, and customized queries mining a collection of HP Labs’ web proxy logs.\",\"PeriodicalId\":50919,\"journal\":{\"name\":\"ACM Transactions on Autonomous and Adaptive Systems\",\"volume\":\"71 1\",\"pages\":\"14:1-14:28\"},\"PeriodicalIF\":2.2000,\"publicationDate\":\"2013-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"17\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Autonomous and Adaptive Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/2518017.2518019\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Autonomous and Adaptive Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/2518017.2518019","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 17

摘要

许多与实时商业智能相关的应用程序被编写为复杂的数据分析程序，由MapReduce作业的有向无环图定义，例如，使用Pig、Hive或Scope框架。越来越多的此类应用程序对完成时间保证有额外的要求。在本文中，我们考虑流行的Pig框架，它在MapReduce引擎之上提供类似sql的高级抽象，用于处理大型数据集。目前还缺乏对这类MapReduce作业进行自动化性能管理的性能模型和分析工具。我们为Pig程序提供了一个性能建模环境，可以自动分析过去运行的作业，旨在解决以下相互关联的问题:(i)估计Pig程序的完成时间作为分配资源的函数;(ii)估算在给定(软)截止日期内完成Pig程序所需的资源量(地图和减少槽的数量)。首先，我们设计了一个基本的性能模型，可以准确地预测一个Pig程序的完成时间和所需的资源分配，该程序被定义为一系列MapReduce作业:预测的完成时间在实际完成时间的10%以内。其次，我们通过执行并发作业的最佳调度来优化Pig程序的执行。对于具有并发作业的dag，此优化有助于减少程序完成时间:在我们的实验中减少了10%- 27%。此外，它消除了Pig程序中并发作业执行的不确定性，因此可以为Pig程序提供更准确的性能模型。第三，基于这些优化，我们提出了具有并发作业的Pig程序的改进性能模型。与原始的、未优化的解决方案相比，所提出的方法可以显著节省资源(在我们的实验中为20%- 60%)。我们使用一个66节点的Hadoop集群和一组不同的工作负载来验证我们的解决方案:PigMix基准测试，TPC-H查询，以及挖掘HP实验室web代理日志集合的自定义查询。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Performance Modeling and Optimization of Deadline-Driven Pig Programs

Many applications associated with live business intelligence are written as complex data analysis programs defined by directed acyclic graphs of MapReduce jobs, for example, using Pig, Hive, or Scope frameworks. An increasing number of these applications have additional requirements for completion time guarantees. In this article, we consider the popular Pig framework that provides a high-level SQL-like abstraction on top of MapReduce engine for processing large data sets. There is a lack of performance models and analysis tools for automated performance management of such MapReduce jobs. We offer a performance modeling environment for Pig programs that automatically profiles jobs from the past runs and aims to solve the following inter-related problems: (i) estimating the completion time of a Pig program as a function of allocated resources; (ii) estimating the amount of resources (a number of map and reduce slots) required for completing a Pig program with a given (soft) deadline. First, we design a basic performance model that accurately predicts completion time and required resource allocation for a Pig program that is defined as a sequence of MapReduce jobs: predicted completion times are within 10% of the measured ones. Second, we optimize a Pig program execution by enforcing the optimal schedule of its concurrent jobs. For DAGs with concurrent jobs, this optimization helps reducing the program completion time: 10%--27% in our experiments. Moreover, it eliminates possible nondeterminism of concurrent jobs’ execution in the Pig program, and therefore, enables a more accurate performance model for Pig programs. Third, based on these optimizations, we propose a refined performance model for Pig programs with concurrent jobs. The proposed approach leads to significant resource savings (20%--60% in our experiments) compared with the original, unoptimized solution. We validate our solution using a 66-node Hadoop cluster and a diverse set of workloads: PigMix benchmark, TPC-H queries, and customized queries mining a collection of HP Labs’ web proxy logs.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Autonomous and Adaptive Systems 工程技术-计算机：理论方法

CiteScore

4.80

自引率

7.40%

发文量

审稿时长

>12 weeks

期刊介绍： TAAS addresses research on autonomous and adaptive systems being undertaken by an increasingly interdisciplinary research community -- and provides a common platform under which this work can be published and disseminated. TAAS encourages contributions aimed at supporting the understanding, development, and control of such systems and of their behaviors. TAAS addresses research on autonomous and adaptive systems being undertaken by an increasingly interdisciplinary research community - and provides a common platform under which this work can be published and disseminated. TAAS encourages contributions aimed at supporting the understanding, development, and control of such systems and of their behaviors. Contributions are expected to be based on sound and innovative theoretical models, algorithms, engineering and programming techniques, infrastructures and systems, or technological and application experiences.