Tyrex: Size-Based Resource Allocation in MapReduce Frameworks

Bogdan Ghit, D. Epema
2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
DOI: 10.1109/CCGrid.2016.82
Published: 2016-05-16
Citations: 10

Abstract

Many large-scale data analytics infrastructures are employed for a wide variety of jobs, ranging from short interactive queries to large data analysis jobs that may take hours or even days to complete. As a consequence, data-processing frameworks like MapReduce may have workloads consisting of jobs with heavy-tailed processing requirements. With such workloads, short jobs may experience slowdowns that are an order of magnitude larger than those of large jobs, while users may expect slowdowns that are more in proportion to the job sizes. To address this problem of large job slowdown variability in MapReduce frameworks, we design a scheduling system called TYREX that is inspired by the well-known TAGS task assignment policy in distributed-server systems. In particular, TYREX partitions the resources of a MapReduce framework, allowing any job running in any partition to read data stored on any machine, imposes runtime limits in the partitions, and successively executes parts of jobs in a work-conserving way in these partitions until they can run to completion. We develop a statistical model for dynamically setting the runtime limits that achieves near-optimal job slowdown performance, and we empirically evaluate TYREX on a cluster system with workloads consisting of both synthetic and real-world benchmarks. We find that TYREX cuts in half the job slowdown variability while preserving the median job slowdown when compared to state-of-the-art MapReduce schedulers such as FIFO and FAIR. Furthermore, TYREX reduces the job slowdown at the 95th percentile by more than 50% when compared to FIFO and by 20-40% when compared to FAIR.
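The core scheduling idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the two-partition setup, the job sizes, and the limit values are invented for illustration. It models the TYREX variant of TAGS, in which a job that exhausts a partition's runtime limit continues in the next partition with its completed work preserved (work-conserving), rather than restarting from scratch as in classic TAGS.

```python
from collections import deque

def tags_schedule(jobs, limits):
    """Run each job through successive partitions with runtime limits.

    A job that exceeds a partition's limit moves on to the next
    partition; its remaining work carries over (work-conserving).
    `jobs` maps job id -> total processing requirement; `limits`
    lists the per-partition runtime limits, with the last partition
    implicitly unbounded so every job eventually completes.
    Returns, per job, the (partition, time-run) segments.
    """
    remaining = dict(jobs)
    history = {j: [] for j in jobs}
    queue = deque((j, 0) for j in jobs)  # (job id, partition index)
    while queue:
        job, p = queue.popleft()
        limit = limits[p] if p < len(limits) else float("inf")
        ran = min(remaining[job], limit)
        remaining[job] -= ran
        history[job].append((p, ran))
        if remaining[job] > 0:
            # Limit hit: continue the job in the next partition.
            queue.append((job, p + 1))
    return history

# Short jobs finish inside partition 0's limit; the long job spills
# into the unbounded partition 1, so it cannot delay short jobs there.
hist = tags_schedule({"q1": 2, "q2": 3, "big": 50}, limits=[5])
```

In the sketch, the short queries `q1` and `q2` complete entirely in partition 0, while `big` runs for 5 time units there and finishes its remaining 45 units in partition 1. The paper's statistical model addresses the part this sketch fixes by hand: choosing the runtime limits dynamically so that job slowdown is near optimal.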