Load Balancing in MapReduce Based on Scalable Cardinality Estimates

2012 IEEE 28th International Conference on Data Engineering Pub Date : 2012-04-01 DOI:10.1109/ICDE.2012.58

B. Gufler, Nikolaus Augsten, Angelika Reiser, A. Kemper


Citations: 142

Abstract

MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is increasingly used in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution, while scientific data sets are often highly skewed. The resulting load imbalance, which increases the processing time, is further amplified by the high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. Accurate cost estimation is the basis for adaptive load balancing algorithms and requires gathering statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets to approximate the global data distribution.
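The two-component design described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes that "most relevant subset" means the k largest key clusters seen by a mapper, and it assumes a quadratic reducer cost model. The function names (`mapper_summary`, `integrate`, `estimated_cost`) and the summary layout are hypothetical and chosen only for illustration.

```python
from collections import Counter

def mapper_summary(keys, k):
    """Monitoring (one mapper): count local keys, keep only the k largest clusters.

    Assumption: the 'most relevant subset' is approximated by a local top-k.
    """
    local = Counter(keys)
    head = dict(local.most_common(k))
    # Tuples that fall outside the reported head are sent only as one total.
    tail_tuples = sum(local.values()) - sum(head.values())
    return {"head": head, "tail_tuples": tail_tuples}

def integrate(summaries):
    """Integration (controller): merge all heads in a single round.

    Merged counts are lower bounds, because a key may sit in the
    unreported tail of some mapper.
    """
    merged = Counter()
    tail_tuples = 0
    for s in summaries:
        merged.update(s["head"])
        tail_tuples += s["tail_tuples"]
    return merged, tail_tuples

def estimated_cost(cluster_sizes, cost=lambda n: n * n):
    """Apply a cost model per cluster (quadratic here, as an assumption)."""
    return sum(cost(n) for n in cluster_sizes)

if __name__ == "__main__":
    s1 = mapper_summary(["a"] * 5 + ["b"] * 2 + ["c"], k=2)
    s2 = mapper_summary(["a"] * 4 + ["c"] * 3 + ["d"], k=2)
    merged, tail = integrate([s1, s2])
    print(merged, tail, estimated_cost(merged.values()))
```

In this toy run the key "c" truly occurs four times, but the controller sees only three of them, since one occurrence stayed in the first mapper's tail. That gap is the kind of estimation error the three constraints in the abstract (small statistics, a global view, a single round) make hard to avoid.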


