Optimizing Shuffle in Wide-Area Data Analytics

2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). Pub Date: 2017-06-05. DOI: 10.1109/ICDCS.2017.131

Shuhao Liu, Hao Wang, Baochun Li


Citations: 17

Abstract

As increasingly large volumes of raw data are generated at geographically distributed datacenters, they need to be efficiently processed by data analytic jobs spanning multiple datacenters across wide-area networks. Designed for a single datacenter, existing data processing frameworks, such as Apache Spark, are not able to deliver satisfactory performance when these wide-area analytic jobs are executed. As wide-area networks interconnecting datacenters may not be congestion free, there is a compelling need for a new system framework that is optimized for wide-area data analytics. In this paper, we design and implement a new proactive data aggregation framework based on Apache Spark, with a focus on optimizing the network traffic incurred in shuffle stages of data analytic jobs. The objective of this framework is to strategically and proactively aggregate the output data of mapper tasks to a subset of worker datacenters, as a replacement for Spark's original passive fetch mechanism across datacenters. It improves the performance of wide-area analytic jobs by avoiding repetitive data transfers, which improves the utilization of inter-datacenter links. Our extensive experimental results using standard benchmarks across six Amazon EC2 regions show that our proposed framework is able to reduce job completion times by up to 73%, as compared to the existing baseline implementation in Spark.
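The aggregation idea in the abstract can be illustrated with a toy model. The sketch below is not the authors' algorithm or any Spark API: the function names, the datacenter labels, and the "aggregate where most data already resides" rule are all assumptions made for illustration. It only shows how choosing a single aggregation site determines how many bytes of mapper output must cross inter-datacenter links.

```python
from collections import defaultdict


def bytes_per_datacenter(map_outputs):
    """Sum mapper-output bytes by the datacenter that currently holds them."""
    totals = defaultdict(int)
    for datacenter, size in map_outputs:
        totals[datacenter] += size
    return dict(totals)


def pick_aggregation_site(map_outputs):
    """Hypothetical heuristic (an assumption, not taken from the paper):
    aggregate at the datacenter that already holds the most output.
    Ties are broken alphabetically so the result is deterministic."""
    totals = bytes_per_datacenter(map_outputs)
    return max(sorted(totals), key=totals.get)


def cross_dc_bytes(map_outputs, site):
    """Bytes that must cross the wide-area network if every mapper
    output is pushed to `site`."""
    return sum(size for datacenter, size in map_outputs if datacenter != site)


# Illustrative input only: (datacenter, mapper output size in MB).
outputs = [("us-east", 400), ("us-west", 100), ("eu-west", 250), ("us-east", 150)]
site = pick_aggregation_site(outputs)
wan_traffic = cross_dc_bytes(outputs, site)
```

In this made-up input, the site holding the most output is also the one that minimizes the volume pushed over the wide-area network; a real system would additionally have to account for link bandwidth and for where the reducer tasks are placed.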

