ASC: Improving Spark driver performance with automatic Spark checkpoint

Weirong Zhu, Hao-peng Chen, Fei Hu
DOI: 10.1109/ICACT.2016.7423490
Published in: 2016 18th International Conference on Advanced Communication Technology (ICACT)
Publication date: 2016-03-03
Citations: 12

Abstract

Many big data processing platforms, such as Hadoop MapReduce, continue to improve large-scale data processing performance, making big data processing a focus of the IT industry. Among them, Spark has become an increasingly popular big data processing framework since it was first presented in 2010. Spark uses the RDD as its data abstraction, targeting multi-iteration large-scale data processing with data reuse; the in-memory nature of RDDs makes Spark faster than many non-in-memory big data processing platforms. However, the in-memory design also introduces a volatility problem: a failure or a missing RDD causes Spark to recompute all missing RDDs along the lineage. A long lineage also increases the time and memory the Driver spends analysing that lineage. A checkpoint cuts off the lineage and saves the data required by subsequent computation; the checkpointing frequency and the RDDs selected for saving significantly influence performance. In this paper, we present an automatic checkpoint algorithm for Spark that helps solve the long-lineage problem with little impact on performance. The automatic checkpoint selects the necessary RDDs to save, incurs an acceptable overhead, and improves the running time of multi-iteration jobs.
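The long-lineage problem the abstract describes can be illustrated with a small model. The sketch below is not Spark code and does not reproduce the paper's ASC algorithm; it is a minimal, assumption-laden toy showing why the Driver's per-iteration lineage-analysis cost grows unboundedly without checkpoints, and how a periodic checkpoint (here every fixed number of iterations, purely for illustration) bounds the lineage depth.

```python
# Toy model of RDD lineage growth and checkpoint truncation.
# Assumption: each iteration appends exactly one transformation
# (one stage) to the lineage, so lineage depth is a proxy for the
# Driver's analysis cost at that iteration.

def lineage_depth_per_iteration(iterations, checkpoint_interval=None):
    """Return the lineage depth the Driver must analyse at each iteration.

    Without checkpointing, depth grows by one per iteration. A
    checkpoint cuts the lineage, so depth restarts from the
    checkpointed data on the next iteration.
    """
    depths = []
    depth = 0
    for i in range(1, iterations + 1):
        depth += 1  # this iteration adds one stage to the lineage
        depths.append(depth)
        if checkpoint_interval and i % checkpoint_interval == 0:
            depth = 0  # checkpoint truncates the lineage here
    return depths

no_ckpt = lineage_depth_per_iteration(10)
with_ckpt = lineage_depth_per_iteration(10, checkpoint_interval=3)
print(no_ckpt)    # depth grows linearly: [1, 2, ..., 10]
print(with_ckpt)  # depth is bounded by the interval: [1, 2, 3, 1, 2, 3, ...]
```

In actual Spark, the corresponding manual mechanism is `SparkContext.setCheckpointDir(...)` followed by `rdd.checkpoint()` on a chosen RDD; the paper's contribution is choosing *when* and *which* RDDs to checkpoint automatically, which this toy does not model.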