ASC: Improving Spark driver performance with automatic Spark checkpoint

Weirong Zhu, Hao-peng Chen, Fei Hu
DOI: 10.1109/ICACT.2016.7423490
Published in: 2016 18th International Conference on Advanced Communication Technology (ICACT)
Publication date: 2016-03-03
Citations: 12

Abstract

Many big data processing platforms, such as Hadoop MapReduce, continue to improve large-scale data processing performance, making big data processing a focus of the IT industry. Among them, Spark has become an increasingly popular big data processing framework since it was first presented in 2010. Spark uses the RDD as its data abstraction, targeting multi-iteration large-scale data processing with data reuse; the in-memory nature of RDDs makes Spark faster than many non-in-memory big data processing platforms. However, the in-memory design also introduces a volatility problem: a failure or a missing RDD causes Spark to recompute all missing RDDs along the lineage. A long lineage also increases the time and memory the Driver spends analysing that lineage. A checkpoint cuts off the lineage and saves the data required by subsequent computation; the checkpointing frequency and the RDDs selected for saving significantly influence performance. In this paper, we present an automatic checkpoint algorithm for Spark that helps solve the long-lineage problem with little impact on performance. The automatic checkpoint selects the necessary RDDs to save, incurs an acceptable overhead, and improves the running time of multi-iteration jobs.
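The long-lineage problem the abstract describes can be illustrated with a small model. The sketch below is not Spark code and does not reproduce the paper's ASC algorithm; it is a minimal, assumption-laden toy showing why the Driver's per-iteration lineage-analysis cost grows unboundedly without checkpoints, and how a periodic checkpoint (here every fixed number of iterations, purely for illustration) bounds the lineage depth.

```python
# Toy model of RDD lineage growth and checkpoint truncation.
# Assumption: each iteration appends exactly one transformation
# (one stage) to the lineage, so lineage depth is a proxy for the
# Driver's analysis cost at that iteration.

def lineage_depth_per_iteration(iterations, checkpoint_interval=None):
    """Return the lineage depth the Driver must analyse at each iteration.

    Without checkpointing, depth grows by one per iteration. A
    checkpoint cuts the lineage, so depth restarts from the
    checkpointed data on the next iteration.
    """
    depths = []
    depth = 0
    for i in range(1, iterations + 1):
        depth += 1  # this iteration adds one stage to the lineage
        depths.append(depth)
        if checkpoint_interval and i % checkpoint_interval == 0:
            depth = 0  # checkpoint truncates the lineage here
    return depths

no_ckpt = lineage_depth_per_iteration(10)
with_ckpt = lineage_depth_per_iteration(10, checkpoint_interval=3)
print(no_ckpt)    # depth grows linearly: [1, 2, ..., 10]
print(with_ckpt)  # depth is bounded by the interval: [1, 2, 3, 1, 2, 3, ...]
```

In actual Spark, the corresponding manual mechanism is `SparkContext.setCheckpointDir(...)` followed by `rdd.checkpoint()` on a chosen RDD; the paper's contribution is choosing *when* and *which* RDDs to checkpoint automatically, which this toy does not model.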