Quick Execution Time Predictions for Spark Applications

Sarah Shah, Yasaman Amannejad, Diwakar Krishnamurthy, Mea Wang
{"title":"Quick Execution Time Predictions for Spark Applications","authors":"Sarah Shah, Yasaman Amannejad, Diwakar Krishnamurthy, Mea Wang","doi":"10.23919/CNSM46954.2019.9012752","DOIUrl":null,"url":null,"abstract":"The Apache Spark cluster computing platform is being increasingly used to develop big data analytics applications. There are many scenarios that require quick estimates of the execution time of any given Spark application. For example, users and operators of a Spark cluster often require quick insights on how the execution time of an application is likely to be impacted by the resources allocated to the application, e.g., the number of Spark executor cores assigned, and the size of the data to be processed. Job schedulers can benefit from fast estimates at runtime that would allow them to quickly conFigure a Spark application for a desired execution time using the least amount of resources. While others have developed models to predict the execution time of Spark applications, such models typically require extensive prior executions of applications under various resource allocation settings and data sizes. Consequently, these techniques are not suited for situations where quick predictions are required and very little cluster resources are available for the experimentation needed to build a model. This paper proposes an alternative approach called PERIDOT that addresses this limitation. The approach involves executing a given application under a fixed resource allocation setting with two different-sized, small subsets of its input data. It analyzes logs from these two executions to estimate the dependencies between internal stages in the application. Information on these dependencies combined with knowledge of Spark’s data partitioning mechanisms is used to derive an analytic model that can predict execution times for other resource allocation settings and input data sizes. 
We show that deriving a model using just these two reference executions allows PERIDOT to accurately predict the performance of a variety of Spark applications spanning text analytics, linear algebra, machine learning and Spark SQL. In contrast, we show that a state-of-the-art machine learning based execution time prediction algorithm performs poorly when presented with such limited training data.","PeriodicalId":273818,"journal":{"name":"2019 15th International Conference on Network and Service Management (CNSM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 15th International Conference on Network and Service Management (CNSM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/CNSM46954.2019.9012752","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The Apache Spark cluster computing platform is being increasingly used to develop big data analytics applications. There are many scenarios that require quick estimates of the execution time of any given Spark application. For example, users and operators of a Spark cluster often require quick insight into how the execution time of an application is likely to be impacted by the resources allocated to the application, e.g., the number of Spark executor cores assigned, and the size of the data to be processed. Job schedulers can benefit from fast estimates at runtime that would allow them to quickly configure a Spark application for a desired execution time using the least amount of resources. While others have developed models to predict the execution time of Spark applications, such models typically require extensive prior executions of applications under various resource allocation settings and data sizes. Consequently, these techniques are not suited for situations where quick predictions are required and very few cluster resources are available for the experimentation needed to build a model. This paper proposes an alternative approach called PERIDOT that addresses this limitation. The approach involves executing a given application under a fixed resource allocation setting with two different-sized, small subsets of its input data. It analyzes logs from these two executions to estimate the dependencies between internal stages in the application. Information on these dependencies combined with knowledge of Spark's data partitioning mechanisms is used to derive an analytic model that can predict execution times for other resource allocation settings and input data sizes. We show that deriving a model using just these two reference executions allows PERIDOT to accurately predict the performance of a variety of Spark applications spanning text analytics, linear algebra, machine learning and Spark SQL. In contrast, we show that a state-of-the-art machine learning based execution time prediction algorithm performs poorly when presented with such limited training data.
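To illustrate the flavor of the two-reference-run approach the abstract describes, here is a minimal sketch. It is not PERIDOT's actual model: it assumes a hypothetical per-stage model where stage time grows linearly with input size and tasks execute in scheduling "waves" of ceil(partitions / cores), which is one common way to reason about Spark stage execution. The function names, the two measured data points, and the linear form are all illustrative assumptions.

```python
from math import ceil

# Hedged sketch (not the paper's model): fit a per-stage linear model of
# stage time versus input size from just two small reference runs, then
# predict execution time for other input sizes and core counts by counting
# scheduling waves, i.e. ceil(partitions / cores).

def fit_stage_model(size1, time1, size2, time2):
    """Fit t(d) = a + b*d from two (input_size, stage_time) observations."""
    b = (time2 - time1) / (size2 - size1)  # per-unit-size cost
    a = time1 - b * size1                  # fixed stage overhead
    return a, b

def predict_stage_time(a, b, data_size, partitions, cores):
    """Scale the fitted single-wave time by the number of scheduling waves."""
    waves = ceil(partitions / cores)
    return (a + b * data_size) * waves

# Example: a hypothetical stage measured at 1 GB -> 10 s and 2 GB -> 16 s,
# both small enough to finish in a single wave.
a, b = fit_stage_model(1.0, 10.0, 2.0, 16.0)   # a = 4.0, b = 6.0
# Predict for 8 GB split into 32 partitions on 8 cores (4 waves).
t = predict_stage_time(a, b, 8.0, partitions=32, cores=8)  # -> 208.0 s
```

A full predictor along these lines would repeat this fit for every stage recovered from the Spark event logs and combine the per-stage estimates according to the inter-stage dependencies, which is the part of the problem the paper's log analysis addresses.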