Monetary cost optimizations for MPI-based HPC applications on Amazon clouds: checkpoints and replicated execution

Yifan Gong, Bingsheng He, Amelie Chi Zhou
{"title":"Monetary cost optimizations for MPI-based HPC applications on Amazon clouds: checkpoints and replicated execution","authors":"Yifan Gong, Bingsheng He, Amelie Chi Zhou","doi":"10.1145/2807591.2807612","DOIUrl":null,"url":null,"abstract":"In this paper, we propose monetary cost optimizations for MPI-based applications with deadline constraints on Amazon EC2. Particularly, we consider to utilize two kinds of Amazon EC2 instances (on-demand and spot instances). As a spot instance can fail at any time due to out-of-bid events, fault tolerant executions are necessary. Through detailed studies, we have found that two common fault tolerant mechanisms, i.e., checkpoints and replicated executions, are complementary for cost-effective MPI executions on spot instances. We formulate the optimization problem and propose a novel cost model to minimize the expected monetary cost. The experimental results with NPB benchmarks on Amazon EC2 demonstrate that 1) it is feasible to run MPI applications with performance constraints on spot instances, 2) our proposal achieves significant monetary cost reduction compared to the state-of-the-art algorithm and 3) it is necessary to adaptively choose checkpoint and replication techniques for cost-effective and reliable MPI executions on Amazon EC2.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"28","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC15: International Conference for High Performance Computing, Networking, Storage and 
Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2807591.2807612","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 28

Abstract

In this paper, we propose monetary cost optimizations for MPI-based applications with deadline constraints on Amazon EC2. In particular, we consider using two kinds of Amazon EC2 instances: on-demand and spot instances. Because a spot instance can fail at any time due to an out-of-bid event, fault-tolerant execution is necessary. Through detailed studies, we have found that two common fault-tolerance mechanisms, checkpoints and replicated execution, are complementary for cost-effective MPI execution on spot instances. We formulate the optimization problem and propose a novel cost model to minimize the expected monetary cost. Experimental results with the NPB benchmarks on Amazon EC2 demonstrate that 1) it is feasible to run MPI applications with performance constraints on spot instances, 2) our proposal achieves significant monetary cost reduction compared to the state-of-the-art algorithm, and 3) it is necessary to adaptively choose checkpoint and replication techniques for cost-effective and reliable MPI execution on Amazon EC2.
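To make the trade-off between checkpoints and replicated execution concrete, the following is a toy sketch in Python. It is not the cost model proposed in the paper, which the abstract does not spell out. Every function name, parameter, and simplification here is an assumption made for illustration: a fixed failure probability per hour, spot instances billed for the full run, an on-demand rerun when all replicas fail, and no deadline constraint.

```python
def replicated_cost(hours, spot_price, ondemand_price, fail_prob, replicas):
    """Expected cost of running `replicas` copies on spot instances.

    Toy assumptions (not from the paper): each replica is interrupted
    independently with probability `fail_prob` in every hour, all replicas
    are billed for the full run, and if every replica is interrupted the
    job is rerun from scratch on an on-demand instance.
    """
    survive = (1.0 - fail_prob) ** hours      # one replica finishes
    all_fail = (1.0 - survive) ** replicas    # no replica finishes
    spot_cost = replicas * spot_price * hours
    return spot_cost + all_fail * ondemand_price * hours


def checkpointed_cost(hours, spot_price, fail_prob, overhead):
    """Expected cost of one spot run that checkpoints after every hour.

    Toy assumptions (not from the paper): each one-hour segment, including
    its checkpoint, is interrupted with probability `fail_prob` and is then
    retried until it succeeds, so the number of attempts per segment is
    geometric with mean 1 / (1 - fail_prob). `overhead` is the checkpoint
    time as a fraction of an hour.
    """
    expected_attempts = 1.0 / (1.0 - fail_prob)
    return spot_price * hours * (1.0 + overhead) * expected_attempts


def cheaper_strategy(hours, spot_price, ondemand_price, fail_prob,
                     replicas, overhead):
    """Return the name and expected cost of the cheaper toy strategy."""
    rep = replicated_cost(hours, spot_price, ondemand_price, fail_prob, replicas)
    ckpt = checkpointed_cost(hours, spot_price, fail_prob, overhead)
    return ("replication", rep) if rep < ckpt else ("checkpoint", ckpt)


if __name__ == "__main__":
    # Hypothetical prices and failure rate, chosen only for illustration.
    name, cost = cheaper_strategy(hours=10, spot_price=0.1, ondemand_price=1.0,
                                  fail_prob=0.05, replicas=2, overhead=0.1)
    print(f"{name}: expected cost {cost:.3f}")
```

Even in this simplified setting, which strategy is cheaper depends on the failure probability, the checkpoint overhead, and the price gap between spot and on-demand instances, which is consistent with the abstract's point that the choice should be made adaptively.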