Local recovery and failure masking for stencil-based applications at extreme scales

Marc Gamell, K. Teranishi, M. Heroux, J. Mayo, H. Kolla, Jacqueline H. Chen, M. Parashar
Published in: SC15: International Conference for High Performance Computing, Networking, Storage and Analysis
Publication date: 2015-11-15
DOI: 10.1145/2807591.2807672 (https://doi.org/10.1145/2807591.2807672)
Citations: 36

Abstract

Application resilience is a key challenge that must be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures compared to the more traditional approach in which the job is terminated and restarted from the last checkpoint. In this paper, we explore how local recovery can be used for certain classes of applications to further reduce the overhead of resilience. Specifically, we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership-class systems. We also show how multiple independent failures can be masked to effectively reduce their impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262,144 cores.