Local recovery and failure masking for stencil-based applications at extreme scales

SC15: International Conference for High Performance Computing, Networking, Storage and Analysis. Pub Date: 2015-11-15. DOI: 10.1145/2807591.2807672

Marc Gamell, K. Teranishi, M. Heroux, J. Mayo, H. Kolla, Jacqueline H. Chen, M. Parashar


Citations: 36

Abstract

Application resilience is a key challenge that must be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures compared to the traditional approach in which the job is terminated and restarted from the last checkpoint. In this paper, we explore how local recovery can be used for certain classes of applications to further reduce resilience overheads. Specifically, we develop programming support and scalable runtime mechanisms that enable online, transparent local recovery for stencil-based parallel applications on current leadership-class systems. We also show how multiple independent failures can be masked to reduce their impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation and experimentally demonstrate, on the Titan Cray XK7 system at ORNL, the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262,144 cores.
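The difference between global recovery and local recovery with failure masking can be illustrated with a toy timing model. The sketch below is not the authors' runtime and uses nothing from S3D; the function name `finish_time`, its parameters, and the delay values are all hypothetical. It assumes a 1D stencil in which a rank can begin an iteration only after it and its two neighbors have finished the previous one, so a recovery delay spreads outward by one rank per iteration rather than stalling every process at once.

```python
def finish_time(num_ranks, num_iters, step_cost, failures, local=True):
    """Completion time of a toy 1D stencil run (illustrative model only).

    failures maps (rank, iteration) -> recovery delay; all values here
    are hypothetical and not taken from the paper.
    """
    t = [0.0] * num_ranks  # time at which each rank finished its last iteration
    for it in range(num_iters):
        if local:
            nxt = []
            for r in range(num_ranks):
                # Assumed dependency: a rank waits only for itself and
                # its immediate left and right neighbors.
                left = t[r - 1] if r > 0 else t[r]
                right = t[r + 1] if r < num_ranks - 1 else t[r]
                start = max(left, t[r], right)
                nxt.append(start + step_cost + failures.get((r, it), 0.0))
            t = nxt
        else:
            # Global recovery: every rank waits for every failure, so
            # the delays of all failures in this iteration add up.
            start = max(t)
            delay = sum(d for (_, i), d in failures.items() if i == it)
            t = [start + step_cost + delay] * num_ranks
    return max(t)


# Two independent failures on distant ranks, each costing 3 time units.
failures = {(5, 2): 3.0, (50, 4): 3.0}
global_time = finish_time(64, 10, 1.0, failures, local=False)
local_time = finish_time(64, 10, 1.0, failures, local=True)
print(global_time, local_time)
```

In this model the global case pays for both failures in sequence, whereas in the local case the two delays overlap in time because neither delayed region reaches the other before the run ends. Real behavior depends on the application's communication pattern and on how recovery is implemented, which this sketch does not attempt to capture.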


