Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications

2011 IEEE International Parallel & Distributed Processing Symposium. Pub Date: 2011-05-16. DOI: 10.1109/IPDPS.2011.95

Amina Guermouche, Thomas Ropars, E. Brunet, M. Snir, F. Cappello


Citations: 122

Abstract

As reported by many recent studies, the mean time between failures of future post-petascale supercomputers is likely to decrease compared to the current situation. The most popular fault tolerance approach for MPI applications on HPC platforms relies on coordinated checkpointing, which raises two major issues: a) a global restart wastes energy, since all processes are forced to roll back even in the case of a single failure, and b) checkpoint coordination may slow down application execution because of congestion on I/O resources. Alternative approaches based on uncoordinated checkpointing and message logging require logging all messages, imposing high memory/storage occupation and a significant overhead on communications. It has recently been observed that many MPI HPC applications are send-deterministic, which allows new fault tolerance protocols to be designed. In this paper, we propose an uncoordinated checkpointing protocol for send-deterministic MPI HPC applications that (i) logs only a subset of the application messages and (ii) does not systematically require restarting all processes when a failure occurs. We first describe our protocol and prove its correctness. Through experimental evaluations, we show that its implementation in MPICH2 has a negligible overhead on application performance. Then we perform a quantitative evaluation of the properties of our protocol using the NAS Benchmarks. Using a clustering approach, we demonstrate that this protocol actually succeeds in combining the two expected properties: a) it logs only a small fraction of the messages and b) it reduces the average number of processes to roll back by a factor approaching 2 compared to coordinated checkpointing.
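The abstract relies on the notion of send-determinism without defining it. As a rough illustration only, the sketch below assumes the usual reading of the term: for a given input, the sequence of messages a process sends is the same in every correct execution, whatever order its incoming messages arrive in. The toy handlers and the helper `is_send_deterministic` are hypothetical names invented for this example; they are not part of the paper's protocol, MPI, or MPICH2.

```python
from itertools import permutations

# Each message is modeled as a (peer_rank, payload) pair. A "handler" is a toy
# stand-in for one process: it takes the messages in the order they were
# received and returns the ordered list of messages it sends in response.

def reduce_then_forward(received):
    # Sums all payloads, then sends the total to rank 0. Addition is
    # commutative, so the message sent does not depend on receive order.
    total = sum(payload for _src, payload in received)
    return [(0, total)]

def forward_first_arrival(received):
    # Forwards whichever payload arrived first, so the message sent
    # changes with the receive order.
    _src, payload = received[0]
    return [(0, payload)]

def is_send_deterministic(handler, incoming):
    # Hypothetical check: run the handler under every possible receive order
    # and test whether the sequence of sent messages is always identical.
    outcomes = {tuple(handler(list(order))) for order in permutations(incoming)}
    return len(outcomes) == 1

incoming = [(1, 10), (2, 32)]
print(is_send_deterministic(reduce_then_forward, incoming))    # True
print(is_send_deterministic(forward_first_arrival, incoming))  # False
```

The first handler behaves like a reduction that accepts its inputs in any order; the second is the kind of pattern that would break the property. This exhaustive check is only practical for tiny examples and says nothing about how the protocol itself decides which messages to log.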

