Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications

Amina Guermouche, Thomas Ropars, E. Brunet, M. Snir, F. Cappello
2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS), published 2011-05-16
DOI: 10.1109/IPDPS.2011.95 (https://doi.org/10.1109/IPDPS.2011.95)
Citations: 122

Abstract

As reported by many recent studies, the mean time between failures of future post-petascale supercomputers is likely to decrease compared to the current situation. The most popular fault tolerance approach for MPI applications on HPC platforms relies on coordinated checkpointing, which raises two major issues: a) global restart wastes energy, since all processes are forced to roll back even in the case of a single failure, and b) checkpoint coordination may slow down application execution because of congestion on I/O resources. Alternative approaches based on uncoordinated checkpointing and message logging require logging all messages, imposing high memory/storage occupancy and a significant overhead on communications. It has recently been observed that many MPI HPC applications are send-deterministic, which allows new fault tolerance protocols to be designed. In this paper, we propose an uncoordinated checkpointing protocol for send-deterministic MPI HPC applications that (i) logs only a subset of the application messages and (ii) does not systematically require restarting all processes when a failure occurs. We first describe our protocol and prove its correctness. Through experimental evaluations, we show that its implementation in MPICH2 has a negligible overhead on application performance. We then perform a quantitative evaluation of the properties of our protocol using the NAS Benchmarks. Using a clustering approach, we demonstrate that this protocol succeeds in combining the two expected properties: a) it logs only a small fraction of the messages and b) it reduces the average number of processes to roll back by a factor approaching 2 compared to coordinated checkpointing.
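The abstract does not define send-determinism. As the term is commonly used, an application is send-deterministic if, for a given input, each process sends the same sequence of messages in every correct execution, regardless of the order in which it receives messages. The following Python sketch illustrates that property only; it is not the paper's protocol, it does not use MPI, and the names `reduce_process`, `forward_process`, and `is_send_deterministic` are hypothetical helpers introduced for illustration.

```python
from itertools import permutations


def reduce_process(incoming):
    """Hypothetical process: receives one value per peer in arbitrary order
    (as with a wildcard receive), then sends the sum to rank 0."""
    total = 0
    for _sender, value in incoming:
        total += value  # integer addition is commutative, so order is irrelevant
    return [(0, total)]  # sequence of (destination, payload) sends


def forward_process(incoming):
    """Counter-example: forwards each value as soon as it arrives, so the
    send sequence depends on the reception order."""
    return [(0, value) for _sender, value in incoming]


def is_send_deterministic(process, messages):
    """Check, by brute force over all reception orders, whether the process
    always emits the same sequence of sends."""
    send_sequences = {
        tuple(process(list(order))) for order in permutations(messages)
    }
    return len(send_sequences) == 1


# Assumed toy input: (sender rank, payload) pairs.
messages = [(1, 10), (2, 20), (3, 30)]

print(is_send_deterministic(reduce_process, messages))   # True
print(is_send_deterministic(forward_process, messages))  # False
```

In this toy model, `reduce_process` tolerates any reception order because its single outgoing message is the same in every case, whereas `forward_process` would produce different send sequences after a replay with a different reception order.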