Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver

Md. Mohsin Ali, James A. Southern, P. Strazdins, B. Harding
{"title":"Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver","authors":"Md. Mohsin Ali, James A. Southern, P. Strazdins, B. Harding","doi":"10.1109/IPDPSW.2014.132","DOIUrl":null,"url":null,"abstract":"A fault-tolerant version of Open Message Passing Interface (Open MPI), based on the draft User Level Failure Mitigation (ULFM) proposal of the MPI Forum's Fault Tolerance Working Group, is used to create fault-tolerant applications. This allows applications and libraries to design their own recovery methods and control them at the user level. However, only a limited amount of research work on user level failure recovery (including the implementation and performance evaluation of this prototype) has been carried out. This paper contributes a fault-tolerant implementation of an application solving 2D partial differential equations (PDEs) by means of a sparse grid combination technique which is capable of surviving multiple process failures caused by the faults. Our fault recovery involves reconstructing the faulty communicators without shrinking the global size by re-spawning failed MPI processes on the same physical processors where they were before the failure (for load balancing). It also involves restoring lost data from either exact check pointed data on disk, approximated data in memory (via an alternate sparse grid combination technique) or a near-exact copy of replicated data in memory. The experimental results show that the faulty communicator reconstruction time is currently large in the draft ULFM, especially for multiple process failures. They also show that the alternate combination technique has the lowest data recovery overhead, except on a system with very low disk write latency for which checkpointing has the lowest overhead. Furthermore, the errors due to the recovery of approximated data are within a factor of 10 in all cases, with the surprising result that the alternate combination technique being more accurate than the near-exact replication method. The contributed implementation details, including the analysis of the experimental results, of this paper will help application developers to resolve different issues of design and implementation of fault-tolerant applications by means of the Open MPI ULFM standard.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2014.132","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 30

Abstract

A fault-tolerant version of Open Message Passing Interface (Open MPI), based on the draft User Level Failure Mitigation (ULFM) proposal of the MPI Forum's Fault Tolerance Working Group, is used to create fault-tolerant applications. This allows applications and libraries to design their own recovery methods and control them at the user level. However, only a limited amount of research work on user level failure recovery (including the implementation and performance evaluation of this prototype) has been carried out. This paper contributes a fault-tolerant implementation of an application solving 2D partial differential equations (PDEs) by means of a sparse grid combination technique which is capable of surviving multiple process failures caused by the faults. Our fault recovery involves reconstructing the faulty communicators without shrinking the global size by re-spawning failed MPI processes on the same physical processors where they were before the failure (for load balancing). It also involves restoring lost data from either exact check pointed data on disk, approximated data in memory (via an alternate sparse grid combination technique) or a near-exact copy of replicated data in memory. The experimental results show that the faulty communicator reconstruction time is currently large in the draft ULFM, especially for multiple process failures. They also show that the alternate combination technique has the lowest data recovery overhead, except on a system with very low disk write latency for which checkpointing has the lowest overhead. Furthermore, the errors due to the recovery of approximated data are within a factor of 10 in all cases, with the surprising result that the alternate combination technique being more accurate than the near-exact replication method. The contributed implementation details, including the analysis of the experimental results, of this paper will help application developers to resolve different issues of design and implementation of fault-tolerant applications by means of the Open MPI ULFM standard.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
应用程序级故障恢复:在PDE求解器中使用容错开放MPI
基于MPI论坛容错工作组的用户级故障缓解(ULFM)提案草案的开放消息传递接口(Open MPI)容错版本用于创建容错应用程序。这允许应用程序和库设计自己的恢复方法,并在用户级别控制它们。然而,在用户级故障恢复(包括该原型的实现和性能评估)方面的研究工作非常有限。本文利用稀疏网格组合技术实现了一个求解二维偏微分方程的应用程序的容错实现,该应用程序能够承受由故障引起的多个过程故障。我们的故障恢复包括重建故障通信器,而不缩小全局大小,方法是在故障前的相同物理处理器上重新生成失败的MPI进程(用于负载平衡)。它还涉及从磁盘上的精确检查点数据、内存中的近似数据(通过另一种稀疏网格组合技术)或内存中复制数据的接近精确副本恢复丢失的数据。实验结果表明,目前在ULFM草案中,故障通信器的重建时间较大,特别是对于多进程故障。它们还表明,替代组合技术具有最低的数据恢复开销,除了在磁盘写延迟非常低的系统上,检查点具有最低的开销。此外,在所有情况下,由于恢复近似数据而产生的误差都在10倍以内,令人惊讶的结果是,交替组合技术比接近精确的复制方法更准确。本文提供的实现细节,包括对实验结果的分析,将帮助应用程序开发人员通过Open MPI ULFM标准解决设计和实现容错应用程序的各种问题。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
A New Parallel Algorithm for Two-Pass Connected Component Labeling RAW Introduction and Committees HPDIC Introduction and Committees An Evaluation of User Satisfaction Driven Scheduling in a Polymorphic Embedded System HPGC Introduction and Committees
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1