FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

G. Zheng, L. Shi, L. Kalé
{"title":"FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI","authors":"G. Zheng, L. Shi, L. Kalé","doi":"10.1109/CLUSTR.2004.1392606","DOIUrl":null,"url":null,"abstract":"As high performance clusters continue to grow in size, the mean time between failures shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challenging factors for application scalability. The traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the recent checkpoint. The recovery of the application from faults involves (often manually) restarting applications on all processors and having it read the data from disks on all processors. The restart can therefore take minutes after it has been initiated. Such a strategy requires that the failed processor can be replaced so that the number of processors at checkpoint-time and recovery-time are the same. We present FTC-Charms ++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, when there is no extra processor, the program can continue to run on the remaining processors while minimizing the performance penalty due to losing processors. The method is useful for applications whose memory footprint is small at the checkpoint state, while a variation of this scheme - in-disk checkpoint/restart can be applied to applications with large memory footprint. The scheme does not require any individual component to be fault-free. We have implemented this scheme for Charms++ and AMPI (an adaptive version of MPl). This work describes the scheme and shows performance data on a cluster using 128 processors.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"213","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTR.2004.1392606","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 213

Abstract

As high performance clusters continue to grow in size, the mean time between failures shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challenging factors for application scalability. The traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the recent checkpoint. The recovery of the application from faults involves (often manually) restarting applications on all processors and having it read the data from disks on all processors. The restart can therefore take minutes after it has been initiated. Such a strategy requires that the failed processor can be replaced so that the number of processors at checkpoint-time and recovery-time are the same. We present FTC-Charms ++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, when there is no extra processor, the program can continue to run on the remaining processors while minimizing the performance penalty due to losing processors. The method is useful for applications whose memory footprint is small at the checkpoint state, while a variation of this scheme - in-disk checkpoint/restart can be applied to applications with large memory footprint. The scheme does not require any individual component to be fault-free. We have implemented this scheme for Charms++ and AMPI (an adaptive version of MPl). This work describes the scheme and shows performance data on a cluster using 128 processors.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
FTC-Charm++:用于Charm++和MPI的基于内存检查点的容错运行时
随着高性能集群规模的持续增长,平均故障间隔时间会缩短。因此,容错和可靠性问题正成为应用程序可伸缩性的挑战因素之一。传统的基于磁盘的故障处理方法是将整个应用程序的状态定期检查点到可靠的存储,并从最近的检查点重新启动。从故障中恢复应用程序需要(通常是手动地)重新启动所有处理器上的应用程序,并让它从所有处理器上的磁盘读取数据。因此,重启在启动后可能需要几分钟。这种策略要求可以替换故障的处理器,以便在检查点时间和恢复时间的处理器数量相同。我们提出了FTC-Charms ++,一个基于快速和可扩展的内存检查点和重启方案的容错运行时。在重新启动时,当没有额外的处理器时,程序可以继续在剩余的处理器上运行,同时最小化由于丢失处理器而造成的性能损失。该方法对于在检查点状态下内存占用很小的应用程序很有用,而该方案的一个变体——磁盘内检查点/重启——可以应用于内存占用很大的应用程序。该方案不要求任何单个组件是无故障的。我们已经在Charms++和AMPI (MPl的自适应版本)中实现了该方案。本文描述了该方案,并展示了使用128个处理器的集群上的性能数据。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Kerrighed and data parallelism: cluster computing on single system image operating systems Management of grid jobs and data within SAMGrid MPIIMGEN - a code transformer that parallelizes image processing codes to run on a cluster of workstations FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI Bandwidth-aware co-allocating meta-schedulers for mini-grid architectures
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1