FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935). Pub Date: 2004-09-20. DOI: 10.1109/CLUSTR.2004.1392606

G. Zheng, L. Shi, L. Kalé


Citations: 213

Abstract

As high-performance clusters continue to grow in size, the mean time between failures shrinks. Fault tolerance and reliability are therefore becoming a challenging factor for application scalability. The traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the most recent checkpoint. Recovering the application from a fault involves (often manually) restarting it on all processors and having each processor read its data back from disk. The restart can therefore take minutes after it has been initiated. Such a strategy also requires that the failed processor can be replaced, so that the number of processors is the same at checkpoint time and at recovery time. We present FTC-Charm++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, when there is no extra processor, the program can continue to run on the remaining processors while minimizing the performance penalty due to the lost processors. The method is useful for applications whose memory footprint is small at the checkpoint state, while a variation of this scheme, in-disk checkpoint/restart, can be applied to applications with a large memory footprint. The scheme does not require any individual component to be fault-free. We have implemented this scheme for Charm++ and AMPI (an adaptive version of MPI). This work describes the scheme and shows performance data on a cluster using 128 processors.
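The abstract does not spell out the checkpoint protocol, so the following Python simulation is only a rough sketch of how in-memory checkpoint/restart without a spare processor could work. It assumes a "buddy" arrangement in which each processor keeps one copy of its objects in its own memory and a second copy in a neighbor's memory, so that no single processor has to be fault-free. Every name here (`Processor`, `checkpoint`, `fail`, `recover`, `run_demo`) is hypothetical and is not part of the Charm++ or AMPI API; the round-robin redistribution stands in for whatever load balancing the real runtime performs.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Simulated processor: live objects plus two in-memory checkpoint areas."""
    rank: int
    alive: bool = True
    objects: dict = field(default_factory=dict)     # object id -> live state
    local_ckpt: dict = field(default_factory=dict)  # copy of this processor's own objects
    buddy_ckpt: dict = field(default_factory=dict)  # owner rank -> copy of the owner's objects


def live(procs):
    return [p for p in procs if p.alive]


def checkpoint(procs):
    """Copy each live processor's objects into its own memory and its buddy's memory."""
    ring = live(procs)
    if len(ring) < 2:
        raise RuntimeError("need at least two live processors to keep two copies")
    for p in ring:
        p.buddy_ckpt = {}
    for i, p in enumerate(ring):
        buddy = ring[(i + 1) % len(ring)]  # assumed buddy rule: next live processor
        p.local_ckpt = copy.deepcopy(p.objects)
        buddy.buddy_ckpt[p.rank] = copy.deepcopy(p.objects)


def fail(procs, rank):
    """Simulate a crash: everything held in the processor's memory is lost."""
    p = procs[rank]
    p.alive = False
    p.objects, p.local_ckpt, p.buddy_ckpt = {}, {}, {}


def recover(procs, failed_rank):
    """Roll back survivors and re-home the failed processor's objects on them."""
    survivors = live(procs)
    # Every survivor returns to the state of the last checkpoint.
    for p in survivors:
        p.objects = copy.deepcopy(p.local_ckpt)
    # The buddy still holds a copy of the objects that lived on the failed processor.
    holder = next(p for p in survivors if failed_rank in p.buddy_ckpt)
    lost = holder.buddy_ckpt[failed_rank]
    # No spare processor: spread the lost objects over the survivors (round-robin).
    for i, oid in enumerate(sorted(lost)):
        survivors[i % len(survivors)].objects[oid] = copy.deepcopy(lost[oid])
    # Checkpoint again so that every object has two copies once more.
    checkpoint(procs)


def run_demo():
    procs = [Processor(rank=r) for r in range(4)]
    for oid in range(8):
        procs[oid % 4].objects[oid] = {"iteration": 0}
    checkpoint(procs)
    # The application makes progress after the checkpoint...
    for p in procs:
        for state in p.objects.values():
            state["iteration"] = 5
    # ...then one processor fails and the run resumes on the other three.
    fail(procs, 2)
    recover(procs, 2)
    return procs


if __name__ == "__main__":
    for p in live(run_demo()):
        print(p.rank, sorted(p.objects))
```

In this sketch the work done after the last checkpoint is lost and must be recomputed, and a second failure that hits both holders of an object's copies before the next checkpoint is not recoverable. The sketch also keeps two full copies of the application state in memory, which is consistent with the abstract's remark that the in-memory method suits applications with a small memory footprint at checkpoint time.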


