Optimal Placement of In-memory Checkpoints Under Heterogeneous Failure Likelihoods

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2019-05-20. DOI: 10.1109/IPDPS.2019.00098

Zaeem Hussain, T. Znati, R. Melhem


Citations: 2

Abstract



Optimal Placement of In-memory Checkpoints Under Heterogeneous Failure Likelihoods

In-memory checkpointing has grown in popularity over the years because it significantly reduces the time needed to take a checkpoint. It is usually accomplished by placing all or part of a processor's checkpoint in the local memory of a remote node within the cluster. If, however, the checkpointed node and the node holding its checkpoint both fail in quick succession, recovery from in-memory checkpoints becomes impossible. In this paper, we explore the problem of placing in-memory checkpoints among nodes whose individual failure likelihoods are not identical. We provide theoretical results on the optimal way to place in-memory checkpoints so that the probability of a catastrophic failure, i.e., the failure of a node together with the node holding its checkpoint, is minimized. Using failure logs spanning 5 years from a 49,152-node supercomputer, we show that checkpoint placement schemes that exploit knowledge of node failure likelihoods, and are guided by the theoretical results we provide, can significantly reduce the total number of such catastrophic failures compared with placement schemes that are oblivious to the heterogeneity in node failure likelihoods.
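
To make the notion of a catastrophic failure concrete, the following is a minimal sketch under a toy model that is an assumption of this example, not the paper's formulation: nodes are grouped into buddy pairs that hold each other's checkpoints, each node fails independently with a known probability during the vulnerable window, and a pair is lost only when both members fail. All function names and the probability values are hypothetical. Under this model the expected number of lost pairs is the sum of the per-pair products, and the sketch compares a pairing that ignores heterogeneity (neighbors in sorted order) with one that matches the least reliable node to the most reliable one, checking both against a brute-force search over all pairings.

```python
def expected_catastrophic_pairs(pairs):
    """Expected number of buddy pairs in which both nodes fail.

    Assumes independent failures, so a pair is lost with probability p * q.
    """
    return sum(p * q for p, q in pairs)


def pair_adjacent(probs):
    """Pair nodes that are neighbors in sorted order (similar reliability)."""
    s = sorted(probs)
    return [(s[i], s[i + 1]) for i in range(0, len(s), 2)]


def pair_extremes(probs):
    """Pair the most failure-prone node with the least failure-prone one."""
    s = sorted(probs)
    n = len(s)
    return [(s[i], s[n - 1 - i]) for i in range(n // 2)]


def all_pairings(items):
    """Yield every way of splitting an even-length list into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in all_pairings(remaining):
            yield [(first, partner)] + tail


# Hypothetical per-node failure probabilities for six nodes.
probs = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30]

adjacent = expected_catastrophic_pairs(pair_adjacent(probs))
extremes = expected_catastrophic_pairs(pair_extremes(probs))
best = min(expected_catastrophic_pairs(m) for m in all_pairings(probs))

print(f"adjacent pairing:    {adjacent:.4f}")
print(f"extremes pairing:    {extremes:.4f}")
print(f"brute-force minimum: {best:.4f}")
```

In this toy setting the extremes pairing matches the brute-force minimum, which follows from the rearrangement inequality for sums of products. Whether the paper's optimal placement takes this form, or covers other checkpoint layouts, should be checked against the full text.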

