LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures

Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing. Pub Date: 2017-06-26. DOI: 10.1145/3078597.3078609

Bo Fang, Qiang Guan, Nathan Debardeleben, K. Pattabiraman, M. Ripeanu


Citations: 20

Abstract

Requirements for reliability, low power consumption, and performance place complex and conflicting demands on the design of high-performance computing (HPC) systems. Fault-tolerance techniques such as checkpoint/restart (C/R) protect HPC applications against hardware faults. These techniques, however, have non-negligible overheads, particularly when the fault rate exposed by the hardware is high: it is estimated that in future HPC systems, up to 60% of the computational cycles/power will be used for fault tolerance. To mitigate the overall overhead of fault-tolerance techniques, we propose LetGo, an approach that attempts to continue the execution of an HPC application when crashes would otherwise occur. Our hypothesis is that a class of HPC applications has good enough intrinsic fault tolerance that it is possible to re-purpose the default mechanism that terminates an application once a crash-causing error is signalled, and instead attempt to repair the corrupted application state and continue the application execution. This paper explores this hypothesis and quantifies the impact of using this observation in the context of C/R mechanisms. Our fault-injection experiments using a suite of five HPC applications show that, on average, LetGo is able to elide 62% of the crashes encountered by applications, of which 80% result in correct output, while incurring a negligible performance overhead. As a result, when LetGo is used in conjunction with a C/R scheme, it enables significantly higher efficiency, thereby leading to faster time to solution.
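
The two averages reported in the abstract can be combined into a single breakdown of what happens to a crash-causing fault. The short Python sketch below is only an illustration of that arithmetic: the function name `crash_outcomes` and the outcome labels are assumptions introduced here, not terms from the paper, and the only inputs are the 62% and 80% figures quoted above.

```python
# Averages quoted in the abstract (five HPC applications, fault injection).
ELIDED_FRACTION = 0.62        # share of crashes that LetGo is able to elide
CORRECT_GIVEN_ELIDED = 0.80   # share of elided crashes that end with correct output


def crash_outcomes(elided=ELIDED_FRACTION,
                   correct_given_elided=CORRECT_GIVEN_ELIDED):
    """Split crash-causing faults into three shares that sum to 1.

    The labels are hypothetical names chosen for this illustration.
    """
    return {
        # Execution continues and the final output is correct.
        "continued_correct": elided * correct_given_elided,
        # Execution continues but the output is not correct.
        "continued_not_correct": elided * (1.0 - correct_given_elided),
        # LetGo does not elide the crash; the application still terminates.
        "still_crashes": 1.0 - elided,
    }


if __name__ == "__main__":
    for name, share in crash_outcomes().items():
        print(f"{name}: {share:.1%}")
```

Under these figures, roughly half of the crashes (0.62 × 0.80 ≈ 0.496) end in a continued run with correct output. That share is the portion of crash-triggered restarts a C/R scheme would no longer need to perform.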



