LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures

Bo Fang, Qiang Guan, Nathan Debardeleben, K. Pattabiraman, M. Ripeanu
{"title":"LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures","authors":"Bo Fang, Qiang Guan, Nathan Debardeleben, K. Pattabiraman, M. Ripeanu","doi":"10.1145/3078597.3078609","DOIUrl":null,"url":null,"abstract":"Requirements for reliability, low power consumption, and performance place complex and conflicting demands on the design of high-performance computing (HPC) systems. Fault-tolerance techniques such as checkpoint/restart (C/R) protect HPC applications against hardware faults. These techniques, however, have non negligible overheads particularly when the fault rate exposed by the hardware is high: it is estimated that in future HPC systems, up to 60% of the computational cycles/power will be used for fault tolerance. To mitigate the overall overhead of fault-tolerance techniques, we propose LetGo, an approach that attempts to continue the execution of a HPC application when crashes would otherwise occur. Our hypothesis is that a class of HPC applications have good enough intrinsic fault tolerance so that its possible to re-purpose the default mechanism that terminates an application once a crash-causing error is signalled, and instead attempt to repair the corrupted application state, and continue the application execution. This paper explores this hypothesis, and quantifies the impact of using this observation in the context of checkpoint/restart (C/R) mechanisms. Our fault-injection experiments using a suite of five HPC applications show that, on average, LetGo is able to elide 62% of the crashes encountered by applications, of which 80% result in correct output, while incurring a negligible performance overhead. As a result, when LetGo is used in conjunction with a C/R scheme, it enables significantly higher efficiency thereby leading to faster time to solution.","PeriodicalId":436194,"journal":{"name":"Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3078597.3078609","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

Abstract

Requirements for reliability, low power consumption, and performance place complex and conflicting demands on the design of high-performance computing (HPC) systems. Fault-tolerance techniques such as checkpoint/restart (C/R) protect HPC applications against hardware faults. These techniques, however, have non negligible overheads particularly when the fault rate exposed by the hardware is high: it is estimated that in future HPC systems, up to 60% of the computational cycles/power will be used for fault tolerance. To mitigate the overall overhead of fault-tolerance techniques, we propose LetGo, an approach that attempts to continue the execution of a HPC application when crashes would otherwise occur. Our hypothesis is that a class of HPC applications have good enough intrinsic fault tolerance so that its possible to re-purpose the default mechanism that terminates an application once a crash-causing error is signalled, and instead attempt to repair the corrupted application state, and continue the application execution. This paper explores this hypothesis, and quantifies the impact of using this observation in the context of checkpoint/restart (C/R) mechanisms. Our fault-injection experiments using a suite of five HPC applications show that, on average, LetGo is able to elide 62% of the crashes encountered by applications, of which 80% result in correct output, while incurring a negligible performance overhead. As a result, when LetGo is used in conjunction with a C/R scheme, it enables significantly higher efficiency thereby leading to faster time to solution.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
LetGo:用于故障情况下HPC应用的轻量级连续框架
高性能计算(HPC)系统的设计对可靠性、低功耗和性能提出了复杂且相互矛盾的要求。诸如检查点/重启(C/R)之类的容错技术可以保护HPC应用程序免受硬件故障的影响。然而,这些技术有不可忽略的开销,特别是当硬件暴露的故障率很高时:据估计,在未来的HPC系统中,高达60%的计算周期/功率将用于容错。为了减轻容错技术的总体开销,我们提出了LetGo,这是一种尝试在崩溃发生时继续执行HPC应用程序的方法。我们的假设是,一类HPC应用程序具有足够好的内在容错性,因此可以重新使用默认机制,一旦发出导致崩溃的错误信号就终止应用程序,而不是尝试修复损坏的应用程序状态,并继续执行应用程序。本文探讨了这一假设,并量化了在检查点/重新启动(C/R)机制的背景下使用这一观察结果的影响。我们使用一组5个HPC应用程序进行的故障注入实验表明,平均而言,LetGo能够消除应用程序遇到的62%的崩溃,其中80%的崩溃会产生正确的输出,同时产生微不足道的性能开销。因此,当LetGo与C/R方案结合使用时,它可以显着提高效率,从而加快解决方案的时间。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Deep Learning in Cancer and Infectious Disease: Novel Driver Problems for Future HPC Architecture LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures Explaining Wide Area Data Transfer Performance IOGP: An Incremental Online Graph Partitioning Algorithm for Distributed Graph Databases Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1