Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Pub Date : 2014-11-16 DOI:10.1145/2897189
A. Benoit, Aurélien Cavelan, Y. Robert, Hongyang Sun
{"title":"Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors","authors":"A. Benoit, Aurélien Cavelan, Y. Robert, Hongyang Sun","doi":"10.1145/2897189","DOIUrl":null,"url":null,"abstract":"In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bicriteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via Dynamic Voltage and Frequency Scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.","PeriodicalId":0,"journal":{"name":"","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2897189","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 34

Abstract

In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bicriteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via Dynamic Voltage and Frequency Scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
评估处理故障停止和静默错误的通用算法
在本文中,我们将传统的检查点和回滚恢复策略与验证机制结合起来,以处理故障停止和静默错误。目标是最小化完工时间和/或能源消耗。对于可分负载应用程序,我们使用一阶近似来找到最佳检查点周期以最小化执行时间,并使用额外的验证机制在每个检查点之前检测沉默错误,从而扩展了Young和Daly的经典公式,仅适用于失败停止错误。我们进一步扩展了该方法,以包括中间验证,并考虑涉及时间和能量的双标准问题(执行时间和能量消耗的线性组合)。然后,我们重点研究了应用程序工作流,其依赖图是一个线性的任务链。在这里,我们为双标准问题确定最优检查点和验证位置,有或没有中间验证。与在整个执行过程中使用单一速度不同,我们进一步引入了一个新的执行场景,该场景允许通过动态电压和频率缩放(DVFS)改变执行速度。在后一种场景中,我们确定最佳检查点和验证位置,以及任意两个连续检查点之间每个任务段的最佳速度对。最后,我们进行了一组广泛的仿真来支持理论研究,并评估了每种算法的性能,结果表明,在使用中间验证和不同速度的最灵活场景下,可以获得最佳的整体性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1