Towards a More Complete Understanding of SDC Propagation

Jon C. Calhoun, M. Snir, Luke N. Olson, W. Gropp
Published in: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
Publication date: 2017-06-26
DOI: 10.1145/3078597.3078617 (https://doi.org/10.1145/3078597.3078617)
Citations: 25

Abstract

With the rate of errors that can silently affect an application's state/output expected to increase on future HPC machines, numerous application-level detection and recovery schemes have been proposed. Recovery is more efficient when errors are contained and affect only part of the computation's state. Containment is usually achieved by verifying all information leaking out of a statically defined containment domain, which is an expensive procedure. Alternatively, error propagation can be analyzed to bound the domain that is affected by a detected error. This paper investigates how silent data corruption (SDC) due to soft errors propagates through three HPC applications: HPCCG, Jacobi, and CoMD. To allow for a more detailed view of error propagation, the paper tracks propagation at the instruction and application variable level. The impact of detection latency on error propagation is shown along with an application's ability to recover. Finally, the impact of compiler optimizations is explored along with the impact of local problem size on error propagation.
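As a rough illustration of what tracking propagation at the application variable level can mean, the toy sketch below flips one bit in a 1-D Jacobi sweep and compares the faulty run against a fault-free copy executed in lockstep. It is not the paper's tooling or methodology: the grid, the injection site, and the names `flip_bit`, `jacobi_step`, and `corrupted_counts` are assumptions made up for this example. The initial state is constant, so the fault-free run never changes and every difference is attributable to the injected fault.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE-754 double representation flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def jacobi_step(u):
    """One Jacobi sweep for the 1-D Laplace equation; end points stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def corrupted_counts(n=33, steps=8, inject_step=2, index=16, bit=51):
    """Run a fault-free and a faulty solve in lockstep.

    Hypothetical setup: every grid point starts at 1.0, which is already the
    converged solution, so the fault-free state stays constant. Returns, for
    each sweep, how many grid points differ between the two runs.
    """
    golden = [1.0] * n
    faulty = golden[:]
    counts = []
    for step in range(steps):
        if step == inject_step:
            # Simulated soft error: corrupt one element just before the sweep.
            faulty[index] = flip_bit(faulty[index], bit)
        golden = jacobi_step(golden)
        faulty = jacobi_step(faulty)
        counts.append(sum(g != f for g, f in zip(golden, faulty)))
    return counts


if __name__ == "__main__":
    print(corrupted_counts())
```

With the defaults, the count is 0 before the injection and then grows by one grid point per sweep: `[0, 0, 2, 3, 4, 5, 6, 7]`. In this toy setting, a longer detection latency therefore means a larger region of state that would need to be repaired.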