Sampling + DMR: Practical and low-overhead permanent fault detection

2011 38th Annual International Symposium on Computer Architecture (ISCA). Pub Date: 2011-06-04. DOI: 10.1145/2000064.2000089

Shuou Nomura, Matthew D. Sinclair, C. Ho, Venkatraman Govindaraju, M. Kruijf, K. Sankaralingam


Citations: 53

Abstract

With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy (DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (for example, 1% of the time) of each periodic execution window (for example, 5 million cycles). Although Sampling-DMR can leave some errors undetected, we argue that the permanent fault coverage is 100% because it can detect all faults eventually. Sampling-DMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove that an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2%.
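As a rough illustration of the sampling idea in the abstract, the Python sketch below estimates how quickly a permanent fault would be caught when only a slice of each window runs in DMR mode. It is a toy model and not the paper's probabilistic model. It assumes the fault produces an error independently in each cycle with a fixed probability, and that any error occurring during the DMR slice is detected. All function names and the `error_rate` parameter are hypothetical; only the 1% fraction and the 5-million-cycle window come from the abstract's examples.

```python
import math


def dmr_cycles_per_window(window_cycles: int, dmr_fraction: float) -> int:
    """Number of cycles in each window that run in DMR mode."""
    return int(round(window_cycles * dmr_fraction))


def detect_prob_per_window(error_rate: float, window_cycles: int,
                           dmr_fraction: float) -> float:
    """Probability that at least one error falls inside the DMR slice.

    Assumption (not from the paper): errors occur independently in each
    cycle with probability `error_rate`.
    """
    checked = dmr_cycles_per_window(window_cycles, dmr_fraction)
    return 1.0 - (1.0 - error_rate) ** checked


def windows_to_reach(confidence: float, per_window_prob: float) -> int:
    """Smallest n such that 1 - (1 - q)^n >= confidence, with q = per_window_prob."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if per_window_prob <= 0.0:
        raise ValueError("a fault that never causes errors is never detected")
    if per_window_prob >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - per_window_prob))


if __name__ == "__main__":
    window, fraction = 5_000_000, 0.01   # example values from the abstract
    for rate in (1e-4, 1e-6):            # hypothetical per-cycle error rates
        q = detect_prob_per_window(rate, window, fraction)
        n = windows_to_reach(0.999, q)
        print(f"error_rate={rate:g}: per-window detection={q:.4f}, "
              f"windows for 99.9% detection={n}")
```

Under these assumptions the chance of a fault staying undetected after n windows is (1 - q)^n, which shrinks toward zero for any q > 0. That is the intuition behind the claim that every permanent fault is detected eventually, while some errors inside the unchecked part of a window can still slip through.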



