Shuou Nomura, Matthew D. Sinclair, C. Ho, Venkatraman Govindaraju, M. Kruijf, K. Sankaralingam
{"title":"采样+ DMR:实用且低开销的永久故障检测","authors":"Shuou Nomura, Matthew D. Sinclair, C. Ho, Venkatraman Govindaraju, M. Kruijf, K. Sankaralingam","doi":"10.1145/2000064.2000089","DOIUrl":null,"url":null,"abstract":"With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy(DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (1% of the time for example) of each periodic execution window (5 million cycles for example). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100% because it can detect all faults eventually. SamplingDMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2%.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"239 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"53","resultStr":"{\"title\":\"Sampling + DMR: Practical and low-overhead permanent fault detection\",\"authors\":\"Shuou Nomura, Matthew D. Sinclair, C. Ho, Venkatraman Govindaraju, M. Kruijf, K. Sankaralingam\",\"doi\":\"10.1145/2000064.2000089\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy(DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (1% of the time for example) of each periodic execution window (5 million cycles for example). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100% because it can detect all faults eventually. SamplingDMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2%.\",\"PeriodicalId\":340732,\"journal\":{\"name\":\"2011 38th Annual International Symposium on Computer Architecture (ISCA)\",\"volume\":\"239 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-06-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"53\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 38th Annual International Symposium on Computer Architecture (ISCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2000064.2000089\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2000064.2000089","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Sampling + DMR: Practical and low-overhead permanent fault detection
With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy(DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (1% of the time for example) of each periodic execution window (5 million cycles for example). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100% because it can detect all faults eventually. SamplingDMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2%.