Techniques to reduce the soft error rate of a high-performance microprocessor

Proceedings. 31st Annual International Symposium on Computer Architecture, 2004. Pub Date: 2004-06-19. DOI: 10.1145/1028176.1006723

Christopher T. Weaver, J. Emer, Shubhendu S. Mukherjee, S. Reinhardt


Citations: 287

Abstract

Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor counts in future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a device makes that device more likely to encounter a fault. Hence, maintaining processor error rates at acceptable levels will require increasing design effort. This paper proposes two simple approaches to reduce error rates and evaluates their application to a microprocessor instruction queue. The first technique reduces the time instructions sit in vulnerable storage structures by selectively squashing instructions when long delays are encountered. A fault is less likely to cause an error if the structure it affects does not contain valid instructions. We introduce a new metric, MITF (Mean Instructions To Failure), to capture the trade-off between performance and reliability introduced by this approach. The second technique addresses false detected errors. In the absence of a fault detection mechanism, such errors would not have affected the final outcome of a program. For example, a fault affecting the result of a dynamically dead instruction would not change the final program output, but could still be flagged by the hardware as an error. To avoid signaling such false errors, we modify a pipeline's error detection logic to mark affected instructions and data as possibly incorrect rather than immediately signaling an error. Then, we signal an error only if we determine later that the possibly incorrect value could have affected the program's output.
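The performance/reliability trade-off that MITF is meant to capture can be illustrated with a small sketch. The abstract does not give a formula, so the model below is an assumption: MITF is taken as instruction throughput multiplied by mean time to failure, and the failure rate is taken as a raw fault rate scaled by the fraction of time the structure holds valid instructions. The function name, parameter names, and all numeric values are hypothetical and are not results from the paper.

```python
SECONDS_PER_HOUR = 3600.0
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def mitf(ipc, frequency_hz, raw_fit, vulnerable_fraction):
    """Mean Instructions To Failure under the assumed model.

    ipc                 -- committed instructions per cycle
    frequency_hz        -- clock frequency in Hz
    raw_fit             -- raw fault rate of the structure, in FIT (assumed input)
    vulnerable_fraction -- fraction of faults that hit valid instructions
                           and therefore become errors (assumed input)
    """
    failures_per_hour = raw_fit * vulnerable_fraction / HOURS_PER_FIT_UNIT
    mttf_seconds = SECONDS_PER_HOUR / failures_per_hour
    # Instructions committed, on average, between two failures.
    return ipc * frequency_hz * mttf_seconds


# Hypothetical numbers: squashing costs a little IPC but leaves the
# instruction queue holding valid instructions for less of the time.
baseline = mitf(ipc=1.20, frequency_hz=2.0e9, raw_fit=1000.0, vulnerable_fraction=0.30)
squashed = mitf(ipc=1.16, frequency_hz=2.0e9, raw_fit=1000.0, vulnerable_fraction=0.20)

print(f"baseline MITF: {baseline:.3e} instructions")
print(f"squashed MITF: {squashed:.3e} instructions")
print(f"improvement:   {squashed / baseline:.2f}x")
```

With these made-up inputs, the IPC drops by about 3% while the vulnerable fraction drops by a third, so MITF under this model rises by a factor of about 1.45. Under the same model, a squashing policy that lost more performance than it gained in reduced exposure would lower MITF, which is the kind of trade-off a single combined metric makes visible.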



