BlackJack: Hard Error Detection with Redundant Threads on SMT

E. Schuchman, T. N. Vijaykumar
Published in: 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)
Publication date: 2007-06-25
DOI: 10.1109/DSN.2007.23 (https://doi.org/10.1109/DSN.2007.23)
Citations: 25

Abstract

Testing is a difficult process that becomes more difficult with device scaling. With smaller and faster devices, tolerance for errors shrinks, and a device may behave correctly under some conditions but not under others. As a result, hard errors may exist yet be exercised only by very specific machine states and signal pathways. Targeting these errors is difficult, and creating test cases that cover all machine states and pathways is not possible. In addition, new complications during burn-in may mean that latent hard errors are not exposed in the fab and reach the customer before becoming active. To address this problem, we propose an architecture called BlackJack that detects hard errors using redundant threads running on a single simultaneous multithreading (SMT) core. The technique provides a safety net that catches hard errors that were either latent during test or not covered by the test cases at all. Like SRT, our technique executes redundant copies and verifies that their resulting machine states agree. Unlike SRT, BlackJack achieves high hard-error instruction coverage by executing the redundant threads on different frontend and backend resources in the pipeline. We show that, for a 15% performance penalty over SRT, BlackJack achieves 97% hard-error instruction coverage compared to SRT's 35%.
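
The abstract's central point is that redundant execution exposes a hard error only when the two copies use different hardware resources. The following toy model is one way to picture this. It is a sketch under stated assumptions, not the paper's mechanism: the `Alu` class, the `stuck_bit` stuck-at-0 fault model, and the `run_redundant` helper are hypothetical names introduced for illustration, and the abstract does not describe how BlackJack actually steers instructions to resources.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Alu:
    """Toy functional unit (hypothetical, for illustration only)."""
    name: str
    # Bit position permanently stuck at 0, or None for a healthy unit.
    # This models a hard (permanent) fault; it is an assumption of the sketch.
    stuck_bit: Optional[int] = None

    def add(self, a: int, b: int) -> int:
        result = (a + b) & 0xFFFFFFFF
        if self.stuck_bit is not None:
            result &= ~(1 << self.stuck_bit)  # force the faulty bit to 0
        return result


def run_redundant(ops: List[Tuple[int, int]], leading: Alu, trailing: Alu) -> List[int]:
    """Run every operation twice and return the indices where results disagree."""
    mismatches = []
    for i, (a, b) in enumerate(ops):
        if leading.add(a, b) != trailing.add(a, b):
            mismatches.append(i)
    return mismatches


ops = [(1, 2), (4, 4), (7, 1), (3, 3)]
faulty = Alu("alu0", stuck_bit=3)
healthy = Alu("alu1")

# Both copies on the same faulty unit: results are corrupted identically,
# so the comparison never flags the error.
same_unit = run_redundant(ops, faulty, faulty)

# Copies on different units: the fault is flagged, but only for the inputs
# that actually exercise the stuck bit.
different_units = run_redundant(ops, faulty, healthy)

print(same_unit)        # []
print(different_units)  # [1, 2]
```

In this model only the operations whose true result sets bit 3 trigger the fault, which mirrors the abstract's remark that hard errors are exercised only by specific machine states. Running both copies on the same unit detects nothing, while running them on different units detects every exercised fault.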