Transient Fault Tolerance on Chip Multiprocessor Based on Dual and Triple Core Redundancy

Rui Gong, Kui Dai, Zhiying Wang
Published in: 2008 14th IEEE Pacific Rim International Symposium on Dependable Computing, 2008-12-15
DOI: 10.1109/PRDC.2008.40 (https://doi.org/10.1109/PRDC.2008.40)
Citations: 15

Abstract

To address the increasing susceptibility of microprocessors to transient faults, many techniques have been proposed that exploit the core redundancy of chip multiprocessors (CMPs). In these core-redundancy-based techniques, however, inter-core communication becomes critical. To reduce the inter-core communication bandwidth demand, this paper proposes two new fault-tolerance approaches: dual core redundancy (DCR) and triple core redundancy (TCR). In DCR, only store instructions are compared before commit, which greatly reduces the bandwidth demand, and fault recovery is achieved by context saving and restoration. TCR applies triple modular redundancy (TMR) at the core level to exploit the core resources of CMPs efficiently for transient fault masking. In TCR, only the results of store instructions are compared, which detects transient faults while reducing the inter-core communication bandwidth demand. Once a single event upset (SEU) is detected, TCR can be reconfigured to execute on the two uncorrupted cores for fault detection. Experimental results demonstrate that, compared with the traditional transient fault recovery scheme CRTR, both DCR and TCR efficiently reduce inter-core bandwidth demand. DCR achieves transient fault recovery with a reasonable performance overhead caused by context saving. TCR occupies more core resources but has the lowest performance overhead during normal execution.
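
The two ideas in the abstract, comparing only stores before commit with rollback to a saved context (DCR) and majority voting over three cores' store results (TCR), can be illustrated with a toy software model. The sketch below is not the paper's hardware design. The names `dcr_check`, `tcr_vote`, and `run_dcr`, the `(address, value)` store representation, the checkpoint interval, and the choice to snapshot memory as part of the saved context are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Store = Tuple[int, int]  # hypothetical model of a store: (address, value)


def dcr_check(store_a: Store, store_b: Store) -> bool:
    """DCR-style detection: a store may commit only if both cores agree."""
    return store_a == store_b


def tcr_vote(stores: List[Store]) -> Tuple[Optional[Store], List[int]]:
    """TCR-style masking: majority-vote the store results of three cores.

    Returns the majority store (None if no two cores agree) and the
    indices of the cores that disagree with it.
    """
    winner, count = Counter(stores).most_common(1)[0]
    if count < 2:
        return None, list(range(len(stores)))
    faulty = [i for i, s in enumerate(stores) if s != winner]
    return winner, faulty


def run_dcr(program: List[Store], fault_at: Optional[int] = None,
            checkpoint_every: int = 2) -> Tuple[Dict[int, int], int]:
    """Run stores on two simulated cores with checkpoint-and-rollback.

    fault_at injects a one-time (transient) bit flip in core B's value.
    Returns the final memory and the number of rollbacks performed.
    """
    memory: Dict[int, int] = {}
    saved = (0, dict(memory))  # assumed context: (next index, memory copy)
    rollbacks = 0
    i = 0
    while i < len(program):
        if i % checkpoint_every == 0:
            saved = (i, dict(memory))  # context saving
        addr, value = program[i]
        store_a = (addr, value)
        store_b = (addr, value ^ 1) if i == fault_at else (addr, value)
        if i == fault_at:
            fault_at = None  # transient: the fault does not recur on retry
        if dcr_check(store_a, store_b):
            memory[addr] = value  # both cores agree, so the store commits
            i += 1
        else:
            i, snapshot = saved  # context recovery: re-execute from checkpoint
            memory = dict(snapshot)
            rollbacks += 1
    return memory, rollbacks


if __name__ == "__main__":
    print(run_dcr([(0, 10), (4, 20), (8, 30)], fault_at=1))
    print(tcr_vote([(4, 20), (4, 20), (4, 21)]))
```

In this model, a TCR configuration that has identified a disagreeing core through `tcr_vote` could continue with the remaining two cores by applying `dcr_check` to their stores, which mirrors the reconfiguration for fault detection described above. Only stores cross the core boundary in either function, which is the source of the bandwidth saving the abstract claims.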