The efficacy of error mitigation techniques for DRAM retention failures: a comparative experimental study

S. Khan, Donghyuk Lee, Yoongu Kim, Alaa R. Alameldeen, C. Wilkerson, O. Mutlu
Published in: Measurement and Modeling of Computer Systems, 2014-06-16
DOI: 10.1145/2591971.2592000 (https://doi.org/10.1145/2591971.2592000)
Citations: 176

Abstract

As DRAM cells continue to shrink, they become more susceptible to retention failures. DRAM cells that permanently exhibit short retention times are fairly easy to identify and repair through the use of memory tests and row and column redundancy. However, the retention time of many cells may vary over time due to a property called Variable Retention Time (VRT). Since these cells intermittently transition between failing and non-failing states, they are particularly difficult to identify through memory tests alone. In addition, the high-temperature packaging process may aggravate this problem, as the susceptibility of cells to VRT increases after the assembly of DRAM chips. A promising alternative to manufacture-time testing is to detect and mitigate retention failures after the system has become operational. Such a system would require mechanisms to detect and mitigate retention failures in the field, but would be responsive to retention failures introduced after system assembly and could dramatically reduce the cost of testing, enabling much longer tests than are practical with manufacturer testing equipment.

In this paper, we analyze the efficacy of three common error mitigation techniques (memory tests, guardbands, and error correcting codes (ECC)) in real DRAM chips exhibiting both intermittent and permanent retention failures. Our analysis allows us to quantify the efficacy of recent system-level error mitigation mechanisms that build upon these techniques. We revisit prior works in the context of the experimental data we present, showing that our measured results significantly impact these works' conclusions. We find that mitigation techniques that rely on run-time testing alone [38, 27, 50, 26] are unable to ensure reliable operation even after many months of testing. Techniques that incorporate ECC [4, 52], however, can ensure reliable DRAM operation after only a few hours of testing. For example, VS-ECC [4], which couples testing with variable-strength codes to allocate the strongest codes to the most error-prone memory regions, can ensure reliable operation for 10 years after only 19 minutes of testing. We conclude that the viability of these mitigation techniques depends on efficient online profiling of DRAM performed without disrupting system operation.
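The allocation idea attributed to VS-ECC above (give the strongest codes to the most error-prone regions found by testing) can be illustrated with a toy sketch. This is not the mechanism from [4]: the function name `allocate_ecc_strength`, the region granularity, the two strength labels, and the `strong_budget` parameter are all assumptions made for illustration, and the profiling log is made-up example input, not measured data.

```python
from collections import Counter

def allocate_ecc_strength(failing_regions, num_regions, strong_budget):
    """Toy allocation: label the regions with the most observed retention
    failures as "strong" (up to strong_budget of them) and the rest "weak".

    failing_regions: one region index per failing cell seen during testing
                     (hypothetical profiling output).
    """
    counts = Counter(failing_regions)
    # Rank regions by failure count (highest first), ties broken by index.
    ranked = sorted(range(num_regions), key=lambda r: (-counts[r], r))
    # Only regions that actually showed failures are eligible for a strong code.
    strong = {r for r in ranked[:strong_budget] if counts[r] > 0}
    return ["strong" if r in strong else "weak" for r in range(num_regions)]

# Hypothetical profiling log: region 2 failed three times, region 5 twice,
# region 7 once.
observed = [2, 2, 5, 2, 7, 5]
plan = allocate_ecc_strength(observed, num_regions=8, strong_budget=2)
print(plan)
```

With a budget of two strong codes, regions 2 and 5 receive them and region 7 stays on the weak code. A cell whose retention time varies (VRT) may not fail during a given test run, which is why the abstract stresses pairing such profiling with ECC instead of relying on testing alone.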