PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM

2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) Pub Date: 2016-06-01 DOI: 10.1109/DSN.2016.30

S. Khan, Donghyuk Lee, O. Mutlu


Citations: 123

Abstract

System-level detection and mitigation of DRAM failures offer a variety of system enhancements, such as better reliability, scalability, energy, and performance. Unfortunately, system-level detection is challenging for DRAM failures that depend on the data content of neighboring cells (data-dependent failures). DRAM vendors internally scramble/remap the system-level address space. Therefore, testing data-dependent failures using neighboring system-level addresses does not actually test the cells that are physically adjacent. In this work, we argue that one promising way to uncover data-dependent failures in the system is to determine the location of physically neighboring cells in the system address space. Unfortunately, if done naively, such a test takes 49 days to detect neighboring addresses even in a single memory row, making it infeasible in real systems. We develop PARBOR, an efficient system-level technique that determines the locations of the physically neighboring DRAM cells in the system address space and uses this information to detect data-dependent failures. To our knowledge, this is the first work that solves the challenge of detecting data-dependent failures in DRAM in the presence of DRAM-internal scrambling of system-level addresses. We experimentally demonstrate the effectiveness of PARBOR using 144 real DRAM chips from three major vendors. Our experimental evaluation shows that PARBOR 1) detects neighboring cell locations with only 66-90 tests, a 745,654X reduction compared to the naive test, and 2) uncovers 21.9% more failures compared to a random-pattern test that is unaware of the neighbor cell locations. We introduce a new mechanism that utilizes PARBOR to reduce refresh rate based on the data content of memory locations, thereby improving system performance and efficiency. 
We hope that our fast and efficient system-level detection technique enables other new ideas and mechanisms that improve the reliability, performance, and energy efficiency of DRAM-based memory systems.
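The abstract quotes a 49-day naive test and a 745,654X reduction without showing the arithmetic behind them. The following is a minimal back-of-the-envelope sketch, not the authors' calculation: it assumes a row of 8192 cells, a naive test that tries every ordered pair of cell positions, and 64 ms per test. None of these parameters is stated in the abstract, and the names `CELLS_PER_ROW` and `SECONDS_PER_TEST` are hypothetical. Under those assumptions the figures come out close to the quoted ones.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# ASSUMPTIONS (not stated in the abstract): 8192 cells per row, a naive
# test that tries every ordered pair of cell positions, 64 ms per test.
CELLS_PER_ROW = 8192
SECONDS_PER_TEST = 0.064
SECONDS_PER_DAY = 86_400

# Naive search: one test for each (victim, candidate neighbor) pair.
naive_tests = CELLS_PER_ROW ** 2
naive_days = naive_tests * SECONDS_PER_TEST / SECONDS_PER_DAY


def reduction(parbor_tests: int) -> float:
    """Ratio of naive test count to a given PARBOR test count."""
    return naive_tests / parbor_tests


print(f"naive tests: {naive_tests:,}")
print(f"naive duration: {naive_days:.1f} days")
for tests in (66, 90):  # the range of test counts reported in the abstract
    print(f"{tests} tests -> {reduction(tests):,.0f}x fewer tests")
```

With these assumed parameters, the naive duration is about 49.7 days, and dividing the naive test count by 90 gives roughly 745,654, which suggests the quoted reduction corresponds to the upper end of the 66-90 range.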


