Killi: Runtime Fault Classification to Deploy Low Voltage Caches without MBIST

2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). Pub Date: 2019-02-01. DOI: 10.1109/HPCA.2019.00046

Shrikanth Ganapathy, J. Kalamatianos, Bradford M. Beckmann, Steven E. Raasch, Lukasz G. Szafaryn


Citations: 8

Abstract

Supply voltage (VDD) scaling is one of the most effective mechanisms to reduce energy consumption in high-performance microprocessors. However, VDD scaling is challenging for SRAM-based on-chip memories such as caches due to persistent failures at low voltage (LV). Previously designed LV-enabling mechanisms require additional Memory Built-in Self-Test (MBIST) steps, employed either offline or online, to identify persistent failures for every LV operating mode. These additional MBIST steps are time-consuming, resulting in extended boot time or delayed power state transitions. Furthermore, most prior techniques combine MBIST-based solutions with customized Error Correction Codes (ECC), which suffer from non-trivial area or performance overheads. In this paper, we highlight the practical challenges of deploying LV techniques and propose a new low-cost error protection scheme, called Killi, which leverages conventional ECC and parity to enable LV operation. Foremost, the failing lines are discovered dynamically at runtime using both parity and ECC, eliminating the need for extra MBIST testing. Killi then provides on-demand error protection by decoupling cheap error detection from expensive error correction. Killi provides error detection capability to all lines using parity but employs Single Error Correction, Double Error Detection (SECDED) ECC only for the subset of lines with a single LV fault. All lines with more than one fault are disabled. We evaluate this completely hardware-enclosed solution on a GPU write-through L2 cache and show that Vmin (the minimum reliable VDD) can be reduced to 62.5% of nominal VDD when operating at 1 GHz, with a maximum performance degradation of only 0.8%. As a result, an 8-CU GPU with Killi can reduce the power consumption of the L2 cache by 59.3% compared to the baseline L2 cache running at nominal VDD. In addition, Killi reduces the error protection area overhead by 50% compared to SECDED ECC.
Keywords: cache, energy-efficiency, GPU, low voltage
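
The protection policy described in the abstract (parity on every line, SECDED ECC for lines with exactly one LV fault, and disabling of lines with more than one fault) can be illustrated with a small sketch. This is not the paper's hardware design: the names `LineState`, `parity_bit`, `parity_mismatch`, and `classify` are hypothetical, and the per-line fault count is simply passed in as an input, because the abstract does not say how the hardware accumulates that information at runtime.

```python
from enum import Enum


class LineState(Enum):
    """Protection level for a cache line (hypothetical names)."""
    PARITY_ONLY = "parity only"  # no known LV fault: detection suffices
    SECDED = "SECDED ECC"        # exactly one LV fault: correctable
    DISABLED = "disabled"        # more than one LV fault: line unused


def parity_bit(data: int) -> int:
    """Return the even-parity bit: 1 if data has an odd number of set bits."""
    return bin(data).count("1") & 1


def parity_mismatch(data: int, stored_parity: int) -> bool:
    """True if the data read back no longer matches its stored parity bit."""
    return parity_bit(data) != stored_parity


def classify(fault_count: int) -> LineState:
    """Map an observed per-line LV fault count to a protection level.

    Assumption: the count is already known; how it is gathered at
    runtime is outside the scope of this sketch.
    """
    if fault_count < 0:
        raise ValueError("fault_count must be non-negative")
    if fault_count == 0:
        return LineState.PARITY_ONLY
    if fault_count == 1:
        return LineState.SECDED
    return LineState.DISABLED


if __name__ == "__main__":
    # Hypothetical fault counts for four cache lines.
    observed_faults = {0: 0, 1: 1, 2: 3, 3: 0}
    for line, count in observed_faults.items():
        print(f"line {line}: {count} fault(s) -> {classify(count).value}")

    # A single bit flip changes the parity, so it is detected.
    word = 0b1011_0010
    stored = parity_bit(word)
    print(parity_mismatch(word ^ (1 << 4), stored))
```

A single parity bit detects any odd number of bit flips but misses an even number. That is consistent with the abstract's split: lines with one fault get SECDED for correction, and lines with more than one fault are disabled instead of being protected by parity alone.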
