Shrikanth Ganapathy, J. Kalamatianos, Bradford M. Beckmann, Steven E. Raasch, Lukasz G. Szafaryn
{"title":"Killi:运行时故障分类,部署无MBIST的低压缓存","authors":"Shrikanth Ganapathy, J. Kalamatianos, Bradford M. Beckmann, Steven E. Raasch, Lukasz G. Szafaryn","doi":"10.1109/HPCA.2019.00046","DOIUrl":null,"url":null,"abstract":"Supply voltage (VDD) scaling is one of the most effective mechanisms to reduce energy consumption in highperformance microprocessors. However, VDD scaling is challenging for SRAM-based on-chip memories such as caches due to persistent failures at low voltage (LV). Previously designed LV-enabling mechanisms require additional Memory Built-in Self-Test (MBIST) steps, employed either offline or online to identify persistent failures for every LV operating mode. However, these additional MBIST steps are time consuming, resulting in extended boot time or delayed power state transitions. Furthermore, most prior techniques combine MBIST-based solutions with customized Error Correction Codes (ECC), which suffer from non-trivial area or performance overheads. In this paper, we highlight the practical challenges for deploying LV techniques and propose a new low-cost error protection scheme, called Killi, which leverages conventional ECC and parity to enable LV operation. Foremost, the failing lines are discovered dynamically at runtime using both parity and ECC, negating the need for extra MBIST testing. Killi then provides on demand error protection by decoupling cheap error detection from expensive error correction. Killi provides error detection capability to all lines using parity but employs Single Error Correction, Double Error Detection (SECDED) ECC for a subset of the lines with a single LV fault. All lines with more than one fault are disabled. We evaluate this completely hardware enclosed solution on a GPU write-through L2 cache and show that the Vmin (minimum reliable VDD) can be reduced to 62.5% of nominal VDD when operating at 1GHz with only a maximum of 0.8% performance degradation. As a result, an 8CU GPU with Killi can reduce the power consumption of the L2 cache by 59.3% compared to the baseline L2 cache running at nominal VDD. In addition, Killi reduces the error protection area overhead by 50% compared to SECDED ECC. Keywords—cache, energy-efficiency, GPU, low voltage,","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Killi: Runtime Fault Classification to Deploy Low Voltage Caches without MBIST\",\"authors\":\"Shrikanth Ganapathy, J. Kalamatianos, Bradford M. Beckmann, Steven E. Raasch, Lukasz G. Szafaryn\",\"doi\":\"10.1109/HPCA.2019.00046\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Supply voltage (VDD) scaling is one of the most effective mechanisms to reduce energy consumption in highperformance microprocessors. However, VDD scaling is challenging for SRAM-based on-chip memories such as caches due to persistent failures at low voltage (LV). Previously designed LV-enabling mechanisms require additional Memory Built-in Self-Test (MBIST) steps, employed either offline or online to identify persistent failures for every LV operating mode. However, these additional MBIST steps are time consuming, resulting in extended boot time or delayed power state transitions. Furthermore, most prior techniques combine MBIST-based solutions with customized Error Correction Codes (ECC), which suffer from non-trivial area or performance overheads. In this paper, we highlight the practical challenges for deploying LV techniques and propose a new low-cost error protection scheme, called Killi, which leverages conventional ECC and parity to enable LV operation. Foremost, the failing lines are discovered dynamically at runtime using both parity and ECC, negating the need for extra MBIST testing. Killi then provides on demand error protection by decoupling cheap error detection from expensive error correction. Killi provides error detection capability to all lines using parity but employs Single Error Correction, Double Error Detection (SECDED) ECC for a subset of the lines with a single LV fault. All lines with more than one fault are disabled. We evaluate this completely hardware enclosed solution on a GPU write-through L2 cache and show that the Vmin (minimum reliable VDD) can be reduced to 62.5% of nominal VDD when operating at 1GHz with only a maximum of 0.8% performance degradation. As a result, an 8CU GPU with Killi can reduce the power consumption of the L2 cache by 59.3% compared to the baseline L2 cache running at nominal VDD. In addition, Killi reduces the error protection area overhead by 50% compared to SECDED ECC. Keywords—cache, energy-efficiency, GPU, low voltage,\",\"PeriodicalId\":102050,\"journal\":{\"name\":\"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)\",\"volume\":\"71 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPCA.2019.00046\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2019.00046","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Killi: Runtime Fault Classification to Deploy Low Voltage Caches without MBIST
Supply voltage (VDD) scaling is one of the most effective mechanisms to reduce energy consumption in highperformance microprocessors. However, VDD scaling is challenging for SRAM-based on-chip memories such as caches due to persistent failures at low voltage (LV). Previously designed LV-enabling mechanisms require additional Memory Built-in Self-Test (MBIST) steps, employed either offline or online to identify persistent failures for every LV operating mode. However, these additional MBIST steps are time consuming, resulting in extended boot time or delayed power state transitions. Furthermore, most prior techniques combine MBIST-based solutions with customized Error Correction Codes (ECC), which suffer from non-trivial area or performance overheads. In this paper, we highlight the practical challenges for deploying LV techniques and propose a new low-cost error protection scheme, called Killi, which leverages conventional ECC and parity to enable LV operation. Foremost, the failing lines are discovered dynamically at runtime using both parity and ECC, negating the need for extra MBIST testing. Killi then provides on demand error protection by decoupling cheap error detection from expensive error correction. Killi provides error detection capability to all lines using parity but employs Single Error Correction, Double Error Detection (SECDED) ECC for a subset of the lines with a single LV fault. All lines with more than one fault are disabled. We evaluate this completely hardware enclosed solution on a GPU write-through L2 cache and show that the Vmin (minimum reliable VDD) can be reduced to 62.5% of nominal VDD when operating at 1GHz with only a maximum of 0.8% performance degradation. As a result, an 8CU GPU with Killi can reduce the power consumption of the L2 cache by 59.3% compared to the baseline L2 cache running at nominal VDD. In addition, Killi reduces the error protection area overhead by 50% compared to SECDED ECC. Keywords—cache, energy-efficiency, GPU, low voltage,