{"title":"松弛故障记忆修复","authors":"Dong-Wan Kim, M. Erez","doi":"10.1145/3007787.3001205","DOIUrl":null,"url":null,"abstract":"Memory system reliability is a serious concern in many systems today, and is becoming more worrisome as technology scales and system size grows. Stronger fault tolerance capability is therefore desirable, but often comes at high cost. In this paper, we propose a low-cost, fault-aware, hardware-only resilience mechanism, RelaxFault, that repairs the vast majority of memory faults using a small amount of the LLC to remap faulty memory locations. RelaxFault requires less than 100KiB of LLC capacity, has near-zero impact on performance and power. By repairing faults, RelaxFault relaxes the requirement for high fault tolerance of other mechanisms, such as ECC. A better tradeoff between resilience and overhead is made by exploiting an understanding of memory system architecture and fault characteristics. We show that RelaxFault provides better repair capability than prior work of similar cost, improves memory reliability to a greater extent, and significantly reduces the number of maintenance events and memory module replacements. We also propose a more refined memory fault model than prior work and demonstrate its importance.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"88 1","pages":"645-657"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"RelaxFault Memory Repair\",\"authors\":\"Dong-Wan Kim, M. Erez\",\"doi\":\"10.1145/3007787.3001205\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Memory system reliability is a serious concern in many systems today, and is becoming more worrisome as technology scales and system size grows. Stronger fault tolerance capability is therefore desirable, but often comes at high cost. In this paper, we propose a low-cost, fault-aware, hardware-only resilience mechanism, RelaxFault, that repairs the vast majority of memory faults using a small amount of the LLC to remap faulty memory locations. RelaxFault requires less than 100KiB of LLC capacity, has near-zero impact on performance and power. By repairing faults, RelaxFault relaxes the requirement for high fault tolerance of other mechanisms, such as ECC. A better tradeoff between resilience and overhead is made by exploiting an understanding of memory system architecture and fault characteristics. We show that RelaxFault provides better repair capability than prior work of similar cost, improves memory reliability to a greater extent, and significantly reduces the number of maintenance events and memory module replacements. We also propose a more refined memory fault model than prior work and demonstrate its importance.\",\"PeriodicalId\":6634,\"journal\":{\"name\":\"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)\",\"volume\":\"88 1\",\"pages\":\"645-657\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3007787.3001205\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3007787.3001205","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Memory system reliability is a serious concern in many systems today, and is becoming more worrisome as technology scales and system size grows. Stronger fault tolerance capability is therefore desirable, but often comes at high cost. In this paper, we propose a low-cost, fault-aware, hardware-only resilience mechanism, RelaxFault, that repairs the vast majority of memory faults using a small amount of the LLC to remap faulty memory locations. RelaxFault requires less than 100KiB of LLC capacity, has near-zero impact on performance and power. By repairing faults, RelaxFault relaxes the requirement for high fault tolerance of other mechanisms, such as ECC. A better tradeoff between resilience and overhead is made by exploiting an understanding of memory system architecture and fault characteristics. We show that RelaxFault provides better repair capability than prior work of similar cost, improves memory reliability to a greater extent, and significantly reduces the number of maintenance events and memory module replacements. We also propose a more refined memory fault model than prior work and demonstrate its importance.