
Latest publications: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)

Steal but No Force: Efficient Hardware Undo+Redo Logging for Persistent Memory Systems
Matheus A. Ogleari, E. Miller, Jishen Zhao
Persistent memory is a new tier of memory that functions as a hybrid of traditional storage systems and main memory. It combines the benefits of both: the data persistence of storage with the fast load/store interface of memory. Most previous persistent memory designs place careful control over the order of writes arriving at persistent memory. This can prevent caches and memory controllers from optimizing system performance through write coalescing and reordering. We identify that such write-order control can be relaxed by employing undo+redo logging for data in persistent memory systems. However, traditional software logging mechanisms are expensive to adopt in persistent memory due to performance and energy overheads. Previously proposed hardware logging schemes are inefficient and do not fully address the issues in software. To address these challenges, we propose a hardware undo+redo logging scheme which maintains data persistence by leveraging the write-back, write-allocate policies used in commodity caches. Furthermore, we develop a cache force-write-back mechanism in hardware to significantly reduce the performance and energy overheads from forcing data into persistent memory. Our evaluation across persistent memory microbenchmarks and real workloads demonstrates that our design significantly improves system throughput and reduces both dynamic energy and memory traffic. It also provides strong consistency guarantees compared to software approaches.
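The title's "steal" and "no force" are buffer-management terms: a steal policy lets dirty data reach persistent memory before a transaction commits, and a no-force policy lets a commit complete without flushing its data. Undo+redo logging is what makes both safe, since the log can roll uncommitted updates back and roll committed ones forward. A minimal software sketch of that invariant, with illustrative names (the paper implements the logging transparently in hardware):

```python
# Sketch only: the paper performs this logging in cache/memory-controller
# hardware; PersistentStore and its methods are illustrative stand-ins.

class PersistentStore:
    def __init__(self):
        self.data = {}   # stands in for persistent memory
        self.log = []    # undo+redo log, assumed persisted before data

    def write(self, txid, addr, new_value):
        old_value = self.data.get(addr, 0)
        # Record BOTH the old value (undo) and new value (redo) first.
        # With both directions logged, a dirty cache line may be evicted
        # before commit ("steal") and need not be flushed at commit
        # ("no force"): recovery can always restore a consistent state.
        self.log.append(("update", txid, addr, old_value, new_value))
        self.data[addr] = new_value

    def commit(self, txid):
        self.log.append(("commit", txid))

    def recover(self):
        committed = {e[1] for e in self.log if e[0] == "commit"}
        for e in self.log:                    # redo committed, in order
            if e[0] == "update" and e[1] in committed:
                self.data[e[2]] = e[4]
        for e in reversed(self.log):          # undo uncommitted, in reverse
            if e[0] == "update" and e[1] not in committed:
                self.data[e[2]] = e[3]

store = PersistentStore()
store.write("t1", 0x10, 42); store.commit("t1")
store.write("t2", 0x10, 99)                   # t2 never commits
store.recover()
assert store.data[0x10] == 42                 # t1 redone, t2 undone
```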
DOI: 10.1109/HPCA.2018.00037
Citations: 78
Characterizing Resource Sensitivity of Database Workloads
Rathijit Sen, Karthik Ramachandra
The performance of real-world database workloads is heavily influenced by the resources available to run the workload. Therefore, understanding the performance impact of changes in resource allocations on a workload is key to achieving predictable performance. In this work, we perform an in-depth study of the sensitivity of several database workloads, running on Microsoft SQL Server on Linux, to resources such as cores, caches, main memory, and non-volatile storage. We consider transactional, analytical, and hybrid workloads that model real-world systems, and use recommended configurations such as storage layouts and index organizations at different scale factors. Our study lays out the wide spectrum of resource sensitivities, and leads to several findings and insights that are highly valuable to computer architects, cloud DBaaS (Database-as-a-Service) providers, database researchers, and practitioners. For instance, our results indicate that throughput improves more with more cores than with more cache beyond a critical cache size; depending upon the compute vs. I/O activity of a workload, hyper-threading may be detrimental in some cases. We discuss our extensive experimental results and present insights based on a comprehensive analysis of query plans and various query execution statistics.
DOI: 10.1109/HPCA.2018.00062
Citations: 14
NACHOS: Software-Driven Hardware-Assisted Memory Disambiguation for Accelerators
Naveen Vedula, Arrvindh Shriraman, Snehasish Kumar, Nick Sumner
Hardware accelerators have relied on the compiler to extract instruction parallelism but may waste significant energy in enforcing memory ordering and discovering memory parallelism. Accelerators tend to either serialize memory operations [43] or reuse power-hungry load-store queues (LSQs) [8], [27]. Recent works [11], [15] use the compiler for scheduling but continue to rely on LSQs for memory disambiguation. NACHOS is a hardware-assisted, software-driven approach to memory disambiguation for accelerators. In NACHOS, the compiler classifies pairs of memory operations as NO alias (i.e., independent memory operations), MUST alias (i.e., ordering required), or MAY alias (i.e., compiler uncertain). We developed a compiler-only approach called NACHOS-SW that serializes memory operations both when the compiler is certain (MUST alias) and uncertain (MAY alias). Our study analyzes multiple stages of alias analysis on 135 acceleration regions extracted from SPEC2K, SPEC2K6, and PARSEC. NACHOS-SW is energy efficient, but serialization limits performance: an 18%–100% slowdown compared to an optimized LSQ. We then propose NACHOS, a low-overhead, scalable, hardware comparator assist that dynamically verifies MAY aliases and executes independent memory operations in parallel. NACHOS is a pay-as-you-go approach where the compiler filters out memory operations to save dynamic energy, and the hardware dynamically checks to find MLP. NACHOS achieves performance comparable to an optimized LSQ; in fact, it improves performance in 6 benchmarks (6%–70%) by reducing load-to-use latency for cache hits. NACHOS imposes no energy overhead in 15 out of 27 benchmarks, i.e., the compiler accurately determines all memory dependencies; the average energy overhead is ~6% of the total (accelerator and L1 cache); in comparison, an optimized LSQ consumes 27% of total energy. NACHOS is released as free and open source software. GitHub: https://github.com/sfu-arch/nachos
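The three-way classification is the crux: only MAY-alias pairs are treated differently by the two schemes. A small sketch of the decision logic, with hypothetical names rather than the actual NACHOS compiler interfaces:

```python
from enum import Enum

class Alias(Enum):
    NO = 0    # provably independent memory operations
    MUST = 1  # provably dependent: ordering required
    MAY = 2   # compiler uncertain

def handle_pair(alias, hardware_assist):
    """How a pair of memory operations is treated under each scheme."""
    if alias is Alias.NO:
        return "issue in parallel"   # both schemes exploit certainty
    if alias is Alias.MUST:
        return "serialize"           # the ordering is real
    # MAY alias: NACHOS-SW conservatively serializes, while NACHOS
    # issues in parallel and lets hardware comparators verify at runtime.
    return "issue in parallel + dynamic check" if hardware_assist \
        else "serialize"

assert handle_pair(Alias.MAY, hardware_assist=False) == "serialize"
assert handle_pair(Alias.MAY, hardware_assist=True) \
    == "issue in parallel + dynamic check"
```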
DOI: 10.1109/HPCA.2018.00066
Citations: 6
HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness
Yixin Luo, Saugata Ghose, Yu Cai, E. Haratsch, O. Mutlu
NAND flash memory density continues to scale to keep up with the increasing storage demands of data-intensive applications. Unfortunately, as a result of this scaling, the lifetime of NAND flash memory has been decreasing. Each cell in NAND flash memory can endure only a limited number of writes, due to the damage caused by each program and erase operation on the cell. This damage can be partially repaired on its own during the idle time between program or erase operations (known as the dwell time), via a phenomenon known as the self-recovery effect. Prior works study the self-recovery effect for planar (i.e., 2D) NAND flash memory, and propose to exploit it to improve flash lifetime, by applying high temperature to accelerate self-recovery. However, these findings may not be directly applicable to 3D NAND flash memory, due to significant changes in the design and manufacturing process that are required to enable practical 3D stacking for NAND flash memory. In this paper, we perform the first detailed experimental characterization of the effects of self-recovery and temperature on real, state-of-the-art 3D NAND flash memory devices. We show that these effects influence two major factors of NAND flash memory reliability: (1) retention loss speed (i.e., the speed at which a flash cell leaks charge), and (2) program variation (i.e., the difference in programming speed across flash cells). We find that self-recovery and temperature affect 3D NAND flash memory quite differently than they affect planar NAND flash memory, rendering prior models of self-recovery and temperature ineffective for 3D NAND flash memory. Using our characterization results, we develop a new model for 3D NAND flash memory reliability, which predicts how retention, wearout, self-recovery, and temperature affect raw bit error rates and cell threshold voltages. We show that our model is accurate, with an error of only 4.9%. Based on our experimental findings and our model, we propose HeatWatch, a new mechanism to improve 3D NAND flash memory reliability. The key idea of HeatWatch is to optimize the read reference voltage, i.e., the voltage applied to the cell during a read operation, by adapting it to the dwell time of the workload and the current operating temperature. HeatWatch (1) efficiently tracks flash memory temperature and dwell time online, (2) sends this information to our reliability model to predict the current voltages of flash cells, and (3) predicts the optimal read reference voltage based on the current cell voltages. Our detailed experimental evaluations show that HeatWatch improves flash lifetime by 3.85× over a baseline that uses a fixed read reference voltage, averaged across 28 real storage workload traces, and comes within 0.9% of the lifetime of an ideal read reference voltage selection mechanism.
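The mechanism reduces to a small control loop: track thermal history online, feed it to the reliability model, and pick the read reference voltage the model favors. A conceptual sketch, where the model function and its coefficients are placeholders rather than the paper's fitted model:

```python
def select_read_vref(pe_cycles, dwell_time_s, temperature_c, retention_s,
                     model, candidate_vrefs):
    """Pick the candidate read reference voltage closest to the predicted
    shift of the cells' threshold-voltage distribution."""
    predicted_shift = model(pe_cycles, dwell_time_s, temperature_c,
                            retention_s)
    return min(candidate_vrefs, key=lambda v: abs(v - predicted_shift))

# Dummy linear model standing in for the paper's characterization-derived
# one; the coefficients are made up purely so the example runs.
dummy_model = lambda pe, dwell, temp, ret: 0.5 - 1e-5 * ret + 1e-6 * dwell
vref = select_read_vref(3000, 3600, 40, 86400, dummy_model,
                        [k / 10 for k in range(-10, 11)])
```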
DOI: 10.1109/HPCA.2018.00050
Citations: 102
DUO: Exposing On-Chip Redundancy to Rank-Level ECC for High Reliability
Seong-Lyong Gong, Jungrae Kim, Sangkug Lym, Michael B. Sullivan, Howard David, M. Erez
DRAM row and column sparing cannot efficiently tolerate the increasing inherent fault rate caused by continued process scaling. In-DRAM ECC (IECC), an appealing alternative to sparing, can resolve inherent faults without significant changes to DRAM, but it is inefficient for highly-reliable systems where rank-level ECC (RECC) is already used against operational faults. In addition, DRAM design in the near future (possibly as early as DDR5) may transfer data in longer bursts, which complicates high-reliability RECC due to fewer devices being used per rank and increased fault granularity. We propose dual use of on-chip redundancy (DUO), a mechanism that bypasses the IECC module and transfers on-chip redundancy to be used directly for RECC. Due to its increased redundancy budget, DUO enables a strong and novel RECC for highly-reliable systems, called DUO SDDC. The long codewords of DUO SDDC provide fundamentally higher detection and correction capabilities, and several novel secondary-correction techniques integrate together to further expand its correction capability. According to our evaluation results, DUO shows performance degradation on par with or better than IECC (average 2–3%), while consuming less DRAM energy than IECC (average 4–14% overheads). DUO provides higher reliability than either IECC or the state-of-the-art ECC technique. We show the robust reliability of DUO SDDC by comparing it to other ECC schemes using two different inherent fault-error models.
DOI: 10.1109/HPCA.2018.00064
Citations: 21
Making Memristive Neural Network Accelerators Reliable
Ben Feinberg, Shibo Wang, Engin Ipek
Deep neural networks (DNNs) have attracted substantial interest in recent years due to their superior performance on many classification and regression tasks as compared to other supervised learning models. DNNs often require a large amount of data movement, resulting in performance and energy overheads. One promising way to address this problem is to design an accelerator based on in-situ analog computing that leverages the fundamental electrical properties of memristive circuits to perform matrix-vector multiplication. Recent work on analog neural network accelerators has shown great potential in improving both the system performance and the energy efficiency. However, detecting and correcting the errors that occur during in-memory analog computation remains largely unexplored. The same electrical properties that provide the performance and energy improvements make these systems especially susceptible to errors, which can severely hurt the accuracy of the neural network accelerators. This paper examines a new error correction scheme for analog neural network accelerators based on arithmetic codes. The proposed scheme encodes the data through multiplication by an integer, which preserves addition operations through the distributive property. Error detection and correction are performed through a modulus operation and a correction table lookup. This basic scheme is further improved by data-aware encoding to exploit the state dependence of the errors, and by knowledge of how critical each portion of the computation is to overall system accuracy. By leveraging the observation that a physical row that contains fewer 1s is less susceptible to an error, the proposed scheme increases the effective error correction capability with less than 4.5% area and less than 4.7% energy overheads. When applied to a memristive DNN accelerator performing inference on the MNIST and ILSVRC-2012 datasets, the proposed technique reduces the respective misclassification rates by 1.5x and 1.1x.
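The key property of the arithmetic code is easy to demonstrate: encoding a value x as A·x preserves addition through distributivity, and any result that is not a multiple of A signals an error. A toy sketch with an arbitrary A = 7 (detection only; unambiguous correction needs an A whose per-bit-flip residues are distinct, plus the correction table the abstract describes):

```python
A = 7  # arbitrary check multiplier for the sketch

def encode(x):
    return A * x               # codewords are exactly the multiples of A

def check(codeword):
    return codeword % A == 0   # a nonzero residue signals a bit error

cx, cy = encode(5), encode(9)
total = cx + cy                # A*5 + A*9 == A*(5 + 9): addition preserved
assert check(total) and total // A == 14

flipped = total ^ (1 << 3)     # inject a single-bit fault (98 -> 106)
assert not check(flipped)      # 106 % 7 == 1: the residue exposes the error
```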
DOI: 10.1109/HPCA.2018.00015
Citations: 92
Architectural Support for Task Dependence Management with Flexible Software Scheduling
Emilio Castillo, Lluc Alvarez, Miquel Moretó, Marc Casas, E. Vallejo, J. L. Bosque, R. Beivide, M. Valero
The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence tracking operations to the DMU and to still perform task scheduling in software. With lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x less area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains.
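Functionally, what the DMU offloads is classic dependence-counter bookkeeping, while the ready queue stays under software control. A simplified software analogue with illustrative names (it ignores races and dependences on already-completed tasks):

```python
from collections import defaultdict

class DependenceManager:
    def __init__(self):
        self.pending = {}                    # task -> unmet dependence count
        self.dependents = defaultdict(list)  # task -> tasks waiting on it
        self.ready = []                      # consumed by a software scheduler

    def submit(self, task, deps):
        self.pending[task] = len(deps)
        for d in deps:
            self.dependents[d].append(task)
        if not deps:
            self.ready.append(task)

    def complete(self, task):
        # In TDM this bookkeeping runs in the DMU; policy over the ready
        # list (ordering, affinity, ...) remains a software decision.
        for t in self.dependents.pop(task, []):
            self.pending[t] -= 1
            if self.pending[t] == 0:
                self.ready.append(t)

dm = DependenceManager()
dm.submit("A", []); dm.submit("B", ["A"])
dm.complete("A")
assert dm.ready == ["A", "B"]
```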
DOI: 10.1109/HPCA.2018.00033
Citations: 10
Adaptive Memory Fusion: Towards Transparent, Agile Integration of Persistent Memory
Dongliang Xue, Chao Li, Linpeng Huang, Chentao Wu, Tianyou Li
The great promise of in-memory computing inspires engineers to scale their main memory subsystems in a timely and efficient manner. Offering greatly expanded capacity at near-DRAM speed, today's new-generation persistent memory (PM) module is no doubt an ideal candidate for system upgrade. However, integrating DRAM-comparable PMs in current enterprise systems faces major barriers in terms of extensive system modifications for software compatibility and complex runtime support. In addition, the very large PM capacity unavoidably results in massive metadata, which introduces significant performance and energy overhead. The inefficiency issue becomes even more acute when the memory system reaches its capacity limit or the application requires large memory space allocation. In this paper we propose adaptive memory fusion (AMF), a novel PM integration scheme that jointly solves the above issues. Rather than struggle to adapt to the persistence property of PM through modifying the full software stack, we focus on exploiting the high-capacity feature of emerging PM modules. AMF is designed to be totally transparent to user applications by carefully hiding PM devices and managing the available PM space in a DRAM-like way. To further improve performance, we devise a holistic optimization scheme that allows the system to efficiently utilize system resources. Specifically, AMF is able to adaptively release PM based on memory pressure status, smartly reclaim PM pages, and enable fast space expansion with direct PM pass-through. We implement AMF as a kernel subsystem in Linux. Compared to traditional approaches, AMF decreases the number of page faults of high-resident-set benchmarks by up to 67.8%, with an average of 46.1%. Using a realistic in-memory database, we show that AMF outperforms existing solutions by 57.7% on SQLite and 21.8% on Redis. Overall, AMF represents a more lightweight design approach and it would greatly encourage rapid and flexible adoption of PM in the near future.
DOI: 10.1109/HPCA.2018.00036
Citations: 11
Reliability-Aware Data Placement for Heterogeneous Memory Architecture
Manish Gupta, Vilas Sridharan, D. Roberts, A. Prodromou, A. Venkat, D. Tullsen, Rajesh K. Gupta
System reliability is a first-class concern as technology continues to shrink, resulting in increased vulnerability to traditional sources of errors such as single-event upsets. By tracking access counts and the Architectural Vulnerability Factor (AVF), application data can be partitioned into groups based on how frequently it is accessed (its "hotness") and its likelihood to cause program execution error (its "risk"). This is particularly useful for memory systems which exhibit heterogeneity in their performance and reliability, such as Heterogeneous Memory Architectures, where a typical configuration combines slow, highly reliable memory with faster, less reliable memory. This work demonstrates that current state-of-the-art, performance-focused data placement techniques adversely affect reliability. It shows that page risk is not necessarily correlated with its hotness; this makes it possible to identify pages that are both hot and low risk, enabling page placement strategies that can find a good balance of performance and reliability. This work explores heuristics to identify and monitor both hotness and risk at run-time, and further proposes static, dynamic, and program annotation-based reliability-aware data placement techniques. This enables an architect to choose among available memories with diverse performance and reliability characteristics. The proposed heuristic-based reliability-aware data placement improves reliability by a factor of 1.6x compared to performance-focused static placement while limiting the performance degradation to 1%. A dynamic reliability-aware migration scheme, which does not require prior knowledge about the application, improves reliability by a factor of 1.5x on average while limiting the performance loss to 4.9%. Finally, program annotation-based data placement improves reliability by 1.3x at a performance cost of 1.1%.
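The resulting placement rule is compact: only pages that are both hot and low-risk belong on the fast, less reliable memory. A sketch with illustrative thresholds (the paper derives hotness from access counts and risk from AVF estimates at run-time):

```python
def place_page(access_count, avf, hot_threshold=1000, risk_threshold=0.1):
    """Map a page to a memory tier from its hotness and AVF-based risk."""
    hot = access_count >= hot_threshold
    risky = avf >= risk_threshold
    if hot and not risky:
        return "fast, less reliable tier"   # performance with little exposure
    return "slow, highly reliable tier"     # risky or cold pages stay safe

assert place_page(5000, 0.01) == "fast, less reliable tier"
assert place_page(5000, 0.50) == "slow, highly reliable tier"
```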
DOI: 10.1109/HPCA.2018.00056
Citations: 25
Memory System Design for Ultra Low Power, Computationally Error Resilient Processor Microarchitectures
S. Srikanth, Paul G. Rabbat, Eric R. Hein, Bobin Deng, T. Conte, E. Debenedictis, Jeanine E. Cook, M. Frank
Dennard scaling ended a decade ago. Energy reduction by lowering supply voltage has been limited because of guard bands and a subthreshold slope of over 60 mV/decade in MOSFETs. On the other hand, newly-proposed logic devices maintain a high on/off ratio for drain currents even at significantly lower operating voltages. However, such ultra low power technology would eventually suffer from intermittent errors in logic as a result of operating close to the thermal noise floor. Computational error correction mitigates this issue by efficiently correcting stochastic bit errors that may occur in computational logic operating at low signal energies, thereby allowing for energy reduction by lowering supply voltage to tens of millivolts. Cores based on a Redundant Residue Number System (RRNS), which represents a number using a tuple of smaller numbers, are a promising candidate for implementing energy-efficient computational error correction. However, prior RRNS core microarchitectures abstract away the memory hierarchy and do not consider the power-performance impact of RNS-based memory addressing. When compared with a non-error-correcting core addressing memory in binary, naive RNS-based memory addressing schemes cause a slowdown of over 3x/2x for in-order/out-of-order cores, respectively. In this paper, we analyze RNS-based memory access pattern behavior and provide solutions in the form of novel schemes and the resulting design space exploration, thereby extending and enabling a tangible, ultra low power RRNS-based architecture.
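To make "a tuple of smaller numbers" concrete: an RRNS carries a value as its residues modulo pairwise-coprime bases, with extra redundant residues for checking. A worked sketch with arbitrary moduli (error detection shown; full RRNS correction would also locate and replace the faulty digit):

```python
from math import prod

MODULI = [3, 5, 7]      # non-redundant bases: values 0..104 representable
REDUNDANT = [11, 13]    # redundant bases used only for consistency checks

def encode(x):
    return [x % m for m in MODULI + REDUNDANT]

def crt(residues, moduli):
    """Chinese Remainder Theorem reconstruction."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

def check(digits):
    # Reconstruct from the non-redundant digits, then confirm that the
    # redundant digits agree; any mismatch flags a corrupted digit.
    x = crt(digits[:len(MODULI)], MODULI)
    return all(x % m == r for m, r in zip(REDUNDANT, digits[len(MODULI):]))

t = encode(42)            # [0, 2, 0, 9, 3]
assert crt(t[:3], MODULI) == 42 and check(t)
t[1] = (t[1] + 1) % 5     # corrupt one residue digit
assert not check(t)       # the redundancy exposes the error
```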
DOI: 10.1109/HPCA.2018.00065
Citations: 5