
Latest publications: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)

Stash directory: A scalable directory for many-core coherence
Socrates Demetriades, Sangyeun Cho
Maintaining coherence in large-scale chip multiprocessors (CMPs) embodies tremendous design trade-offs in meeting area, energy, and performance requirements. Sparse directory organizations represent the most energy-efficient and scalable approach to many-core coherence. However, their limited associativity disallows a one-to-one correspondence between directory entries and cached blocks, rendering them inadequate for tracking all cached blocks. Unless the directory storage is generously over-provisioned, conflicts will force frequent invalidations of cached blocks, severely jeopardizing system performance. As chip area and power become increasingly precious with growing core counts, over-provisioning the directory storage becomes unsustainably costly. Stash Directory is a novel sparse directory design that allows directory entries tracking private blocks to be safely evicted without invalidating the corresponding cached blocks. By doing so, it improves system performance and increases the effective directory capacity, enabling significantly smaller directory designs. To ensure correct coherence under the new relaxed inclusion property, Stash Directory delegates to the last-level cache the responsibility of discovering hidden cached blocks when necessary, without raising significant overhead concerns. Simulations on a 16-core CMP model show that Stash Directory can reduce space requirements to 1/8 of a conventional sparse directory without compromising performance.
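The relaxed inclusion property can be illustrated with a toy model (a minimal Python sketch with invented names; the paper's hardware design is far more involved): evicting a directory entry that tracks a private block hides the block rather than invalidating it, and a later lookup falls back to the last-level cache to rediscover it.

```python
# Toy model of Stash Directory's relaxed inclusion. Illustrative only:
# class/method names and the eviction policy are our own assumptions.

class StashDirectory:
    def __init__(self, capacity):
        self.capacity = capacity   # max tracked entries
        self.entries = {}          # block address -> owner core
        self.hidden = set()        # privately cached but untracked ("stashed") blocks

    def track(self, addr, owner):
        if len(self.entries) >= self.capacity:
            # Evict an entry tracking a private block WITHOUT invalidating
            # the owner's cached copy: the block merely becomes "hidden".
            victim = next(iter(self.entries))
            del self.entries[victim]
            self.hidden.add(victim)
        self.entries[addr] = owner

    def lookup(self, addr):
        # A hidden block is rediscovered on demand by querying the LLC
        # (simulated here as a membership test on the hidden set).
        if addr in self.entries:
            return "tracked"
        if addr in self.hidden:
            self.hidden.discard(addr)
            return "discovered-in-llc"
        return "uncached"

d = StashDirectory(capacity=2)
d.track(0x100, owner=0)
d.track(0x200, owner=1)
d.track(0x300, owner=2)   # evicts the entry for 0x100; its cached copy survives
```

The key contrast with a conventional sparse directory is the eviction path: a conventional design would invalidate the cached copy of `0x100`, while this sketch only drops the bookkeeping and recovers it later.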
{"title":"Stash directory: A scalable directory for many-core coherence","authors":"Socrates Demetriades, Sangyeun Cho","doi":"10.1109/HPCA.2014.6835928","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835928","url":null,"abstract":"Maintaining coherence in large-scale chip multiprocessors (CMPs) embodies tremendous design trade-offs in meeting the area, energy and performance requirements. Sparse directory organizations represent the most energy-efficient and scalable approach towards many-core coherence. However, their limited associativity disallows the one-to-one correspondence of directory entries to cached blocks, rendering them inadequate in tracking all cached blocks. Unless the directory storage is generously over-provisioned, conflicts will force frequent invalidations of cached blocks, severely jeopardizing the system performance. As the chip area and power become increasingly precious with the growing core count, over-provisioning the directory storage becomes unsustainably costly. Stash Directory is a novel sparse directory design that allows directory entries tracking private blocks to be safely evicted without invalidating the corresponding cached blocks. By doing so, it improves system performance and increases the effective directory capacity, enabling significantly smaller directory designs. To ensure correct coherence under the new relaxed inclusion property, stash directory delegates to the last level cache the responsibility to discover hidden cached blocks when necessary, without however raising significant overhead concerns. 
Simulations on a 16-core CMP model show that Stash Directory can reduce space requirements to 1/8 of a conventional sparse directory, without compromising performance.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132804679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Implications of high energy proportional servers on cluster-wide energy proportionality
Daniel Wong, M. Annavaram
Cluster-level packing techniques have long been used to improve the energy proportionality of server clusters by masking the poor energy proportionality of individual servers. With the emergence of high energy proportional servers, we revisit whether cluster-level packing techniques are still the most effective way to achieve high cluster-wide energy proportionality. Our findings indicate that cluster-level packing techniques can eventually limit cluster-wide energy proportionality and it may be more beneficial to depend solely on server-level low power techniques. Server-level low power techniques generally require a high latency slack to be effective due to diminishing idle periods as server core count increases. In order for server-level low power techniques to be a viable alternative, the latency slack required for these techniques must be lowered. We found that server-level active low power modes offer the lowest latency slack, independent of server core count, and propose low power mode switching policies to meet the best-case latency slack under realistic conditions. By overcoming these major issues, we show that server-level low power modes can be a viable alternative to cluster-level packing techniques in providing high cluster-wide energy proportionality.
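As a rough intuition for what "high energy proportionality" means, the sketch below computes an area-based energy-proportionality score for a linear server power model. The metric shown (1 for perfectly proportional power, 0 for flat power) and both power curves are our illustrative assumptions, not the paper's measurements.

```python
# Illustrative energy-proportionality (EP) score: compare the area under
# the server's power-vs-utilization curve against the ideal linear curve
# through the origin. Metric formulation and numbers are our assumptions.

def energy_proportionality(power_at):
    """power_at(u) -> watts for utilization u in [0, 1]."""
    n = 1000
    us = [i / n for i in range(n + 1)]
    peak = power_at(1.0)
    area_real = sum(power_at(u) for u in us) / len(us)
    area_ideal = sum(peak * u for u in us) / len(us)
    return 1.0 - (area_real - area_ideal) / area_ideal

def legacy(u):
    # Older server: idles at 60% of its 100 W peak (poor proportionality).
    return 60 + 40 * u

def modern(u):
    # High energy proportional server: idles at 10% of peak.
    return 10 + 90 * u
```

Under this formulation a flat idle floor directly erodes the score, which is why cluster-level packing (keeping fewer servers busier) helped legacy hardware but matters less once individual servers are already highly proportional.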
{"title":"Implications of high energy proportional servers on cluster-wide energy proportionality","authors":"Daniel Wong, M. Annavaram","doi":"10.1109/HPCA.2014.6835925","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835925","url":null,"abstract":"Cluster-level packing techniques have long been used to improve the energy proportionality of server clusters by masking the poor energy proportionality of individual servers. With the emergence of high energy proportional servers, we revisit whether cluster-level packing techniques are still the most effective way to achieve high cluster-wide energy proportionality. Our findings indicate that cluster-level packing techniques can eventually limit cluster-wide energy proportionality and it may be more beneficial to depend solely on server-level low power techniques. Server-level low power techniques generally require a high latency slack to be effective due to diminishing idle periods as server core count increases. In order for server-level low power techniques to be a viable alternative, the latency slack required for these techniques must be lowered. We found that server-level active low power modes offer the lowest latency slack, independent of server core count, and propose low power mode switching policies to meet the best-case latency slack under realistic conditions. 
By overcoming these major issues, we show that server-level low power modes can be a viable alternative to cluster-level packing techniques in providing high cluster-wide energy proportionality.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131744719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
Concurrent and consistent virtual machine introspection with hardware transactional memory
Yutao Liu, Yubin Xia, Haibing Guan, B. Zang, Haibo Chen
Virtual machine introspection, which provides tamper-resistant, high-fidelity "out of the box" monitoring of virtual machines, has many prominent security applications, including VM-based intrusion detection, malware analysis, and memory forensic analysis. However, prior approaches either intrusively stop the world to avoid race conditions between introspection tools and the guest VM, or provide no guarantee of obtaining a consistent state of the guest VM. Further, there is currently no effective means of examining the VM states in question in a timely manner. In this paper, we propose a novel approach, called TxIntro, which retrofits hardware transactional memory (HTM) for concurrent, timely, and consistent introspection of guest VMs. Specifically, TxIntro leverages the strong atomicity of HTM to actively monitor updates to critical kernel data structures. TxIntro can then mount introspection to detect malicious tampering in a timely fashion. To avoid fetching inconsistent kernel states for introspection, TxIntro uses HTM to add related synchronization states into the read set of the monitoring core, and can thus easily detect potential in-flight concurrent kernel updates. We have implemented and evaluated TxIntro, based on the Xen VMM, on a commodity Intel Haswell machine that provides restricted transactional memory (RTM) support. To demonstrate the effectiveness of TxIntro, we implemented a set of kernel rootkit detectors using TxIntro.
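The core mechanism can be caricatured in a few lines (a toy Python model with hypothetical lock names; real TxIntro relies on Intel RTM in hardware): the monitoring core places kernel synchronization variables in its transactional read set, so a concurrent kernel write to any of them aborts the introspection transaction instead of letting it observe an inconsistent snapshot.

```python
# Toy model of HTM-based consistent introspection. Names and structure
# are our own; actual RTM conflict detection happens in cache hardware.

class IntrospectionTx:
    def __init__(self, sync_addrs):
        self.read_set = set(sync_addrs)   # watched kernel lock words
        self.aborted = False

    def remote_write(self, addr):
        # HTM strong atomicity: a conflicting non-transactional write to
        # a line in the read set aborts the transaction.
        if addr in self.read_set:
            self.aborted = True

    def commit(self):
        return "consistent-snapshot" if not self.aborted else "abort-and-retry"

tx = IntrospectionTx(sync_addrs={"runqueue_lock", "module_list_lock"})
tx.remote_write("unrelated_data")       # no conflict with the read set
ok = tx.commit()                        # introspection sees a consistent state
tx.remote_write("module_list_lock")     # in-flight kernel update detected
```

The aborted transaction is exactly the signal TxIntro wants: it indicates the guest kernel was mid-update, so the tool retries rather than analyzing a torn data structure.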
{"title":"Concurrent and consistent virtual machine introspection with hardware transactional memory","authors":"Yutao Liu, Yubin Xia, Haibing Guan, B. Zang, Haibo Chen","doi":"10.1109/HPCA.2014.6835951","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835951","url":null,"abstract":"Virtual machine introspection, which provides tamperresistant, high-fidelity “out of the box” monitoring of virtual machines, has many prominent security applications including VM-based intrusion detection, malware analysis and memory forensic analysis. However, prior approaches are either intrusive in stopping the world to avoid race conditions between introspection tools and the guest VM, or providing no guarantee of getting a consistent state of the guest VM. Further, there is currently no effective means for timely examining the VM states in question. In this paper, we propose a novel approach, called TxIntro, which retrofits hardware transactional memory (HTM) for concurrent, timely and consistent introspection of guest VMs. Specifically, TxIntro leverages the strong atomicity of HTM to actively monitor updates to critical kernel data structures. Then TxIntro can mount introspection to timely detect malicious tampering. To avoid fetching inconsistent kernel states for introspection, TxIntro uses HTM to add related synchronization states into the read set of the monitoring core and thus can easily detect potential inflight concurrent kernel updates. We have implemented and evaluated TxIntro based on Xen VMM on a commodity Intel Haswell machine that provides restricted transactional memory (RTM) support. To demonstrate the effectiveness of TxIntro, we implemented a set of kernel rootkit detectors using TxIntro. 
Evaluation results show that TxIntro is effective in detecting these rootkits, and is efficient in adding negligible performance overhead.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122304509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 46
DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead
Wei Wang, Tanima Dey, J. Davidson, M. Soffa
Memory bandwidth severely limits the scalability and performance of today's multi-core systems. Because of this limitation, many studies that focus on improving multi-core scalability rely on bandwidth usage predictions to achieve the best results. However, existing bandwidth prediction models have low accuracy, causing these studies to reach inaccurate conclusions or perform sub-optimally. Most of these models make predictions based on the bandwidth usage samples of a few trial runs; many factors that affect bandwidth usage, as well as the complex operations of DRAM, are overlooked. This paper presents DraMon, a model that predicts the bandwidth usage of multi-threaded programs with low overhead. It achieves high accuracy through highly accurate predictions of DRAM contention and DRAM concurrency, as well as by considering a wide range of hardware and software factors that impact bandwidth usage. We implemented two versions of DraMon: DraMon-T, a memory-trace-based model, and DraMon-R, a run-time model that uses hardware performance counters. When evaluated on a real machine with memory-intensive benchmarks, DraMon-T has average accuracies of 99.17% and 94.70% for DRAM contention predictions and bandwidth predictions, respectively.
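As a toy illustration of why contention and concurrency matter for bandwidth prediction, the sketch below caps effective concurrency at the number of DRAM banks and scales bandwidth down by a contention probability. The formula and parameters are our own simplification, not DraMon's published model.

```python
def predict_bandwidth(peak_gbps, concurrency, banks, contention_prob):
    """Illustrative bandwidth estimate (our assumption, not DraMon's model):
    - effective bank-level parallelism is capped by the number of banks;
    - each row-buffer conflict (contention) wastes cycles, scaling the
      achievable bandwidth down proportionally."""
    effective = min(concurrency, banks)
    utilization = effective / banks
    return peak_gbps * utilization * (1.0 - contention_prob)
```

Even this crude model shows the qualitative behavior DraMon quantifies precisely: more concurrent bank accesses raise achievable bandwidth, while contention pulls it back down.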
{"title":"DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead","authors":"Wei Wang, Tanima Dey, J. Davidson, M. Soffa","doi":"10.1109/HPCA.2014.6835948","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835948","url":null,"abstract":"Memory bandwidth severely limits the scalability and performance of today's multi-core systems. Because of this limitation, many studies that focused on improving multi-core scalability rely on bandwidth usage predictions to achieve the best results. However, existing bandwidth prediction models have low accuracy, causing these studies to have inaccurate conclusions or perform sub-optimally. Most of these models make predictions based on the bandwidth usage samples of a few trial runs. Many factors that affect bandwidth usage and the complex DRAM operations are overlooked. This paper presents DraMon, a model that predicts bandwidth usages for multi-threaded programs with low overhead. It achieves high accuracy through highly accurate predictions of DRAM contention and DRAM concurrency, as well as by considering a wide range of hardware and software factors that impact bandwidth usage. We implemented two versions of DraMon: DraMon-T, a memory-trace based model, and DraMon-R, a run-time model which uses hardware performance counters. When evaluated on a real machine with memory-intensive benchmarks, DraMon-T has average accuracies of 99.17% and 94.70% for DRAM contention predictions and bandwidth predictions, respectively. 
DraMon-R has average accuracies of 98.55% and 93.37% for DRAM contention and bandwidth predictions respectively, with only 0.50% overhead on average.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115398475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30
Adaptive placement and migration policy for an STT-RAM-based hybrid cache
Zhe Wang, Daniel A. Jiménez, Cong Xu, Guangyu Sun, Yuan Xie
Emerging Non-Volatile Memories (NVM) such as Spin-Torque Transfer RAM (STT-RAM) and Resistive RAM (RRAM) have been explored as potential alternatives for traditional SRAM-based Last-Level-Caches (LLCs) due to the benefits of higher density and lower leakage power. However, NVM technologies have long latency and high energy overhead associated with the write operations. Consequently, a hybrid STT-RAM and SRAM based LLC architecture has been proposed in the hope of exploiting high density and low leakage power of STT-RAM and low write overhead of SRAM. Such a hybrid cache design relies on an intelligent block placement policy that makes good use of the characteristics of both STT-RAM and SRAM technology. In this paper, we propose an adaptive block placement and migration policy (APM) for hybrid caches. LLC write accesses are categorized into three classes: prefetch-write, demand-write, and core-write. Our proposed technique places a block into either STT-RAM lines or SRAM lines by adapting to the access pattern of each class. An access pattern predictor is proposed to direct block placement and migration, which can benefit from the high density and low leakage power of STT-RAM lines as well as the low write overhead of SRAM lines. Our evaluation shows that the technique can improve performance and reduce LLC power consumption compared to both SRAM-based LLC and STT-RAM-based LLCs with the same area footprint. It outperforms the SRAM-based LLC on average by 8.0% for single-thread workloads and 20.5% for multi-core workloads. The technique reduces power consumption in the LLC by 18.9% and 19.3% for single-thread and multi-core workloads, respectively.
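The class-driven placement idea might be sketched as follows (our simplification with assumed rules; the paper's predictor is adaptive and more nuanced): write-intensive blocks are steered to SRAM lines to avoid STT-RAM's expensive writes, while read-dominated fills go to dense, low-leakage STT-RAM lines.

```python
# Hypothetical placement rule for a hybrid SRAM/STT-RAM LLC. The exact
# mapping from access class to region is our illustrative assumption.

def place_block(access_class, predicted_write_burst):
    """access_class: one of the paper's three LLC write classes.
    predicted_write_burst: access-pattern predictor's guess that the
    block will soon see further writes."""
    assert access_class in {"prefetch-write", "demand-write", "core-write"}
    # Core-writes (dirty data written back from upper levels) and blocks
    # expected to absorb a burst of writes go to SRAM, dodging STT-RAM's
    # long-latency, high-energy write operations.
    if access_class == "core-write" or predicted_write_burst:
        return "SRAM"
    # Read-dominated fills exploit STT-RAM's density and low leakage.
    return "STT-RAM"
```

A migration policy would complement this: a block initially placed in STT-RAM that turns out to be write-hot gets moved to SRAM, and vice versa.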
{"title":"Adaptive placement and migration policy for an STT-RAM-based hybrid cache","authors":"Zhe Wang, Daniel A. Jiménez, Cong Xu, Guangyu Sun, Yuan Xie","doi":"10.1109/HPCA.2014.6835933","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835933","url":null,"abstract":"Emerging Non-Volatile Memories (NVM) such as Spin-Torque Transfer RAM (STT-RAM) and Resistive RAM (RRAM) have been explored as potential alternatives for traditional SRAM-based Last-Level-Caches (LLCs) due to the benefits of higher density and lower leakage power. However, NVM technologies have long latency and high energy overhead associated with the write operations. Consequently, a hybrid STT-RAM and SRAM based LLC architecture has been proposed in the hope of exploiting high density and low leakage power of STT-RAM and low write overhead of SRAM. Such a hybrid cache design relies on an intelligent block placement policy that makes good use of the characteristics of both STT-RAM and SRAM technology. In this paper, we propose an adaptive block placement and migration policy (APM) for hybrid caches. LLC write accesses are categorized into three classes: prefetch-write, demand-write, and core-write. Our proposed technique places a block into either STT-RAM lines or SRAM lines by adapting to the access pattern of each class. An access pattern predictor is proposed to direct block placement and migration, which can benefit from the high density and low leakage power of STT-RAM lines as well as the low write overhead of SRAM lines. Our evaluation shows that the technique can improve performance and reduce LLC power consumption compared to both SRAM-based LLC and STT-RAM-based LLCs with the same area footprint. It outperforms the SRAM-based LLC on average by 8.0% for single-thread workloads and 20.5% for multi-core workloads. 
The technique reduces power consumption in the LLC by 18.9% and 19.3% for single-thread and multi-core workloads, respectively.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116017060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 121
NUAT: A non-uniform access time memory controller
Wongyu Shin, Jeongmin Yang, Jungwhan Choi, L. Kim
With the rapid development of micro-processors, off-chip memory access has become a system bottleneck. DRAM, the main memory in most computers, has for decades concentrated only on capacity and bandwidth to achieve high-performance computing. However, DRAM access latency should also be considered to sustain this development trend in the multi-core era. We therefore propose NUAT, a new memory controller that focuses on reducing memory access latency without any modification of the existing DRAM structure. We exploit only an intrinsic phenomenon of DRAM: electric charge variation in DRAM cell capacitors. Given the cost-sensitive DRAM market, this is a big advantage in terms of actual implementation. NUAT gives a score to every memory access request, and the request with the highest score obtains priority. For scoring, we introduce two new concepts: Partitioned Bank Rotation (PBR) and PBR Page Mode (PPM). First, PBR is a mechanism that derives access-speed information from refresh timing and position; a request with faster access speed gains a higher score. Second, PPM selects the better page mode between open- and close-page modes based on the information from PBR.
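The scoring idea can be sketched in a few lines (illustrative only; the decay function, the refresh window, and the arbitration rule are our assumptions, not NUAT's actual PBR mechanism): cells refreshed more recently hold more charge and can be sensed faster, so requests targeting them score higher and win arbitration.

```python
# Toy NUAT-style request scoring based on time since last refresh.

REFRESH_PERIOD_MS = 64.0   # typical DRAM retention window (assumption)

def access_score(ms_since_refresh):
    # Charge drains linearly over the refresh interval in this sketch;
    # more remaining charge implies faster sensing, hence a higher score.
    return max(0.0, 1.0 - ms_since_refresh / REFRESH_PERIOD_MS)

def pick_request(requests):
    """requests: list of (request_id, ms_since_refresh) pairs.
    Returns the id with the highest score (fastest expected access)."""
    return max(requests, key=lambda r: access_score(r[1]))[0]
```

The non-uniformity in the controller's name falls out of this: two requests to the same DRAM device see different effective latencies purely because of where their target rows sit in the refresh cycle.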
{"title":"NUAT: A non-uniform access time memory controller","authors":"Wongyu Shin, Jeongmin Yang, Jungwhan Choi, L. Kim","doi":"10.1109/HPCA.2014.6835956","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835956","url":null,"abstract":"With rapid development of micro-processors, off-chip memory access becomes a system bottleneck. DRAM, a main memory in most computers, has concentrated only on capacity and bandwidth for decades to achieve high performance computing. However, DRAM access latency should also be considered to keep the development trend in multi-core era. Therefore, we propose NUAT which is a new memory controller focusing on reducing memory access latency without any modification of the existing DRAM structure. We only exploit DRAM's intrinsic phenomenon: electric charge variation in DRAM cell capacitors. Given the cost-sensitive DRAM market, it is a big advantage in terms of actual implementation. NUAT gives a score to every memory access request and the request with the highest score obtains a priority. For scoring, we introduce two new concepts: Partitioned Bank Rotation (PBR) and PBR Page Mode (PPM). First, PBR is a mechanism that draws information of access speed from refresh timing and position; the request which has faster access speed gains higher score. Second, PPM selects a better page mode between open- and close-page modes based on the information from PBR. 
Evaluations show that NUAT decreases memory access latency significantly for various environments.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128603000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
Improving DRAM performance by parallelizing refreshes with accesses
K. Chang, Donghyuk Lee, Zeshan A. Chishti, Alaa R. Alameldeen, C. Wilkerson, Yoongu Kim, O. Mutlu
Modern DRAM cells are periodically refreshed to prevent data loss due to leakage. Commodity DDR (double data rate) DRAM refreshes cells at the rank level. This degrades performance significantly because it prevents an entire DRAM rank from serving memory requests while being refreshed. DRAM designed for mobile platforms, LPDDR (low-power DDR) DRAM, supports an enhanced mode, called per-bank refresh, that refreshes cells at the bank level. This enables one bank to be accessed while another in the same rank is being refreshed, alleviating part of the negative performance impact of refreshes. Unfortunately, per-bank refresh as employed in today's systems has two shortcomings. First, we observe that the per-bank refresh scheduling scheme does not exploit the full potential of overlapping refreshes with accesses across banks, because it restricts the banks to be refreshed in a sequential round-robin order. Second, accesses to a bank that is being refreshed have to wait. To mitigate the negative performance impact of DRAM refresh, we propose two complementary mechanisms, DARP (Dynamic Access Refresh Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal is to address the drawbacks of per-bank refresh by building more efficient techniques to parallelize refreshes and accesses within DRAM. First, instead of issuing per-bank refreshes in a round-robin order, as is done today, DARP issues per-bank refreshes to idle banks in an out-of-order manner. Furthermore, DARP proactively schedules refreshes during intervals when a batch of writes is draining to DRAM. Second, SARP exploits the existence of mostly-independent subarrays within a bank. With minor modifications to DRAM organization, it allows a bank to serve memory accesses to an idle subarray while another subarray is being refreshed.
Extensive evaluations on a wide variety of workloads and systems show that our mechanisms improve system performance (and energy efficiency) compared to three state-of-the-art refresh policies and the performance benefit increases as DRAM density increases.
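DARP's out-of-order refresh selection can be sketched as follows (a minimal Python illustration with our own interface; the real scheduler also honors DRAM timing constraints and piggybacks on write drains): prefer refreshing a bank with no pending demand requests, falling back to the round-robin candidate only when every bank is busy.

```python
# Toy out-of-order per-bank refresh selection in the spirit of DARP.
# Interface and tie-breaking are our illustrative assumptions.

def choose_refresh_bank(pending, num_banks, round_robin_next):
    """pending: set of bank ids that currently have queued demand requests.
    Returns the bank to refresh next: the lowest-numbered idle bank if one
    exists, so demand accesses and the refresh proceed in parallel;
    otherwise the default round-robin candidate."""
    for bank in range(num_banks):
        if bank not in pending:
            return bank
    return round_robin_next
```

The contrast with the baseline is that a strict round-robin scheme may refresh a bank with a deep request queue even while other banks sit idle, serializing refreshes behind demand traffic.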
{"title":"Improving DRAM performance by parallelizing refreshes with accesses","authors":"K. Chang, Donghyuk Lee, Zeshan A. Chishti, Alaa R. Alameldeen, C. Wilkerson, Yoongu Kim, O. Mutlu","doi":"10.1109/HPCA.2014.6835946","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835946","url":null,"abstract":"Modern DRAM cells are periodically refreshed to prevent data loss due to leakage. Commodity DDR (double data rate) DRAM refreshes cells at the rank level. This degrades performance significantly because it prevents an entire DRAM rank from serving memory requests while being refreshed. DRAM designed for mobile platforms, LPDDR (low power DDR) DRAM, supports an enhanced mode, called per-bank refresh, that refreshes cells at the bank level. This enables a bank to be accessed while another in the same rank is being refreshed, alleviating part of the negative performance impact of refreshes. Unfortunately, there are two shortcomings of per-bank refresh employed in today's systems. First, we observe that the perbank refresh scheduling scheme does not exploit the full potential of overlapping refreshes with accesses across banks because it restricts the banks to be refreshed in a sequential round-robin order. Second, accesses to a bank that is being refreshed have to wait. To mitigate the negative performance impact of DRAM refresh, we propose two complementary mechanisms, DARP (Dynamic Access Refresh Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal is to address the drawbacks of per-bank refresh by building more efficient techniques to parallelize refreshes and accesses within DRAM. First, instead of issuing per-bank refreshes in a round-robin order, as it is done today, DARP issues per-bank refreshes to idle banks in an out-of-order manner. Furthermore, DARP proactively schedules refreshes during intervals when a batch of writes are draining to DRAM. Second, SARP exploits the existence of mostly-independent subarrays within a bank. 
With minor modifications to DRAM organization, it allows a bank to serve memory accesses to an idle subarray while another subarray is being refreshed. Extensive evaluations on a wide variety of workloads and systems show that our mechanisms improve system performance (and energy efficiency) compared to three state-of-the-art refresh policies and the performance benefit increases as DRAM density increases.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114800697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 208
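DARP's key scheduling idea — refresh whichever pending bank is currently idle rather than walking banks in a fixed round-robin order — can be illustrated with a small toy model. This is a minimal Python sketch under stated assumptions; the class and method names (`DarpScheduler`, `pick_bank`) are illustrative inventions, not part of the paper's artifact, and real DARP operates inside a memory controller with timing constraints this sketch omits.

```python
from collections import deque

class DarpScheduler:
    """Toy model of DARP-style out-of-order per-bank refresh.

    Baseline per-bank refresh walks banks 0..N-1 in a fixed
    round-robin order. DARP instead refreshes whichever pending
    bank is currently idle, so refreshes overlap with accesses
    to the busy banks.
    """

    def __init__(self, num_banks):
        # Banks still owing a refresh in the current refresh window,
        # kept in round-robin order for the fallback case.
        self.pending = deque(range(num_banks))

    def pick_bank(self, busy_banks):
        """Return an idle pending bank to refresh; if every pending
        bank is busy, fall back to the oldest (round-robin) choice."""
        for bank in self.pending:
            if bank not in busy_banks:
                self.pending.remove(bank)
                return bank
        return self.pending.popleft() if self.pending else None
```

With four banks and banks 0 and 1 busy serving accesses, the scheduler refreshes banks 2 and 3 first, touching a busy bank only when no idle candidate remains.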
Understanding the impact of gate-level physical reliability effects on whole program execution
Raghuraman Balasubramanian, K. Sankaralingam
This paper introduces PERSim, a novel end-to-end platform for FPGA-accelerated full-system simulation of complete programs on prototype hardware, with detailed fault injection that captures gate delays and the digital-logic behavior of arbitrary circuits and provides full coverage. Using PERSim, we report on five case studies spanning a diverse spectrum of reliability techniques, including wearout prediction/detection (FIRST, Wearmon, TRIX), transient faults, and permanent faults (Sampling-DMR). PERSim provides unprecedented capability to study these techniques quantitatively when applied to a full processor running complete programs. These case studies demonstrate PERSim's robustness and flexibility: such a diverse set of techniques can be studied uniformly with common metrics like area overhead, power overhead, and detection latency. PERSim yields many new insights, two of the most important being: i) we discover an important modeling "hole" - when the true logic-delay behavior is considered, non-critical paths can transition directly into logic faults, rendering delay-based detection/prediction mechanisms that target critical paths alone insufficient; ii) when Sampling-DMR is evaluated in a real system running full applications, its detection latency is orders of magnitude lower than the previously reported model-based worst-case latency - 10^7 seconds vs. 0.84 seconds - dramatically strengthening Sampling-DMR's effectiveness. The framework is released open source and runs on the Zync platform.
DOI: 10.1109/HPCA.2014.6835976 | Published 2014-06-19
Citations: 15
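The core of gate-level fault injection — forcing one gate's output to a wrong value during netlist evaluation and comparing the result against a golden run — can be shown with a tiny combinational model. This is a hedged sketch: the `evaluate` function and its netlist encoding are invented for illustration, and it models only stuck-at faults, whereas PERSim additionally captures gate-delay behavior on an FPGA.

```python
def evaluate(netlist, inputs, stuck_at=None):
    """Evaluate a tiny combinational netlist bit-by-bit, optionally
    forcing one gate's output to a fixed value (a stuck-at fault).

    `netlist` maps gate name -> (op, operand_a, operand_b), where
    operands name primary inputs or earlier gates (insertion order
    must be topological). `stuck_at` is (gate_name, forced_value).
    """
    ops = {"AND": lambda a, b: a & b,
           "OR":  lambda a, b: a | b,
           "XOR": lambda a, b: a ^ b}
    values = dict(inputs)
    for gate, (op, a, b) in netlist.items():
        out = ops[op](values[a], values[b])
        if stuck_at is not None and gate == stuck_at[0]:
            out = stuck_at[1]   # inject the fault at this gate
        values[gate] = out
    return values

# Half adder: sum = a XOR b, carry = a AND b.
half_adder = {"sum": ("XOR", "a", "b"), "carry": ("AND", "a", "b")}
```

Comparing a golden evaluation against one with `carry` stuck at 1 shows how a single gate fault corrupts whole-program-visible state, which is the effect PERSim traces through complete program runs.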
CDTT: Compiler-generated data-triggered threads
Hung-Wei Tseng, D. Tullsen
This paper presents CDTT, a compiler framework that takes C/C++ code and automatically generates a binary that eliminates dynamically redundant code without programmer intervention. It does so by exploiting underlying hardware or software support for the data-triggered threads (DTT) programming and execution model. With the help of idempotence analysis and inter-procedural name dependence analysis, CDTT identifies potential code regions and composes support thread functions that execute as soon as live-in data changes. CDTT can also use profile data to target the elimination of redundant computation. The compiled binary running on top of a software runtime system can achieve nearly the same level of performance as careful hand-coded modifications in most benchmarks. CDTT improves the performance of serial C SPEC benchmarks by as much as 57% (average 11%) on a Nehalem processor.
DOI: 10.1109/HPCA.2014.6835973 | Published 2014-06-19
Citations: 10
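The execution model CDTT targets — re-running a code region only when its live-in data actually changes, and otherwise reusing the previous result — can be sketched as a small wrapper. This is an illustrative toy under stated assumptions: the `DataTriggered` class name and its fields are invented here, and CDTT generates equivalent structure automatically (for idempotent regions, via compiler analysis) rather than requiring such a wrapper.

```python
class DataTriggered:
    """Toy data-triggered computation: the wrapped function runs
    only when its live-in data changes, so dynamically redundant
    re-executions are elided and the cached result is returned."""

    def __init__(self, func):
        self.func = func
        self.last_input = object()  # sentinel: nothing computed yet
        self.cached = None
        self.runs = 0               # count of real executions

    def __call__(self, data):
        if data != self.last_input:  # live-in changed: recompute
            self.cached = self.func(data)
            self.last_input = data
            self.runs += 1
        return self.cached           # unchanged: skip redundant work
```

Wrapping an expensive pure function this way makes repeated calls with identical input free after the first, which is the source of CDTT's speedups on benchmarks with redundant computation.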
FADE: A programmable filtering accelerator for instruction-grain monitoring
Sotiria Fytraki, Evangelos Vlachos, Yusuf Onur Koçberber, B. Falsafi, Boris Grot
Instruction-grain monitoring is a powerful approach that enables a wide spectrum of bug-finding tools. As existing software approaches incur prohibitive runtime overhead, researchers have focused on hardware support for instruction-grain monitoring. A recurring theme in recent work is the use of hardware-assisted filtering to elide costly software analysis. This work generalizes and extends prior point solutions into a programmable filtering accelerator affording vast flexibility and at-speed event filtering. The pipelined microarchitecture of the accelerator affords a peak filtering rate of one application event per cycle, which suffices to keep up with an aggressive OoO core running the monitored application. A unique feature of the proposed design is the ability to dynamically resolve dependencies between unfilterable events and subsequent events, eliminating data-dependent stalls and maximizing the accelerator's performance. Our evaluation results show a monitoring slowdown of just 1.2-1.8x across a diverse set of monitoring tools.
DOI: 10.1109/HPCA.2014.6835922 | Published 2014-06-19
Citations: 16
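The filtering idea behind FADE — inspect each application event against per-address metadata in hardware, and invoke the slow software handler only when the metadata would actually change — can be modeled for a taint-propagation-style tool. This is a hedged sketch: `filter_events` and its shadow-map encoding are invented for illustration and stand in for FADE's programmable, pipelined microarchitecture, not reproduce it.

```python
def filter_events(events, shadow):
    """Toy model of hardware event filtering for a taint-style
    monitoring tool.

    Each event (src, dst) propagates the source address's shadow
    bit to the destination. The event is filtered out (no software
    handler runs) when the destination's shadow value would not
    change; only metadata-changing events are forwarded.
    """
    forwarded = []
    for src, dst in events:
        new_shadow = shadow.get(src, 0)
        if shadow.get(dst, 0) != new_shadow:
            shadow[dst] = new_shadow   # metadata changes: forward
            forwarded.append((src, dst))
        # else: dynamically redundant event, elided by the filter
    return forwarded
```

In a stream where the same tainted store repeats, only the first occurrence reaches the analysis software; the repeats are filtered at speed, which is where the low 1.2-1.8x slowdown comes from.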