
Latest publications: 2011 IEEE 17th International Symposium on High Performance Computer Architecture

Hardware/software techniques for DRAM thermal management
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749756
Song Liu, Brian Leung, Alexander Neckar, S. Memik, G. Memik, N. Hardavellas
The performance of the main memory is an important factor in overall system performance. To improve DRAM performance, designers have been increasing chip densities and the number of memory modules. However, these approaches increase power consumption and operating temperatures: temperatures in existing DRAM modules can rise to over 95°C. Another important property of DRAM temperature is the large variation across DRAM chip temperatures. In this paper, we present an analysis of measurements collected on a real system, indicating that temperatures across DRAM chips can vary by over 10°C. This work aims to minimize this variation as well as the peak DRAM temperature. We first develop a thermal model to estimate the temperature of DRAM chips and validate this model against real temperature measurements. We then propose three hardware and software schemes to reduce peak temperatures. The first technique introduces a new cache line replacement policy that reduces the number of accesses to the overheating DRAM chips. The second technique utilizes a Memory Write Buffer to improve the access efficiency of the overheated chips. The third scheme intelligently allocates pages to relatively cooler ranks of the DIMM. Our experiments show that in a high performance memory system, our schemes reduce the peak DRAM chip temperature by as much as 8.39°C over 10 workloads (5.36°C on average). Our schemes also improve performance, mainly due to the reduction in thermal emergencies: for a baseline system with a memory bandwidth throttling scheme, the IPC is improved by as much as 15.8% (4.1% on average).
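As a rough illustration of the third scheme, the Python sketch below allocates new pages from the coolest rank that still has free frames. The class and method names, the per-rank free lists, and the temperature inputs are hypothetical stand-ins, not the paper's OS implementation.

```python
# Hypothetical sketch of temperature-aware page allocation: free
# pages are pooled per DIMM rank, and a new page is taken from the
# coolest rank that still has free frames.

class TempAwareAllocator:
    def __init__(self, free_pages_per_rank, rank_temps):
        # free_pages_per_rank: {rank_id: [free page frame numbers]}
        # rank_temps: {rank_id: current temperature in Celsius}
        self.free = free_pages_per_rank
        self.temps = rank_temps

    def allocate_page(self):
        # Visit ranks from coolest to hottest; allocate from the
        # first one that still has a free frame.
        for rank in sorted(self.free, key=lambda r: self.temps[r]):
            if self.free[rank]:
                return rank, self.free[rank].pop()
        raise MemoryError("no free pages in any rank")

alloc = TempAwareAllocator(
    free_pages_per_rank={0: [10, 11], 1: [20, 21]},
    rank_temps={0: 92.5, 1: 84.0},  # rank 1 is the cooler rank
)
print(alloc.allocate_page())        # -> (1, 21): taken from the cooler rank
```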
Citations: 60
ACCESS: Smart scheduling for asymmetric cache CMPs
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749757
Xiaowei Jiang, Asit K. Mishra, Li Zhao, R. Iyer, Zhen Fang, S. Srinivasan, S. Makineni, P. Brett, C. Das
In current chip multiprocessors (CMPs), a significant portion of the die is consumed by the last-level cache. Until recently, the balance of cache and core space has been primarily guided by the needs of single applications. However, as multiple applications or virtual machines (VMs) are consolidated on such a platform, researchers have observed that not all VMs or applications require a significant amount of cache space. In order to take advantage of this phenomenon, we explore the use of asymmetric last-level caches in a CMP platform. While asymmetric cache CMPs provide the benefit of reduced power and area, it is important to build in hardware/software support to appropriately schedule applications onto cores with suitable cache capacity. In this paper, we address this problem with our ACCESS architecture, comprising: (a) asymmetric caches across a group of cores, (b) hardware support that enables prediction of cache performance on the different cache sizes, and (c) OS scheduler support that uses this prediction capability to schedule applications onto cores with suitable cache capacity. Measurements on a working prototype using SPEC2006 benchmarks show that our ACCESS architecture can effectively schedule jobs in an asymmetric cache CMP, providing a 23% performance improvement over a naive scheduler and making schedules that are 97% as good as an oracle scheduler's.
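As a rough illustration of the scheduling side, the sketch below greedily assigns the most cache-sensitive applications (largest predicted miss-rate reduction from a larger cache) to the big-cache cores. The prediction table, the greedy policy, and all names are illustrative assumptions, not the ACCESS hardware predictor or OS scheduler.

```python
# Hypothetical greedy assignment of applications to cores with
# asymmetric last-level caches: applications that benefit most from
# a big cache are placed on the big-cache cores first.

def schedule(apps, cores):
    # apps: {name: {"small": predicted miss rate on a small cache,
    #               "big":   predicted miss rate on a big cache}}
    # cores: {core_id: "small" or "big"}
    order = sorted(apps, key=lambda a: apps[a]["small"] - apps[a]["big"],
                   reverse=True)          # most cache-sensitive first
    big = [c for c, kind in cores.items() if kind == "big"]
    small = [c for c, kind in cores.items() if kind == "small"]
    placement = {}
    for app in order:
        placement[app] = big.pop(0) if big else small.pop(0)
    return placement

apps = {"mcf":    {"small": 0.40, "big": 0.10},   # cache-sensitive
        "gcc":    {"small": 0.12, "big": 0.08},
        "stream": {"small": 0.30, "big": 0.29}}   # cache-insensitive
cores = {0: "big", 1: "small", 2: "small"}
print(schedule(apps, cores))  # mcf lands on the big-cache core 0
```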
Citations: 34
Efficient data streaming with on-chip accelerators: Opportunities and challenges
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749739
Rui Hou, Lixin Zhang, Michael C. Huang, Kun Wang, H. Franke, Y. Ge, Xiaotao Chang
The transistor density of microprocessors continues to increase as technology scales. Microprocessor designers have taken advantage of the additional transistors by integrating a significant number of cores onto a single die. However, large core counts run into diminishing returns due to software and hardware scalability issues, and hence designers have started integrating on-chip special-purpose logic units (i.e., accelerators) that were previously available as PCI-attached units. It is anticipated that more accelerators will be integrated on-chip due to the increasing abundance of transistors and the fact that power budget limits prevent all logic from being powered at all times. Thus, on-chip accelerator architectures deserve more attention from the research community, and there is a wide spectrum of research opportunities in the design and optimization of accelerators. This paper attempts to bring out some insights by studying the data access streams of on-chip accelerators, in the hope of fostering future research in this area. Specifically, this paper uses a few simple case studies to show some of the common characteristics of the data streams introduced by on-chip accelerators, discusses challenges and opportunities in exploiting these characteristics to optimize the power and performance of accelerators, and then analyzes the effectiveness of some simple optimizing extensions that we propose.
Citations: 30
Efficient complex operators for irregular codes
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749754
J. Sampson, Ganesh Venkatesh, Nathan Goulding, Saturnino Garcia, S. Swanson, M. Taylor
Complex “fat operators” are important contributors to the efficiency of specialized hardware. This paper introduces two new techniques for constructing efficient fat operators featuring up to dozens of operations with arbitrary and irregular data and memory dependencies. These techniques focus on minimizing critical path length and load-use delay, which are key concerns for irregular computations. Selective Depipelining (SDP) is a pipelining technique that allows fat operators containing several, possibly dependent, memory operations. SDP allows memory requests to operate at a faster clock rate than the datapath, saving power in the datapath and improving memory performance. Cachelets are small, customized, distributed L0 caches embedded in the datapath to reduce load-use latency. We apply these techniques to Conservation Cores (c-cores) to produce coprocessors that accelerate irregular code regions while still providing superior energy efficiency. On average, these enhanced c-cores reduce EDP by 2× and area by 35% relative to c-cores. They are up to 2.5× faster than a general-purpose processor and reduce energy consumption by up to 8× for a variety of irregular applications including several SPECINT benchmarks.
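To make the cachelet idea concrete, here is a minimal behavioral model of a tiny direct-mapped L0 cache in front of a slower backing store. The capacity, the word-granularity lines, and the `load` interface are illustrative assumptions rather than the paper's design.

```python
# Behavioral model of a cachelet: a tiny direct-mapped L0 cache that
# services loads on the short datapath path and falls back to a
# slower backing store on a miss.

class Cachelet:
    def __init__(self, backing, num_lines=4):
        self.backing = backing               # dict: address -> value
        self.num_lines = num_lines
        self.lines = [None] * num_lines      # each entry: (addr, value)
        self.hits = self.misses = 0

    def load(self, addr):
        idx = addr % self.num_lines          # direct-mapped index
        entry = self.lines[idx]
        if entry is not None and entry[0] == addr:
            self.hits += 1                   # short load-use path
            return entry[1]
        self.misses += 1                     # go to the backing store
        value = self.backing[addr]
        self.lines[idx] = (addr, value)
        return value

mem = {a: a * 10 for a in range(16)}
c = Cachelet(mem)
for a in [0, 1, 0, 1]:                       # reuse hits after first touch
    c.load(a)
print(c.hits, c.misses)                      # -> 2 2
```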
Citations: 41
Cuckoo directory: A scalable directory for many-core systems
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749726
M. Ferdman, P. Lotfi-Kamran, Ken Balet, B. Falsafi
Growing core counts have highlighted the need for scalable on-chip coherence mechanisms. The increase in the number of on-chip cores exposes the energy and area costs of scaling the directories. Duplicate-tag-based directories require highly associative structures that grow with core count, precluding scalability due to prohibitive power consumption. Sparse directories overcome the power barrier by reducing directory associativity, but require storage area over-provisioning to avoid high invalidation rates. We propose the Cuckoo directory, a power- and area-efficient scalable distributed directory. The Cuckoo directory scales to high core counts without the energy costs of wide associative lookup and without gross capacity over-provisioning. Simulation of a 16-core CMP with commercial server and scientific workloads shows that the Cuckoo directory eliminates invalidations while being up to four times more power-efficient than the Duplicate-tag directory and 24% more power-efficient and up to seven times more area-efficient than the Sparse directory organization. Analytical projections indicate that the Cuckoo directory retains its energy and area benefits with increasing core count, efficiently scaling to at least 1024 cores.
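The directory takes its name from cuckoo hashing, which the sketch below illustrates: an insertion that collides evicts the resident entry and relocates it via its alternate hash function. This is textbook two-table cuckoo hashing with made-up hash functions, not the paper's exact directory organization.

```python
# Generic cuckoo-hash insertion: each key has one candidate slot per
# table; a colliding insert displaces the resident entry, which is
# then re-inserted at its alternate slot, and so on.

SIZE = 8

def h1(key): return key % SIZE
def h2(key): return (key // SIZE) % SIZE

def insert(tables, key, value, max_kicks=16):
    entry = (key, value)
    for _ in range(max_kicks):
        for t, h in ((0, h1), (1, h2)):
            slot = h(entry[0])
            if tables[t][slot] is None:
                tables[t][slot] = entry
                return True
            # Evict the resident entry and keep trying to place it.
            tables[t][slot], entry = entry, tables[t][slot]
    return False  # table too full; a real design would rehash or grow

tables = [[None] * SIZE, [None] * SIZE]
for k in [3, 11, 19, 27]:          # all collide in table 0 (k % 8 == 3)
    assert insert(tables, k, f"dir-entry-{k}")
```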
Citations: 128
Practical and secure PCM systems by online detection of malicious write streams
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749753
Moinuddin K. Qureshi, André Seznec, L. Lastras, M. Franceschini
Phase Change Memory (PCM) may become a viable alternative for the design of main memory systems in the next few years. However, PCM suffers from limited write endurance. Therefore, future adoption of PCM as a technology for main memory will depend on the availability of practical wear-leveling solutions that avoid uneven usage, especially in the presence of potentially malicious users. First-generation wear leveling algorithms were designed for typical workloads and suffer significantly reduced lifetime under malicious access patterns that try to write to the same line continuously. Secure wear leveling algorithms were recently proposed. They can handle such malicious attacks, but require that wear leveling be done at a rate that is orders of magnitude higher than what is sufficient for typical applications, thereby incurring significant write overhead and potentially impairing overall system performance. This paper proposes a practical wear-leveling framework that can provide years of lifetime under attacks while still incurring negligible (<1%) write overhead for typical applications. It uses a simple and novel Online Attack Detector circuit to adapt the rate of wear leveling to the properties of the memory reference stream, thereby obtaining the best of both worlds: low overhead for typical applications and years of lifetime under attacks. The proposed attack detector requires a storage overhead of 68 bytes, is effective at estimating the severity of attacks, is applicable to a wide variety of wear leveling algorithms, and reduces the write overhead of several recently proposed wear leveling algorithms by 16x–128x. The paradigm of online attack detection enables other preventive actions as well.
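The adaptive idea can be sketched directly: estimate how concentrated the recent write stream is, and raise the wear-leveling rate only when the stream looks malicious. The windowed hottest-line metric, the rate mapping, and the class name below are illustrative assumptions; they do not reproduce the paper's 68-byte detector circuit.

```python
# Hypothetical online attack detector: track recent write addresses
# in a sliding window and raise the wear-leveling rate (remaps per N
# writes) only when writes concentrate on a few lines.

from collections import Counter, deque

class AttackDetector:
    def __init__(self, window=1024):
        self.window = deque(maxlen=window)  # recent write addresses
        self.counts = Counter()

    def record_write(self, line_addr):
        if len(self.window) == self.window.maxlen:
            evicted = self.window[0]        # deque drops this on append
            self.counts[evicted] -= 1
            if self.counts[evicted] == 0:
                del self.counts[evicted]
        self.window.append(line_addr)
        self.counts[line_addr] += 1

    def severity(self):
        # Fraction of the window hitting the single hottest line:
        # ~1.0 for a repeated-address attack, near 0 for typical streams.
        if not self.window:
            return 0.0
        return max(self.counts.values()) / len(self.window)

    def remap_interval(self):
        # Benign traffic: remap rarely; severe attack: remap aggressively.
        return 10 if self.severity() > 0.5 else 100

det = AttackDetector()
for _ in range(1024):
    det.record_write(0x42)                  # attacker hammers one line
print(det.severity(), det.remap_interval())  # -> 1.0 10
```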
Citations: 101
Hardware/software-based diagnosis of load-store queues using expandable activity logs
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749740
J. Carretero, X. Vera, J. Abella, Tanausú Ramírez, M. Monchiero, Antonio González
The increasing device count and design complexity are posing significant challenges to post-silicon validation. Bug diagnosis is the most difficult step during post-silicon validation. Limited reproducibility and low testing speeds are common limitations of current testing techniques. Moreover, low observability defies full-speed testing approaches. Modern solutions like on-chip trace buffers alleviate these issues, but are unable to store long activity traces. As a consequence, the cost of post-Si validation now represents a large fraction of the total design cost. This work describes a hybrid post-Si approach to validate a modern load-store queue. We use an effective error detection mechanism and an expandable logging mechanism to observe the microarchitectural activity for long periods of time, at full processor speed. Validation is performed by analyzing the logged activity with a diagnosis algorithm, and correct memory ordering is checked to pinpoint the root cause of errors.
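A simplified version of the ordering check is easy to state in code: replay logged loads and stores in commit order and flag any load whose value does not match the latest prior store to the same address. The log format below is a hypothetical stand-in for the paper's activity log.

```python
# Simplified memory-ordering check over an activity log: replay
# entries in commit order and verify each load returns the value of
# the most recent prior store to the same address.

def check_log(log, initial_memory=None):
    # log: list of ("ST" | "LD", address, value) in commit order
    mem = dict(initial_memory or {})
    errors = []
    for i, (op, addr, value) in enumerate(log):
        if op == "ST":
            mem[addr] = value
        elif op == "LD" and mem.get(addr) != value:
            # (log index, address, expected value, observed value)
            errors.append((i, addr, mem.get(addr), value))
    return errors

log = [("ST", 0x100, 7), ("LD", 0x100, 7),
       ("ST", 0x100, 9), ("LD", 0x100, 7)]   # last load missed the newer store
print(check_log(log))                        # -> [(3, 256, 9, 7)]
```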
Citations: 9
Thread block compaction for efficient SIMT control flow
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749714
Wilson W. L. Fung, Tor M. Aamodt
Manycore accelerators such as graphics processing units (GPUs) organize processing units into single-instruction, multiple-data “cores” to improve throughput per unit hardware cost. Programming models for these accelerators encourage applications to run kernels with large groups of parallel scalar threads. The hardware groups these threads into warps/wavefronts and executes them in lockstep, an execution model NVIDIA dubs single-instruction, multiple-thread (SIMT). While current GPUs employ a per-warp (or per-wavefront) stack to manage divergent control flow, this approach loses efficiency for applications with nested, data-dependent control flow. In this paper, we propose and evaluate the benefits of extending the sharing of resources in a block of warps, already used for scratchpad memory, to exploit control flow locality among threads (where such sharing may at first seem detrimental). In our proposal, warps within a thread block share a common block-wide stack for divergence handling. At a divergent branch, threads are compacted into new warps in hardware. Our simulation results show that this compaction mechanism provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism, and 17% versus dynamic warp formation, on a set of CUDA applications that suffer significantly from control flow divergence.
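The compaction step itself is simple to illustrate: at a divergent branch, threads from all the block's warps that took the same path are repacked into as few dense warps as possible. The sketch below is a behavioral illustration with an assumed warp width of 4, not the hardware mechanism.

```python
# Behavioral sketch of thread block compaction: threads across a
# block's warps that take the same side of a branch are repacked into
# dense warps, instead of each warp keeping a sparse active mask.

WARP_SIZE = 4

def compact(warps, taken):
    # warps: list of warps, each a list of thread IDs
    # taken: dict thread_id -> True (taken) / False (not taken)
    paths = {True: [], False: []}
    for warp in warps:
        for tid in warp:
            paths[taken[tid]].append(tid)
    # Repack each path's threads into full warps.
    return {path: [tids[i:i + WARP_SIZE]
                   for i in range(0, len(tids), WARP_SIZE)]
            for path, tids in paths.items()}

warps = [[0, 1, 2, 3], [4, 5, 6, 7]]
taken = {0: True, 1: False, 2: True, 3: False,
         4: True, 5: False, 6: True, 7: False}
# Without compaction: four half-empty warp issues (two per path).
# With compaction: one full warp per path.
print(compact(warps, taken))
# -> {True: [[0, 2, 4, 6]], False: [[1, 3, 5, 7]]}
```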
Citations: 181
Shared last-level TLBs for chip multiprocessors
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749717
A. Bhattacharjee, Daniel Lustig, M. Martonosi
Translation Lookaside Buffers (TLBs) are critical to processor performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as chip multiprocessors (CMPs) become ubiquitous, TLB design must be re-evaluated. This paper is the first to propose and evaluate shared last-level (SLL) TLBs as an alternative to the commercial norm of private, per-core L2 TLBs. SLL TLBs eliminate 7–79% of system-wide misses for parallel workloads. This is an average of 27% better than conventional private, per-core L2 TLBs, translating to notable runtime gains. SLL TLBs also provide benefits comparable to recently-proposed Inter-Core Cooperative (ICC) TLB prefetchers, but with considerably simpler hardware. Furthermore, unlike these prefetchers, SLL TLBs can aid sequential applications, eliminating 35–95% of the TLB misses for various multiprogrammed combinations of sequential applications. This corresponds to a 21% average increase in TLB miss eliminations compared to private, per-core L2 TLBs. Because of their benefits for parallel and sequential applications, and their readily-implementable hardware, SLL TLBs hold great promise for CMPs.
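The organization is straightforward to model: each core probes its private L1 TLB first and then a single L2 TLB shared by all cores, so a translation faulted in by one core can later hit for another. The sizes, the true-LRU replacement, and the class names below are illustrative assumptions, not the paper's hardware.

```python
# Behavioral model of a shared last-level TLB: each core has a small
# private L1 TLB, and all cores share one L2 TLB.

from collections import OrderedDict

class Tlb:
    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()            # vpn -> ppn, in LRU order

    def lookup(self, vpn):
        if vpn in self.map:
            self.map.move_to_end(vpn)       # refresh LRU position
            return self.map[vpn]
        return None

    def fill(self, vpn, ppn):
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)    # evict the LRU entry
        self.map[vpn] = ppn

class SllTlbSystem:
    def __init__(self, num_cores, l1_entries=4, l2_entries=64):
        self.l1 = [Tlb(l1_entries) for _ in range(num_cores)]
        self.l2 = Tlb(l2_entries)           # shared by all cores

    def translate(self, core, vpn, page_table):
        ppn = self.l1[core].lookup(vpn)
        if ppn is None:
            ppn = self.l2.lookup(vpn)       # shared L2 probe
            if ppn is None:
                ppn = page_table[vpn]       # page-table walk
                self.l2.fill(vpn, ppn)
            self.l1[core].fill(vpn, ppn)
        return ppn

sys = SllTlbSystem(num_cores=2)
pt = {0x1000: 0xA000}
sys.translate(0, 0x1000, pt)   # core 0 walks the page table
sys.translate(1, 0x1000, pt)   # core 1 hits in the shared L2 TLB
```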
Citations: 140
Archipelago: A polymorphic cache design for enabling robust near-threshold operation
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749758
Amin Ansari, Shuguang Feng, S. Gupta, S. Mahlke
Extreme technology integration in the sub-micron regime comes with a rapid rise in heat dissipation and power density for modern processors. Dynamic voltage scaling is a widely used technique to tackle this problem when high performance is not the main concern. However, the minimum achievable supply voltage for the processor is often bounded by the large on-chip caches, since SRAM cells fail at a significantly faster rate than logic cells when the supply voltage is reduced. This is mainly due to the higher susceptibility of SRAM structures to process-induced parameter variations. In this work, we propose a highly flexible fault-tolerant cache design, Archipelago, that by reconfiguring its internal organization can efficiently tolerate the large number of SRAM failures that arise when operating in the near-threshold region. Archipelago partitions the cache into multiple autonomous islands of various sizes, which can operate correctly without borrowing redundancy from each other. Our configuration algorithm, an adapted version of minimum clique covering, exploits the high degree of flexibility in the Archipelago architecture to reduce the granularity of redundancy replacement and minimize the amount of cache space lost when operating in the near-threshold region. Using our approach, the operational voltage of a processor can be reduced to 375mV, which translates to 79% dynamic and 51% leakage power savings (in 90nm) for a microprocessor similar to the Alpha 21364. These power savings come with a 4.6% performance drop-off when operating in low power mode and a 2% area overhead for the microprocessor.
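A drastically simplified version of the grouping problem can be sketched as a greedy first-fit pass: cache lines whose fault bitmaps do not collide may share an island, where spare capacity patches the faulty bits. Treating disjoint fault masks as the compatibility test, and first-fit in place of minimum clique covering, are both illustrative simplifications of the paper's configuration algorithm.

```python
# Greedy first-fit grouping of faulty cache lines into islands: two
# lines may share an island only if no bit position is faulty in
# both (their fault masks are disjoint).

def group_lines(fault_masks):
    # fault_masks: {line_id: int bitmask of faulty bit positions}
    islands = []   # each island: (combined fault mask, [line ids])
    for line, mask in sorted(fault_masks.items()):
        for i, (combined, members) in enumerate(islands):
            if combined & mask == 0:        # no colliding faulty bits
                islands[i] = (combined | mask, members + [line])
                break
        else:
            islands.append((mask, [line]))  # start a new island
    return [members for _, members in islands]

faults = {0: 0b0001, 1: 0b0010, 2: 0b0001, 3: 0b0000}
print(group_lines(faults))   # -> [[0, 1, 3], [2]]: fewer islands, less waste
```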
Citations: 78