
Latest publications in Proceedings. International Symposium on Computer Architecture

Simultaneous speculative threading: a novel pipeline architecture implemented in sun's rock processor
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555814
Shailender Chaudhry, R. Cypher, M. Ekman, Martin Karlsson, A. Landin, S. Yip, Håkan Zeffer, M. Tremblay
This paper presents Simultaneous Speculative Threading (SST), which is a technique for creating high-performance area- and power-efficient cores for chip multiprocessors. SST hardware dynamically extracts two threads of execution from a single sequential program (one consisting of a load miss and its dependents, and the other consisting of the instructions that are independent of the load miss) and executes them in parallel. SST uses an efficient checkpointing mechanism to eliminate the need for complex and power-inefficient structures such as register renaming logic, reorder buffers, memory disambiguation buffers, and large issue windows. Simulations of certain SST implementations show 18% better per-thread performance on commercial benchmarks than larger and higher-powered out-of-order cores. Sun Microsystems' ROCK processor, which is the first processor to use SST cores, has been implemented and is scheduled to be commercially available in 2009.
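The two-thread split the abstract describes (a load miss and its dependents deferred behind a checkpoint, while independent instructions run ahead) can be illustrated with a toy software model. The function, instruction encoding, and dependence sets below are hypothetical illustrations, not the ROCK pipeline:

```python
def sst_execute(instrs, miss_loads):
    """Toy SST sketch. instrs: list of (name, deps) in program order.
    Instructions that are (transitively) dependent on a missing load are
    deferred; independent instructions execute ahead. Returns the
    resulting execution order: ahead thread first, deferred thread after
    the miss data returns."""
    deferred_names = set()
    ahead, deferred = [], []
    for name, deps in instrs:
        if name in miss_loads or deferred_names & set(deps):
            deferred_names.add(name)   # miss or dependent on a deferred op
            deferred.append(name)
        else:
            ahead.append(name)         # independent work runs ahead
    return ahead + deferred
```

Under this simplification, a missing load and its dependent chain drain after the independent instructions, mirroring the two extracted threads.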
Citations: 66
Architecting phase change memory as a scalable dram alternative
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555758
Benjamin C. Lee, Engin Ipek, O. Mutlu, D. Burger
Memory scaling is in jeopardy as charge storage and sensing mechanisms become less reliable for prevalent memory technologies, such as DRAM. In contrast, phase change memory (PCM) storage relies on scalable current and thermal mechanisms. To exploit PCM's scalability as a DRAM alternative, PCM must be architected to address relatively long latencies, high energy writes, and finite endurance. We propose, crafted from a fundamental understanding of PCM technology parameters, area-neutral architectural enhancements that address these limitations and make PCM competitive with DRAM. A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system. Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing. Partial writes enhance memory endurance, providing 5.6 years of lifetime. Process scaling will further reduce PCM energy costs and improve endurance.
Citations: 1491
Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555788
Hongzhong Zheng, Jiang Lin, Zhao Zhang, Zhichun Zhu
The widespread use of multicore processors has dramatically increased the demands on high bandwidth and large capacity from memory systems. In a conventional DDR2/DDR3 DRAM memory system, the memory bus and DRAM devices run at the same data rate. To improve memory bandwidth, we propose a new memory system design called decoupled DIMM that allows the memory bus to operate at a data rate much higher than that of the DRAM devices. In the design, a synchronization buffer is added to relay data between the slow DRAM devices and the fast memory bus; and memory access scheduling is revised to avoid access conflicts on memory ranks. The design not only improves memory bandwidth beyond what can be supported by current memory devices, but also improves reliability, power efficiency, and cost effectiveness by using relatively slow memory devices. The idea of decoupling, precisely the decoupling of bandwidth match between memory bus and a single rank of devices, can also be applied to other types of memory systems including FB-DIMM. Our experimental results show that a decoupled DIMM system of 2667MT/s bus data rate and 1333MT/s device data rate improves the performance of memory-intensive workloads by 51% on average over a conventional memory system of 1333MT/s data rate. Alternatively, a decoupled DIMM system of 1600MT/s bus data rate and 800MT/s device data rate incurs only 8% performance loss when compared with a conventional system of 1600MT/s data rate, with 16% reduction on the memory power consumption and 9% saving on memory energy.
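The arithmetic behind the bus/device pairings in the abstract (2667MT/s bus over 1333MT/s devices, or 1600 over 800) can be sketched as a back-of-envelope calculation; the functions and default bus width below are assumptions, not from the paper:

```python
def peak_bandwidth_gbs(data_rate_mts, bus_width_bytes=8):
    """Peak bandwidth of a 64-bit (8-byte-wide) DDR channel in GB/s.
    Toy calculation; ignores command, refresh, and turnaround overheads."""
    return data_rate_mts * bus_width_bytes / 1000.0

def ranks_to_match_bus(bus_rate_mts, device_rate_mts):
    """Idealized number of DRAM ranks the synchronization buffer must
    interleave so the slow devices can keep the fast bus busy (assumes
    conflict-free scheduling across ranks)."""
    return round(bus_rate_mts / device_rate_mts)
```

In this idealized view, a bus at twice the device data rate needs two ranks relayed through the synchronization buffer to sustain its peak rate.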
Citations: 61
Scaling the bandwidth wall: challenges in and avenues for CMP scaling
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555801
Brian Rogers, A. Krishna, Gordon B. Bell, K. V. Vu, Xiaowei Jiang, Yan Solihin
As transistor density continues to grow at an exponential rate in accordance to Moore's law, the goal for many Chip Multi-Processor (CMP) systems is to scale the number of on-chip cores proportionally. Unfortunately, off-chip memory bandwidth capacity is projected to grow slowly compared to the desired growth in the number of cores. This creates a situation in which each core will have a decreasing amount of off-chip bandwidth that it can use to load its data from off-chip memory. The situation in which off-chip bandwidth is becoming a performance and throughput bottleneck is referred to as the bandwidth wall problem. In this study, we seek to answer two questions: (1) to what extent does the bandwidth wall problem restrict future multicore scaling, and (2) to what extent are various bandwidth conservation techniques able to mitigate this problem. To address them, we develop a simple but powerful analytical model to predict the number of on-chip cores that a CMP can support given a limited growth in memory traffic capacity. We find that the bandwidth wall can severely limit core scaling. When starting with a balanced 8-core CMP, in four technology generations the number of cores can only scale to 24, as opposed to 128 cores under proportional scaling, without increasing the memory traffic requirement. We find that various individual bandwidth conservation techniques we evaluate have a wide ranging impact on core scaling, and when combined together, these techniques have the potential to enable super-proportional core scaling for up to 4 technology generations.
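The abstract's headline numbers (8 cores growing to only 24 over four generations, versus 128 under proportional scaling) follow from a simple growth model. The sketch below is a toy version, not the paper's actual analytical model; the ~32%-per-generation bandwidth growth factor is inferred from those numbers:

```python
def max_cores(base_cores, bw_growth_per_gen, generations):
    """Toy bandwidth-wall model: if per-core off-chip traffic stays
    constant, the supportable core count can grow only as fast as total
    memory bandwidth (hypothetical simplification of the paper's model)."""
    return round(base_cores * bw_growth_per_gen ** generations)
```

Proportional scaling doubles cores each generation (8 to 128 in four), but if bandwidth grows only by a factor of 3 over four generations (about 1.32x per generation), the same die supports just 24 cores.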
Citations: 305
Spatio-temporal memory streaming
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555766
Stephen Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi
Recent research advocates memory streaming techniques to alleviate the performance bottleneck caused by the high latencies of off-chip memory accesses. Temporal memory streaming replays previously observed miss sequences to eliminate long chains of dependent misses. Spatial memory streaming predicts repetitive data layout patterns within fixed-size memory regions. Because each technique targets a different subset of misses, their effectiveness varies across workloads and each leaves a significant fraction of misses unpredicted. In this paper, we propose Spatio-Temporal Memory Streaming (STeMS) to exploit the synergy between spatial and temporal streaming. We observe that the order of spatial accesses repeats both within and across regions. STeMS records and replays the temporal sequence of region accesses and uses spatial relationships within each region to dynamically reconstruct a predicted total miss order. Using trace-driven and cycle-accurate simulation across a suite of commercial workloads, we demonstrate that with similar implementation complexity as temporal streaming, STeMS achieves equal or higher coverage than spatial or temporal memory streaming alone, and improves performance by 31%, 3%, and 18% over stride, spatial, and temporal prediction, respectively.
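The record-and-replay idea (a temporal sequence of region accesses plus a spatial footprint per region) can be sketched as follows. This is a hypothetical simplification: it groups all of a region's offsets together on replay, whereas STeMS reconstructs a fully interleaved miss order. The region size is assumed:

```python
REGION = 4096  # assumed fixed region size in bytes

def record(trace):
    """Record a miss-address trace as: regions in first-touch temporal
    order (dict preserves insertion order in Python >= 3.7), with the
    spatial offsets touched inside each region."""
    seq = {}
    for addr in trace:
        base, off = addr - addr % REGION, addr % REGION
        seq.setdefault(base, []).append(off)
    return seq

def replay(seq):
    """Reconstruct a predicted miss order: regions in temporal order,
    offsets in the spatial order observed within each region."""
    return [base + off for base, offs in seq.items() for off in offs]
```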
Citations: 139
A durable and energy efficient main memory using phase change memory technology
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555759
Ping Zhou, Bo Zhao, Jun Yang, Youtao Zhang
Using nonvolatile memories in memory hierarchy has been investigated to reduce its energy consumption because nonvolatile memories consume zero leakage power in memory cells. One of the difficulties is, however, that the endurance of most nonvolatile memory technologies is much shorter than the conventional SRAM and DRAM technology. This has limited its usage to only the low levels of a memory hierarchy, e.g., disks, that is far from the CPU. In this paper, we study the use of a new type of nonvolatile memories -- the Phase Change Memory (PCM) as the main memory for a 3D stacked chip. The main challenges we face are the limited PCM endurance, longer access latencies, and higher dynamic power compared to the conventional DRAM technology. We propose techniques to extend the endurance of the PCM to an average of 13 (for MLC PCM cell) to 22 (for SLC PCM) years. We also study the design choices of implementing PCM to achieve the best tradeoff between energy and performance. Our design reduced the total energy of an already low-power DRAM main memory of the same capacity by 65%, and energy-delay2 product by 60%. These results indicate that it is feasible to use PCM technology in place of DRAM in the main memory for better energy efficiency.
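Lifetime figures like the 13 to 22 years above come from endurance arithmetic. A back-of-envelope model, assuming perfect wear-leveling (the function and the parameter values in the test are hypothetical, not the paper's):

```python
def pcm_lifetime_years(capacity_bytes, endurance_writes, write_bw_bytes_per_s):
    """Idealized PCM lifetime: with perfect wear-leveling, every cell
    absorbs an equal share of the write traffic. Real lifetimes also
    depend on write locality and on filtering redundant (partial) writes."""
    total_cell_writes = capacity_bytes * endurance_writes
    seconds = total_cell_writes / write_bw_bytes_per_s
    return seconds / (365 * 24 * 3600)
```

For example, a 32GB PCM main memory with 1e8 writes-per-cell endurance sustaining 1GB/s of writes would last roughly a century under this ideal model; realistic write locality without wear-leveling shortens that dramatically.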
Citations: 944
SigRace: signature-based data race detection
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555797
A. Muzahid, D. S. Gracia, Shanxiang Qi, J. Torrellas
Detecting data races in parallel programs is important for both software development and production-run diagnosis. Recently, there have been several proposals for hardware-assisted data race detection. Such proposals typically modify the L1 cache and cache coherence protocol messages, and largely lose their capability when lines get displaced or invalidated from the cache. To eliminate these shortcomings, this paper proposes a novel, different approach to hardware-assisted data race detection. The approach, called SigRace, relies on hardware address signatures. As a processor runs, the addresses of the data that it accesses are automatically encoded in signatures. At certain times, the signatures are automatically passed to a hardware module that intersects them with those of other processors. If the intersection is not null, a data race may have occurred. This paper presents the architecture of SigRace, an implementation, and its software interface. With SigRace, caches and coherence protocol messages are unmodified. Moreover, cache lines can be displaced and invalidated with no effect. Our experiments show that SigRace is significantly more effective than a state-of-the-art conventional hardware-assisted race detector. SigRace finds on average 29% more static races and 107% more dynamic races. Moreover, if we inject data races, SigRace finds 150% more static races than the conventional scheme.
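The address signatures the abstract describes behave like Bloom filters: addresses are hashed into a fixed-width bit vector, and a non-empty intersection between two threads' signatures flags a possible race. A software sketch (the hash seeds and signature width are assumptions; the paper implements signatures as fixed hardware hash networks):

```python
class AddressSignature:
    """Bloom-filter-style address signature, as a software sketch."""

    def __init__(self, bits=1024):
        self.bits = bits
        self.sig = 0  # the signature is a single bit vector

    def add(self, addr):
        # Set one bit per hash function (two assumed multiplicative hashes,
        # taking the high bits of a 32-bit product).
        for seed in (0x9E3779B9, 0x85EBCA6B):
            h = ((addr * seed) & 0xFFFFFFFF) * self.bits >> 32
            self.sig |= 1 << h

    def intersects(self, other):
        # Non-empty intersection: the two threads MAY share an address.
        # False positives are possible; encoded accesses are never missed.
        return (self.sig & other.sig) != 0
```

Intersection is a single AND over bit vectors, which is why the check is cheap enough to run on signature exchanges rather than per cache-line event.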
Citations: 97
A case for an interleaving constrained shared-memory multi-processor
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555796
Jie Yu, S. Narayanasamy
Shared-memory multi-threaded programming is inherently more difficult than single-threaded programming. The main source of complexity is that, the threads of an application can interleave in so many different ways. To ensure correctness, a programmer has to test all possible thread interleavings, which, however, is impractical. Many rare thread interleavings remain untested in production systems, and they are the root cause for a majority of concurrency bugs. We propose a shared-memory multi-processor design that avoids untested interleavings to improve the correctness of a multi-threaded program. Since untested interleavings tend to occur infrequently at runtime, the performance cost of avoiding them is not high. We propose to encode the set of tested correct interleavings in a program's binary executable using Predecessor Set (PSet) constraints. These constraints are efficiently enforced at runtime using processor support, which ensures that the runtime follows a tested interleaving. We analyze several bugs in open source applications such as MySQL, Apache, Mozilla, etc., and show that, by enforcing PSet constraints, we can avoid not only data races and atomicity violations, but also other forms of concurrency bugs.
Citations: 180
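The Predecessor Set (PSet) mechanism described in the abstract above can be illustrated with a toy software model. This is a hedged sketch, not the paper's hardware design: the instruction IDs, the encoding of PSets as a Python dict, and the boolean return value (standing in for the processor stalling or flagging an untested interleaving) are all illustrative assumptions.

```python
# Toy model of PSet constraint checking: each memory instruction carries a
# set of remote instructions that were observed as its immediate predecessor
# during testing; any other remote predecessor is an untested interleaving.

class PSetChecker:
    """Tracks, per shared location, the last memory instruction that touched
    it, and validates each access against the tested-predecessor set of the
    current instruction."""

    def __init__(self, psets):
        # psets: dict mapping instruction id -> set of instruction ids that
        # were observed (tested) as its immediate remote predecessor.
        self.psets = psets
        self.last_access = {}  # location -> (thread_id, instr_id)

    def access(self, thread_id, instr_id, location):
        """Return True if this access follows a tested interleaving."""
        prev = self.last_access.get(location)
        ok = True
        if prev is not None and prev[0] != thread_id:
            # Remote predecessor: it must appear in this instruction's PSet.
            ok = prev[1] in self.psets.get(instr_id, set())
        self.last_access[location] = (thread_id, instr_id)
        return ok
```

For example, if testing only ever saw thread 0's write `I1` immediately before thread 1's read `I2` on location `x`, then a different remote predecessor arriving at an instruction with an empty PSet would be flagged as untested.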
Architectural core salvaging in a multi-core processor for hard-error tolerance
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555769
Michael D. Powell, Arijit Biswas, S. Gupta, Shubhendu S. Mukherjee
The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture-time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately, microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core. This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is, execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults, and be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die--assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.
{"title":"Architectural core salvaging in a multi-core processor for hard-error tolerance","authors":"Michael D. Powell, Arijit Biswas, S. Gupta, Shubhendu S. Mukherjee","doi":"10.1145/1555754.1555769","DOIUrl":"https://doi.org/10.1145/1555754.1555769","url":null,"abstract":"The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture-time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core.\u0000 This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults, and be implemented by leveraging known techniques that minimize changes to the microarchitecture. 
We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die--assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"126 1","pages":"93-104"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73004286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 137
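The migrate-on-unsupported-operation idea in the abstract above can be sketched as a small dispatcher. This is a hedged toy model under stated assumptions: the `Core` class, the opcode names, and the software scheduler function are invented for illustration; the actual proposal performs the migration in hardware.

```python
# Toy model of architectural core salvaging: a thread hitting an operation
# its core cannot execute (due to a hard fault in a functional unit) is
# migrated to a core that can, keeping the die as a whole ISA-compliant.

class Core:
    def __init__(self, core_id, broken_units=()):
        self.core_id = core_id
        # Opcodes this core cannot execute, e.g. a faulty FP divider.
        self.broken_units = set(broken_units)

    def can_execute(self, opcode):
        return opcode not in self.broken_units

def run_instruction(cores, current, opcode):
    """Execute opcode on core `current`, migrating to a capable core if the
    current one has a hard fault in the needed unit. Returns the id of the
    core that actually executed the instruction."""
    if cores[current].can_execute(opcode):
        return current
    for core in cores:
        if core.can_execute(opcode):
            return core.core_id
    # The die is ISA-compliant only if every opcode runs on some core.
    raise RuntimeError("no core on the die can execute " + opcode)
```

Common instructions run in place; only the rare offending opcode pays the migration cost, which is why per-die performance can approach that of a fault-free die.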
ECMon: exposing cache events for monitoring
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555798
V. Nagarajan, Rajiv Gupta
The advent of multicores has introduced new challenges for programmers to provide increased performance and software reliability. There has been significant interest in techniques that use software speculation to better utilize the computational power of multicores. At the same time, several recent proposals for ensuring software reliability are not applicable in a multicore setting due to their inability to handle interprocessor shared memory dependences (ISMDs). The demands for performing speculation and ensuring software reliability in a multicore setting, although seemingly different, share a common requirement: the need for monitoring program execution and collecting interprocessor dependence information at low overhead. For example, an important component of speculation is the efficient detection of mis-speculation, which in turn requires dependence information. Likewise, tasks that help ensure software reliability on multicores, including recording for replay, require ISMD information. In this paper, we propose ECMon: support for exposing cache events to the software. This enables the programmer to catch these events and react to them; in effect, efficiently exposing the ISMDs to the programmer. In the context of speculation, we show how ECMon optimizes the detection of mis-speculation; we use this simple support to speculate past active barriers and achieve a speedup of 12% for the set of parallel programs considered. As an application of ensuring software reliability, we show how ECMon can be used to record shared memory dependences on multicores using no specialized hardware support at only 2.8-fold execution time overhead.
{"title":"ECMon: exposing cache events for monitoring","authors":"V. Nagarajan, Rajiv Gupta","doi":"10.1145/1555754.1555798","DOIUrl":"https://doi.org/10.1145/1555754.1555798","url":null,"abstract":"The advent of multicores has introduced new challenges for programmers to provide increased performance and software reliability. There has been significant interest in techniques that use software speculation to better utilize the computational power of multicores. At the same time, several recent proposals for ensuring software reliability are not applicable in a multicore setting due to their inability to handle interprocessor shared memory dependences (ISMDs). The demands for performing speculation and ensuring software reliability in a multicore setting, although seemingly different, share a common requirement: the need for monitoring program execution and collecting interprocessor dependence information at low overhead. For example, an important component of speculation is the effcient detection of missspeculation which in turn requires dependence information. Likewise, tasks that help ensure software reliability on multicores, including recording for replay, require ISMD information.\u0000 In this paper, we propose ECMon: support for exposing cache events to the software. This enables the programmer to catch these events and react to them; in effect, efficiently exposing the ISMDs to the programmer. In the context of speculation, we show how ECMon optimizes the detection of miss-speculation; we use this simple support to speculate past active barriers and achieve a speedup of 12% for the set of parallel programs considered. As an application of ensuring software reliability, we show how ECMon can be used to record shared memory dependences on multicores using no specialized hardware support at only 2.8 fold execution time overhead.","PeriodicalId":91388,"journal":{"name":"Proceedings. 
International Symposium on Computer Architecture","volume":"58 1","pages":"349-360"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80601153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37
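The cache-event-exposure idea in the abstract above can be modeled in software as a coherence-event callback. This is a hedged sketch: the event names ("downgrade", "invalidate") and the handler-registration API are assumptions made for illustration; ECMon's actual mechanism exposes hardware cache events via processor support rather than a software monitor.

```python
# Toy model of exposing cache events: a handler registered by software fires
# whenever a coherence event reveals an interprocessor shared-memory
# dependence (ISMD), e.g. for recording a replay log.

class CacheEventMonitor:
    def __init__(self):
        self.owner = {}      # cache line -> last writer core
        self.handlers = []   # callbacks fired on remote-dependence events

    def register(self, handler):
        self.handlers.append(handler)

    def write(self, core, line):
        prev = self.owner.get(line)
        if prev is not None and prev != core:
            # A remote write invalidates the previous owner's copy.
            self._fire(("invalidate", prev, core, line))
        self.owner[line] = core

    def read(self, core, line):
        prev = self.owner.get(line)
        if prev is not None and prev != core:
            # The reader depends on the last remote writer: an ISMD.
            self._fire(("downgrade", prev, core, line))

    def _fire(self, event):
        for handler in self.handlers:
            handler(event)
```

A replay recorder would simply register a handler that appends each event to a log; local accesses generate no events, which is why the monitoring overhead stays low.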