Inclusive last-level caches (LLCs) waste precious silicon real estate due to cross-level replication of cache blocks. As the industry moves toward cache hierarchies with larger inner levels, this wasted cache space leads to larger performance losses compared to exclusive LLCs. However, exclusive LLCs make the design of replacement policies more challenging. While in an inclusive LLC a block can gather a filtered access history, this is not possible in an exclusive design because the block is de-allocated from the LLC on a hit. As a result, the popular least-recently-used replacement policy and its approximations are rendered ineffective, and the proper choice of insertion ages for cache blocks becomes even more important in exclusive designs. On the other hand, it is not necessary to fill every block into an exclusive LLC. This is known as selective cache bypassing and is not possible to implement in an inclusive LLC because it would violate inclusion. This paper explores insertion and bypass algorithms for exclusive LLCs. Our detailed execution-driven simulation results show that a combination of our best insertion and bypass policies delivers an improvement of up to 61.2% and on average (geometric mean) 3.4% in instructions retired per cycle (IPC) for 97 single-threaded dynamic instruction traces spanning selected SPEC 2006 and server applications, running on a 2 MB 16-way exclusive LLC compared to a baseline exclusive design in the presence of well-tuned multi-stream hardware prefetchers. The corresponding improvements in throughput for 35 4-way multi-programmed workloads running with an 8 MB 16-way shared exclusive LLC are 20.6% (maximum) and 2.5% (geometric mean).
{"title":"Bypass and insertion algorithms for exclusive last-level caches","authors":"Jayesh Gaur, Mainak Chaudhuri, S. Subramoney","doi":"10.1145/2000064.2000075","DOIUrl":"https://doi.org/10.1145/2000064.2000075","url":null,"abstract":"Inclusive last-level caches (LLCs) waste precious silicon estate due to cross-level replication of cache blocks. As the industry moves toward cache hierarchies with larger inner levels, this wasted cache space leads to bigger performance losses compared to exclusive LLCs. However, exclusive LLCs make the design of replacement policies more challenging. While in an inclusive LLC a block can gather a filtered access history, this is not possible in an exclusive design because the block is de-allocated from the LLC on a hit. As a result, the popular least-recently-used replacement policy and its approximations are rendered ineffective and proper choice of insertion ages of cache blocks becomes even more important in exclusive designs. On the other hand, it is not necessary to fill every block into an exclusive LLC. This is known as selective cache bypassing and is not possible to implement in an inclusive LLC because that would violate inclusion. This paper explores insertion and bypass algorithms for exclusive LLCs. Our detailed execution-driven simulation results show that a combination of our best insertion and bypass policies delivers an improvement of up to 61.2% and on average (geometric mean) 3.4% in terms of instructions retired per cycle (IPC) for 97 single-threaded dynamic instruction traces spanning selected SPEC 2006 and server applications, running on a 2 MB 16-way exclusive LLC compared to a baseline exclusive design in the presence of well-tuned multi-stream hardware prefetchers. The corresponding improvements in throughput for 35 4-way multi-programmed workloads running with an 8 MB 16-way shared exclusive LLC are 20.6% (maximum) and 2.5% (geometric mean).","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"476 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117039099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data-intensive computing applications are using more and more memory and are placing an increasing load on the virtual memory system. While large pages can help alleviate the overhead of address translation, they limit the control the operating system has over memory allocation and protection. We present a novel device, the SpecTLB, that exploits the predictable behavior of reservation-based physical memory allocators to interpolate address translations. Our device provides speculative translations for many TLB misses on small pages without referencing the page table. While these interpolations must be confirmed, confirmation can proceed in parallel with speculative execution, effectively hiding the latency of these TLB misses. In simulation, the SpecTLB is able to overlap an average of 57% of page table walks with successful speculative execution over a wide variety of applications. We also show that the SpecTLB outperforms a state-of-the-art TLB prefetching scheme for virtually all tested applications with significant TLB miss rates. Moreover, we show that the SpecTLB is efficient since mispredictions are extremely rare, occurring in less than 1% of TLB misses. In essence, the SpecTLB effectively enables the use of small pages to achieve fine-grained allocation and protection, while avoiding the associated latency penalties of small pages.
{"title":"SpecTLB: A mechanism for speculative address translation","authors":"Thomas W. Barr, A. Cox, S. Rixner","doi":"10.1145/2000064.2000101","DOIUrl":"https://doi.org/10.1145/2000064.2000101","url":null,"abstract":"Data-intensive computing applications are using more and more memory and are placing an increasing load on the virtual memory system. While the use of large pages can help alleviate the overhead of address translation, they limit the control the operating system has over memory allocation and protection. We present a novel device, the SpecTLB, that exploits the predictable behavior of reservation-based physical memory allocators to interpolate address translations. Our device provides speculative translations for many TLB misses on small pages without referencing the page table. While these interpolations must be confirmed, doing so can be done in parallel with speculative execution. This effectively hides the execution latency of these TLB misses. In simulation, the SpecTLB is able to overlap an average of 57% of page table walks with successful speculative execution over a wide variety of applications. We also show that the SpecTLB outperforms a state-of-the-art TLB prefetching scheme for virtually all tested applications with significant TLB miss rates. Moreover, we show that the SpecTLB is efficient since mispredictions are extremely rare, occurring in less than 1% of TLB misses. In essense, the SpecTLB effectively enables the use of small pages to achieve fine-grained allocation and protection, while avoiding the associated latency penalties of small pages.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116382563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chip multiprocessors (CMPs) share a large portion of the memory subsystem among multiple cores. Recent proposals have addressed high-performance and fair management of these shared resources; however, none of them takes prefetch requests into account. Without prefetching, significant performance is lost, which is why existing systems prefetch. Because they ignore prefetch requests, recent shared-resource management proposals often significantly degrade both performance and fairness, rather than improve them, in the presence of prefetching. This paper is the first to propose mechanisms that both manage the shared resources of a multi-core chip to obtain high performance and fairness, and also exploit prefetching. We apply our proposed mechanisms to two resource-based management techniques for memory scheduling and one source-throttling-based management technique for the entire shared memory system. We show that our mechanisms improve the performance of a 4-core system that uses network fair queuing, parallelism-aware batch scheduling, and fairness via source throttling by 11.0%, 10.9%, and 11.3%, respectively, while also significantly improving fairness.
{"title":"Prefetch-aware shared-resource management for multi-core systems","authors":"Eiman Ebrahimi, Chang Joo Lee, O. Mutlu, Y. Patt","doi":"10.1145/2000064.2000081","DOIUrl":"https://doi.org/10.1145/2000064.2000081","url":null,"abstract":"Chip multiprocessors (CMPs) share a large portion of the memory subsystem among multiple cores. Recent proposals have addressed high-performance and fair management of these shared resources; however, none of them take into account prefetch requests. Without prefetching, significant performance is lost, which is why existing systems prefetch. By not taking into account prefetch requests, recent shared-resource management proposals often significantly degrade both performance and fairness, rather than improve them in the presence of prefetching. This paper is the first to propose mechanisms that both manage the shared resources of a multi-core chip to obtain high-performance and fairness, and also exploit prefetching. We apply our proposed mechanisms to two resource-based management techniques for memory scheduling and one source-throttling-based management technique for the entire shared memory system. We show that our mechanisms improve the performance of a 4-core system that uses network fair queuing, parallelism-aware batch scheduling, and fairness via source throttling by 11.0%, 10.9%, and 11.3% respectively, while also significantly improving fairness.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129729630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to shrinking feature sizes, processors are becoming more vulnerable to soft errors. Write-back caches are particularly vulnerable since they hold dirty data that do not exist in other memory levels. While conventional error-correcting codes can protect write-back caches, it has been shown that they are expensive in terms of area and power. This paper proposes a new reliable write-back cache called the Correctable Parity Protected Cache (CPPC), which adds error-correction capability to a parity-protected cache. For this purpose, CPPC augments a write-back parity-protected cache with two registers: the first register stores the XOR of all data written to the cache, and the second register stores the XOR of all dirty data that are removed from the cache. CPPC relies on parity to detect a fault and then on the two XOR registers to correct it. Through a novel combination of byte shifting and parity interleaving, CPPC corrects both single-bit and spatial multi-bit faults to provide a high degree of reliability. We compare CPPC with one-dimensional parity, SECDED (Single Error Correction Double Error Detection), and two-dimensional parity-protected caches. Our simulation results show that CPPC provides a high level of reliability while its overheads are lower than those of SECDED and two-dimensional parity.
{"title":"CPPC: Correctable parity protected cache","authors":"Mehrtash Manoochehri, M. Annavaram, M. Dubois","doi":"10.1145/2000064.2000091","DOIUrl":"https://doi.org/10.1145/2000064.2000091","url":null,"abstract":"Due to shrinking feature sizes processors are becoming more vulnerable to soft errors. Write-back caches are particularly vulnerable since they hold dirty data that do not exist in other memory levels. While conventional error correcting codes can protect write-back caches, it has been shown that they are expensive in terms of area and power. This paper proposes a new reliable write-back cache called Correctable Parity Protected Cache (CPPC) which adds error correction capability to a parity-protected cache. For this purpose, CPPC augments a write-back parity-protected cache with two registers: the first register stores the XOR of all data written to the cache and the second register stores the XOR of all dirty data that are removed from the cache. CPPC relies on parity to detect a fault and then on the two XOR registers to correct faults. By a novel combination of byte shifting and parity interleaving CPPC corrects both single and spatial multi-bit faults to provide a high degree of reliability. We compare CPPC with one-dimensional parity, SECDED (Single Error Correction Double Error Detection) and two-dimensional parity-protected caches. Our simulation results show that CPPC provides a high level of reliability while its overheads are less than the overheads of SECDED and two-dimensional parity.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132345952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Much of the success of the Internet services model can be attributed to the popularity of a class of workloads that we call Online Data-Intensive (OLDI) services. These workloads perform significant computing over massive data sets per user request but, unlike their offline counterparts (such as MapReduce computations), they require responsiveness on the sub-second time scale at high request rates. Large search products, online advertising, and machine translation are examples of workloads in this class. Although the load in OLDI services can vary widely during the day, their energy consumption sees little variance due to the lack of energy proportionality of the underlying machinery. The scale and latency sensitivity of OLDI workloads also make them a challenging target for power management techniques. We investigate what, if anything, can be done to make OLDI systems more energy-proportional. Specifically, we evaluate the applicability of active and idle low-power modes to reduce the power consumed by the primary server components (processor, memory, and disk), while maintaining tight response time constraints, particularly on 95th-percentile latency. Using Web search as a representative example of this workload class, we first characterize a production Web search workload at cluster-wide scale. We provide a fine-grained characterization and expose the opportunity for power savings using low-power modes of each primary server component. Second, we develop and validate a performance model to evaluate the impact of processor- and memory-based low-power modes on the search latency distribution and consider the benefit of current and foreseeable low-power modes. Our results highlight the challenges of power management for this class of workloads. In contrast to other server workloads, for which idle low-power modes have shown great promise, for OLDI workloads we find that energy proportionality with acceptable query latency can only be achieved using coordinated, full-system active low-power modes.
{"title":"Power management of online data-intensive services","authors":"David Meisner, Christopher M. Sadler, L. Barroso, W. Weber, T. Wenisch","doi":"10.1145/2000064.2000103","DOIUrl":"https://doi.org/10.1145/2000064.2000103","url":null,"abstract":"Much of the success of the Internet services model can be attributed to the popularity of a class of workloads that we call Online Data-Intensive (OLDI) services. These work-loads perform significant computing over massive data sets per user request but, unlike their offline counterparts (such as MapReduce computations), they require responsiveness in the sub-second time scale at high request rates. Large search products, online advertising, and machine translation are examples of workloads in this class. Although the load in OLDI services can vary widely during the day, their energy consumption sees little variance due to the lack of energy proportionality of the underlying machinery. The scale and latency sensitivity of OLDI workloads also make them a challenging target for power management techniques. We investigate what, if anything, can be done to make OLDI systems more energy-proportional. Specifically, we evaluate the applicability of active and idle low-power modes to reduce the power consumed by the primary server components (processor, memory, and disk), while maintaining tight response time constraints, particularly on 95th-percentile latency. Using Web search as a representative example of this workload class, we first characterize a production Web search workload at cluster-wide scale. We provide a fine-grain characterization and expose the opportunity for power savings using low-power modes of each primary server component. Second, we develop and validate a performance model to evaluate the impact of processor- and memory-based low-power modes on the search latency distribution and consider the benefit of current and foreseeable low-power modes. Our results highlight the challenges of power management for this class of workloads. In contrast to other server workloads, for which idle low-power modes have shown great promise, for OLDI workloads we find that energy-proportionality with acceptable query latency can only be achieved using coordinated, full-system active low-power modes.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125598920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Blocked-execution multiprocessor architectures continuously run atomic blocks of instructions, also called Chunks. Such architectures can boost both performance and software productivity, and they enable unique compiler optimization opportunities. Unfortunately, they are handicapped in that, if they use large chunks to minimize chunk-commit overhead and to enable more compiler optimization, inter-thread data conflicts may lead to frequent chunk squashes. In this paper, we present automatic techniques for forming chunks in these architectures so as to minimize the cycles lost to squashes. We start by characterizing the operations that frequently cause squashes; we call them Squash Hazards. We then propose squash-removing algorithms tailored to these Squash Hazards. We also describe a software framework called FlexBulk that profiles the code and transforms it following these algorithms. We evaluate FlexBulk on 16-threaded PARSEC and SPLASH-2 codes running on a simulated machine. The results show that, with 17,000-instruction chunks, FlexBulk eliminates, on average, over 90% of the squash cycles in the applications. As a result, compared to a baseline execution with 2,000-instruction chunks as in previous work, the applications run on average 1.43x faster.
{"title":"FlexBulk: Intelligently forming atomic blocks in blocked-execution multiprocessors to minimize squashes","authors":"Rishi Agarwal, J. Torrellas","doi":"10.1145/2000064.2000070","DOIUrl":"https://doi.org/10.1145/2000064.2000070","url":null,"abstract":"Blocked-execution multiprocessor architectures continuously run atomic blocks of instructions - also called Chunks. Such architectures can boost both performance and software productivity, and enable unique compiler optimization opportunities. Unfortunately, they are handicapped in that, if they use large chunks to minimize chunk-commit overhead and to enable more compiler optimization, inter-thread data conflicts may lead to frequent chunk squashes. In this paper, we present automatic techniques to form chunks in these architectures to minimize the cycles lost to squashes. We start by characterizing the operations that frequently cause squashes. We call them Squash Hazards. We then propose squash-removing algorithms tailored to these Squash Hazards. We also describe a software framework called FlexBulk that profiles the code and transforms it following these algorithms. We evaluate FlexBulk on 16-threaded PARSEC and SPLASH-2 codes running on a simulated machine. The results show that, with 17,000-instruction chunks, FlexBulk eliminates, on average, over 90% of the squash cycles in the applications. As a result, compared to a baseline execution with 2,000-instruction chunks as in previous work, the applications run on average 1.43x faster.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127756998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance-asymmetric multi-cores consist of heterogeneous cores that support the same ISA but have different computing capabilities. To maximize the throughput of asymmetric multi-core systems, operating systems are responsible for scheduling threads to the different types of cores. However, system virtualization poses a challenge for such asymmetric multi-cores, since virtualization hides the physical heterogeneity from guest operating systems. In this paper, we explore the design space of hypervisor schedulers for asymmetric multi-cores; these schedulers do not require asymmetry-awareness from guest operating systems. The proposed scheduler characterizes the efficiency of each virtual core and maps each virtual core to the most area-efficient physical core. In addition to overall system throughput, we consider two important aspects of virtualizing asymmetric multi-cores: performance fairness among virtual machines and performance scalability for changing availability of fast and slow cores. We have implemented an asymmetry-aware scheduler in the open-source Xen hypervisor. Using applications with various characteristics, we evaluate how effectively the proposed scheduler can improve system throughput without asymmetry-aware operating systems. The modified scheduler improves the performance of the Xen credit scheduler by as much as 40% on a 12-core system with four fast and eight slow cores. The results show that even the VMs scheduled to slow cores suffer relatively small performance degradation, and that the scheduler provides scalable performance with increasing fast core counts.
{"title":"Virtualizing performance asymmetric multi-core systems","authors":"Youngjin Kwon, Changdae Kim, S. Maeng, Jaehyuk Huh","doi":"10.1145/2000064.2000071","DOIUrl":"https://doi.org/10.1145/2000064.2000071","url":null,"abstract":"Performance-asymmetric multi-cores consist of heterogeneous cores, which support the same ISA, but have different computing capabilities. To maximize the throughput of asymmetric multi-core systems, operating systems are responsible for scheduling threads to different types of cores. However, system virtualization poses a challenge for such asymmetric multi-cores, since virtualization hides the physical heterogeneity from guest operating systems. In this paper, we explore the design space of hypervisor schedulers for asymmetric multi-cores, which do not require asymmetry-awareness from guest operating systems. The proposed scheduler characterizes the efficiency of each virtual core, and map the virtual core to the most area-efficient physical core. In addition to the overall system throughput, we consider two important aspects of virtualizing asymmetric multi-cores: performance fairness among virtual machines and performance scalability for changing availability of fast and slow cores. We have implemented an asymmetry-aware scheduler in the open-source Xen hypervisor. Using applications with various characteristics, we evaluate how effectively the proposed scheduler can improve system throughput without asymmetry-aware operating systems. The modified scheduler improves the performance of the Xen credit scheduler by as much as 40% on a 12-core system with four fast and eight slow cores. The results show that even the VMs scheduled to slow cores have relatively low performance degradations, and the scheduler provides scalable performance with increasing fast core counts.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115414261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Local thermal hot-spots in microprocessors lead to worst-case provisioning of global cooling resources, especially in large-scale systems where cooling power can be 50-100% of IT power. Further, the efficiency of cooling solutions degrades non-linearly with supply temperature. Recent advances in active cooling techniques have shown on-chip thermoelectric coolers (TECs) to be very efficient at selectively eliminating small hot-spots. Applying current to a superlattice TEC film deposited between the silicon and the heat spreader produces a Peltier effect, which spreads the heat, lowers the temperature of the hot-spot significantly, and improves chip reliability. In this paper, we propose that hot-spot mitigation using thermoelectric coolers can be used as a power management mechanism that allows global coolers to be provisioned for a better worst-case temperature, leading to substantial savings in cooling power. In order to quantify the potential power savings from using TECs in data center servers, we present a detailed power model that integrates on-chip dynamic and leakage power sources, heat diffusion through the entire chip, TEC and global cooler efficiencies, and all their mutual interactions. Our multi-scale analysis shows that, for a typical data center, TECs allow global coolers to operate at higher temperatures without degrading chip lifetime, and thus save about 27% of cooling power on average while providing the same processor reliability as a data center running at 288 K.
{"title":"Fighting fire with fire: Modeling the datacenter-scale effects of targeted superlattice thermal management","authors":"Susmit Biswas, Mohit Tiwari, T. Sherwood, L. Theogarajan, F. Chong","doi":"10.1145/2000064.2000104","DOIUrl":"https://doi.org/10.1145/2000064.2000104","url":null,"abstract":"Local thermal hot-spots in microprocessors lead to worst-case provisioning of global cooling resources, especially in large-scale systems where cooling power can be 50~100% of IT power. Further, the efficiency of cooling solutions degrade non-linearly with supply temperature. Recent advances in active cooling techniques have shown on-chip thermoelectric coolers (TECs) to be very efficient at selectively eliminating small hot-spots. Applying current to a superlattice TEC-film that is deposited between silicon and the heat spreader results in a Peltier effect, which spreads the heat and lowers the temperature of the hot-spot significantly and improves chip reliability. In this paper, we propose that hot-spot mitigation using thermoelectric coolers can be used as a power management mechanism to allow global coolers to be provisioned for a better worst case temperature leading to substantial savings in cooling power. In order to quantify the potential power savings from using TECs in data center servers, we present a detailed power model that integrates on-chip dynamic and leakage power sources, heat diffusion through the entire chip, TEC and global cooler efficiencies, and all their mutual interactions. Our multi-scale analysis shows that, for a typical data center, TECs allow global coolers to operate at higher temperatures without degrading chip lifetime, and thus save ~27% cooling power on average while providing the same processor reliability as a data center running at 288K.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127525569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applications' traffic tends to be bursty, and the location of hot-spot nodes moves over time. This significantly aggravates the blocking problem of wormhole-routed Networks-on-Chip (NoCs). Most state-of-the-art traffic-balancing solutions are based on fully adaptive routing algorithms, which may introduce large time/space overheads in routers. Partially adaptive routing algorithms, on the other hand, are time/space efficient, but lack even or sufficient routing adaptiveness. Reconfigurable routing algorithms could provide on-demand routing adaptiveness for reducing blocking, but most of them are off-line solutions due to the lack of a practical model for dynamically generating deadlock-free routing algorithms. In this paper, we propose the abacus-turn-model (AbTM) for designing time/space-efficient reconfigurable wormhole routing algorithms. Unlike the original turn model, AbTM exploits dynamic communication patterns in applications to reduce routing latency and chip area requirements. We apply forbidden turns dynamically to preserve deadlock-free operation. Our AbTM routing architecture has two distinct advantages. First, AbTM leads to a new router architecture without adding virtual channels or a routing table. This reconfigurable architecture updates the routing paths once the communication pattern changes, and always provides full adaptiveness toward hot-spot directions to reduce network blocking. Second, the reconfiguration scheme scales well because all operations are carried out between neighbors. We demonstrate these advantages through extensive simulation experiments. The experimental results are encouraging and demonstrate the applicability of AbTM, with scalable performance, in large-scale NoC applications.
{"title":"An abacus turn model for time/space-efficient reconfigurable routing","authors":"Binzhang Fu, Yinhe Han, Jun Ma, Huawei Li, Xiaowei Li","doi":"10.1145/2000064.2000096","DOIUrl":"https://doi.org/10.1145/2000064.2000096","url":null,"abstract":"Applications' traffic tends to be bursty and the location of hot-spot nodes moves as time goes by. This will significantly aggregate the blocking problem of wormhole-routed Network-on-Chip (NoC). Most of state-of-the-art traffic balancing solutions are based on fully adaptive routing algorithms which may introduce large time/space overhead to routers. Partially adaptive routing algorithms, on the other hand, are time/space efficient, but lack of even or sufficient routing adaptiveness. Reconfigurable routing algorithms could provide on-demand routing adaptiveness for reducing blocking, but most of them are off-line solutions due to the lack of a practical model to dynamically generate deadlock-free routing algorithms. In this paper, we propose the abacus-turn-model (AbTM) for designing time/space-efficient reconfigurable wormhole routing algorithms. Unlike the original turn model, AbTM exploits dynamic communication patterns in applications to reduce the routing latency and chip area requirements. We apply forbidden turns dynamically to preserve deadlock-free operations. Our AbTM routing architecture has two distinct advantages: First, the AbTM leads to a new router architecture without adding virtual channels and routing table. This reconfigurable architecture updates the routing path once the communication pattern changes, and always provides full adaptiveness to hot-spot directions to reduce network blocking. Secondly, the reconfiguration scheme has a good scalability because all operations are carried out between neighbors. We demonstrate these advantages through extensive simulation experiments. The experimental results are indeed encouraging and prove its applicability with scalable performance in large-scale NoC applications.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115638971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emerging memory technologies such as STT-RAM, PCRAM, and resistive RAM are being explored as potential replacements for existing on-chip caches or main memories in future multi-core architectures. This is due to the many attractive features these memory technologies possess: high density, low leakage, and non-volatility. However, the latency and energy overhead associated with the write operations of these emerging memories has become a major obstacle to their adoption. Previous works have proposed various circuit- and architectural-level solutions to mitigate the write overhead. In this paper, we study the integration of STT-RAM in a 3D multi-core environment and propose solutions at the on-chip network level to circumvent the write overhead problem in a cache architecture built with STT-RAM technology. Our scheme is based on the observation that instead of staggering requests to a write-busy STT-RAM bank, the network should schedule requests to other idle cache banks to effectively hide the latency. Thus, we prioritize cache accesses to the idle banks by delaying accesses to the STT-RAM cache banks that are currently serving long-latency write requests. Through a detailed characterization of the cache access patterns of 42 applications, we propose an efficient mechanism to facilitate such delayed writes to cache banks by (a) accurately estimating the busy time of each cache bank through logical partitioning of the cache layer and (b) prioritizing packets in a router that request accesses to idle banks. Evaluations on a 3D architecture consisting of 64 cores and 64 STT-RAM cache banks show that our proposed approach provides a 14% average IPC improvement for multi-threaded benchmarks, 19% instruction throughput benefits for multi-programmed workloads, and a 6% latency reduction compared to a recently proposed write-buffering mechanism.
{"title":"Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs","authors":"Asit K. Mishra, Xiangyu Dong, Guangyu Sun, Yuan Xie, N. Vijaykrishnan, C. Das","doi":"10.1145/2000064.2000074","DOIUrl":"https://doi.org/10.1145/2000064.2000074","url":null,"abstract":"Emerging memory technologies such as STT-RAM, PCRAM, and resistive RAM are being explored as potential replacements to existing on-chip caches or main memories for future multi-core architectures. This is due to the many attractive features these memory technologies posses: high density, low leakage, and non-volatility. However, the latency and energy overhead associated with the write operations of these emerging memories has become a major obstacle in their adoption. Previous works have proposed various circuit and architectural level solutions to mitigate the write overhead. In this paper, we study the integration of STT-RAM in a 3D multi-core environment and propose solutions at the on-chip network level to circumvent the write overhead problem in the cache architecture with STT-RAM technology. Our scheme is based on the observation that instead of staggering requests to a write-busy STT-RAM bank, the network should schedule requests to other idle cache banks for effectively hiding the latency. Thus, we prioritize cache accesses to the idle banks by delaying accesses to the STTRAM cache banks that are currently serving long latency write requests. Through a detailed characterization of the cache access patterns of 42 applications, we propose an efficient mechanism to facilitate such delayed writes to cache banks by (a) accurately estimating the busy time of each cache bank through logical partitioning of the cache layer and (b) prioritizing packets in a router requesting accesses to idle banks. Evaluations on a 3D architecture, consisting of 64 cores and 64 STT-RAM cache banks, show that our proposed approach provides 14% average IPC improvement for multi-threaded benchmarks, 19% instruction throughput benefits for multi-programmed workloads, and 6% latency reduction compared to a recently proposed write buffering mechanism.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123447219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}