
Latest publications in Proceedings. International Symposium on Computer Architecture

Dynamic performance tuning for speculative threads
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555812
Yangchun Luo, Venkatesan Packirisamy, W. Hsu, Antonia Zhai, Nikhil Mungre, Ankit Tarkas
In response to the emergence of multicore processors, various novel and sophisticated execution models have been introduced to fully utilize these processors. One such execution model is Thread-Level Speculation (TLS), which allows potentially dependent threads to execute speculatively in parallel. While TLS offers significant performance potential for applications that are otherwise non-parallel, extracting efficient speculative threads in the presence of complex control flow and ambiguous data dependences is a real challenge. This task is further complicated by the fact that the performance of speculative threads is often architecture-dependent and input-sensitive, and exhibits phase behavior. Thus we propose dynamic performance tuning mechanisms that determine where and how to create speculative threads at runtime. This paper describes the design, implementation, and evaluation of hardware and software support that takes advantage of runtime performance profiles to extract efficient speculative threads. In our proposed framework, speculative threads are monitored by hardware-based performance counters and their performance impact is estimated. The creation of speculative threads is adjusted based on this estimation. This paper proposes speculative-thread performance estimation techniques that are capable of correctly determining whether speculation can improve performance for loops that correspond to 83.8% of total loop execution time across all benchmarks. This paper also examines several dynamic performance tuning policies and finds that the best tuning policy achieves an overall speedup of 36.8% on a set of benchmarks from the SPEC2000 suite, outperforming static thread management by 9.5%.
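As a rough illustration of the tuning loop the abstract describes, the sketch below estimates, per loop, whether speculation pays off and toggles thread creation accordingly. It is a minimal software model: the counter names, the benefit formula, and the optimistic default are illustrative assumptions, not the paper's hardware mechanism.

```python
# Minimal sketch of dynamic performance tuning for speculative threads.
# Counter names and the benefit model are illustrative assumptions.

def estimate_benefit(counters):
    """Estimated speedup of speculating on one loop, from profiled cycles."""
    seq = counters["sequential_cycles"]        # cost of running serially
    par = counters["parallel_cycles"]          # cycles with speculation on
    wasted = counters["squashed_cycles"]       # work lost to misspeculation
    overhead = counters["spawn_cycles"]        # thread-creation cost
    return seq / max(par + wasted + overhead, 1)

class SpeculationTuner:
    """Keep speculating only on loops whose estimated speedup exceeds 1."""
    def __init__(self):
        self.decisions = {}                    # loop id -> speculate?

    def update(self, loop_id, counters):
        self.decisions[loop_id] = estimate_benefit(counters) > 1.0

    def should_speculate(self, loop_id):
        return self.decisions.get(loop_id, True)  # optimistic until measured
```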
Citations: 38
Thread motion: fine-grained power management for multi-core systems
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555793
K. Rangan, Gu-Yeon Wei, D. Brooks
Dynamic voltage and frequency scaling (DVFS) is a commonly-used power-management scheme that dynamically adjusts power and performance to the time-varying needs of running programs. Unfortunately, conventional DVFS, relying on off-chip regulators, faces limitations in terms of temporal granularity and high costs when considered for future multi-core systems. To overcome these challenges, this paper presents thread motion (TM), a fine-grained power-management scheme for chip multiprocessors (CMPs). Instead of incurring the high cost of changing the voltage and frequency of different cores, TM enables rapid movement of threads to adapt the time-varying computing needs of running applications to a mixture of cores with fixed but different power/performance levels. Results show that for the same power budget, two voltage/frequency levels are sufficient to provide performance gains commensurate with idealized scenarios using per-core voltage control. Thread motion extends workload-based power management into the nanosecond realm and, for a given power budget, provides up to 20% better performance than coarse-grained DVFS.
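The core idea of moving threads between fixed-voltage cores instead of re-regulating each core can be sketched as a per-interval reassignment. The demand metric and the two-level fast/slow split below are assumptions for illustration.

```python
# Toy thread-motion policy: every short interval, remap threads onto a
# fixed mix of fast (high-V/f) and slow cores, rather than changing any
# core's voltage. The demand metric here is an illustrative assumption.

def assign_threads(demands, n_fast):
    """Give the n_fast most demanding threads the fast cores this interval.

    demands: {thread_id: predicted benefit of running on a fast core}
    Returns {thread_id: "fast" | "slow"}.
    """
    ranked = sorted(demands, key=demands.get, reverse=True)
    return {t: ("fast" if i < n_fast else "slow") for i, t in enumerate(ranked)}

# Example: 4 threads sharing 2 fast and 2 slow cores; the mapping is
# recomputed each interval, so threads "move" at fine time granularity.
print(assign_threads({"t0": 1.8, "t1": 0.6, "t2": 1.2, "t3": 0.9}, n_fast=2))
```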
Citations: 266
Indirect adaptive routing on large scale interconnection networks
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555783
Nan Jiang, John Kim, W. Dally
Recently proposed high-radix interconnection networks [10] require global adaptive routing to achieve optimum performance. Existing direct adaptive routing methods are slow to sense congestion remote from the source router and hence misroute many packets before such congestion is detected. This paper introduces indirect global adaptive routing (IAR) in which the adaptive routing decision uses information that is not directly available at the source router. We describe four IAR routing methods: credit round trip (CRT) [10], progressive adaptive routing (PAR), piggyback routing (PB), and reservation routing (RES). We evaluate each of these methods on the dragonfly topology under both steady-state and transient loads. Our results show that PB, PAR, and CRT all achieve good performance. PB provides the best absolute performance, with 2-7% lower latency on steady-state uniform random traffic at 70% load, while PAR provides the fastest response on transient loads. We also evaluate the implementation costs of the indirect adaptive routing methods and show that PB has the lowest implementation cost requiring <1% increase in the total storage of a typical high-radix router.
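To make the minimal-versus-nonminimal decision concrete, here is a sketch of a piggyback-style choice at the source router; the queue thresholds and the 2x path-length weighting are assumptions, not the paper's exact policy.

```python
# Sketch of a piggyback-style (PB) routing decision: remote congestion
# state arrives piggybacked on regular packets, and the source combines it
# with local queue depths. Thresholds/weights are illustrative assumptions.

def choose_route(q_minimal, q_nonminimal, dest_congested):
    """Pick 'minimal' or 'nonminimal' (Valiant detour) for one packet."""
    if dest_congested:
        # Indirect information: the minimal path leads into congestion, so
        # prefer the detour unless its own local queue is far worse.
        return "nonminimal" if q_nonminimal <= 2 * q_minimal else "minimal"
    # Otherwise a UGAL-like local comparison: the non-minimal path is about
    # twice as long, so its queue depth is weighted accordingly.
    return "minimal" if q_minimal <= 2 * q_nonminimal else "nonminimal"
```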
Citations: 119
Performance and power of cache-based reconfigurable computing
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555804
Andrew Putnam, S. Eggers, Dave Bennett, E. Dellinger, J. Mason, Henry Styles, P. Sundararajan, Ralph Wittig
Many-cache is a memory architecture that efficiently supports caching in commercially available FPGAs. It facilitates FPGA programming for high-performance computing (HPC) developers by providing them with greater memory performance and lower power consumption than their current CPU platforms, but without sacrificing their familiar, C-based programming environment. Many-cache creates multiple, multi-banked caches on top of an FPGA's small, independent memories, each targeting a particular data structure or region of memory in an application and each customized for the memory operations that access it. The caches are automatically generated from C source by the CHiMPS C-to-FPGA compiler. This paper presents the analyses and optimizations of the CHiMPS compiler that construct many-cache caches. An architectural evaluation of CHiMPS-generated FPGAs demonstrates a performance advantage of 7.8x (geometric mean) over CPU-only execution of the same source code, FPGA power usage that is on average 4.1x less, and consequently performance per watt that is also greater, by a geometric mean of 21.3x.
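A toy model of the many-cache idea (one small cache per data structure instead of one shared cache) is sketched below; the region names, sizes, and line size are illustrative, and the real caches are generated in FPGA memories by the compiler rather than in software.

```python
# Toy model of many-cache: each data structure gets its own small cache
# sized for its locality. Region names and sizes are assumptions; in the
# real system the CHiMPS compiler emits these caches from C source.

LINE = 16  # bytes per line in this toy model

class DirectMappedCache:
    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.tags = [None] * n_lines
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // LINE
        slot = line % self.n_lines
        if self.tags[slot] == line:
            self.hits += 1
        else:
            self.tags[slot] = line
            self.misses += 1

# A streaming array and a small hot table each get a tailored cache.
caches = {"stream": DirectMappedCache(64), "hot_table": DirectMappedCache(8)}
for i in range(1024):
    caches["stream"].access(i * 4)            # sequential sweep
    caches["hot_table"].access((i % 32) * 4)  # small, heavily reused region
for name, c in caches.items():
    print(f"{name}: hit rate {c.hits / (c.hits + c.misses):.2f}")
```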
Citations: 33
The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555790
Cagdas Dirik, B. Jacob
As their prices decline, their storage capacities increase, and their endurance improves, NAND Flash Solid State Disks (SSD) provide an increasingly attractive alternative to Hard Disk Drives (HDD) for portable computing systems and PCs. This paper presents a study of NAND Flash SSD architectures and their management techniques, quantifying SSD performance under user-driven/PC applications in a multi-tasked environment; user activity represents typical PC workloads and includes browsing files and folders, emailing, text editing and document creation, surfing the web, listening to music and playing movies, editing large pictures, and running office applications. We find the following: (a) the real limitation to NAND Flash memory performance is not its low per-device bandwidth but its internal core interface; (b) NAND Flash memory media transfer rates do not need to scale up to those of HDDs for good performance; (c) SSD organizations that exploit concurrency at both the system and device level (e.g., RAID-like organizations and Micron-style superblocks) improve performance significantly; and (d) these system- and device-level concurrency mechanisms are, to a significant degree, orthogonal: that is, the performance increase due to one does not come at the expense of the other, as each exploits a different facet of concurrency exhibited within the PC workload.
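Finding (c), that exploiting concurrency matters more than raw media transfer rate, can be illustrated with a back-of-the-envelope striping model; the flash timings below are assumptions, not the paper's measurements.

```python
# Back-of-the-envelope model of device-level concurrency in an SSD:
# striping a multi-page read across N channels overlaps the per-page work.
# The timings are illustrative assumptions, not measured values.

ARRAY_US = 25.0    # flash array read time per 4 KB page (assumed)
XFER_US = 100.0    # per-page transfer time on one channel (assumed)

def read_latency_us(pages, channels):
    """Pages striped round-robin; transfers on one channel serialize."""
    per_channel = -(-pages // channels)   # ceiling division
    return ARRAY_US + per_channel * XFER_US

for ch in (1, 2, 4, 8):
    print(f"{ch} channel(s): {read_latency_us(8, ch):.0f} us for a 32 KB read")
```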
Citations: 211
Temperature-constrained power control for chip multiprocessors with online model estimation
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555794
Yefu Wang, Kai Ma, Xiaorui Wang
As chip multiprocessors (CMP) become the main trend in processor development, various power and thermal management strategies have recently been proposed to optimize system performance while controlling the power or temperature of a CMP chip to stay below a constraint. The availability of per-core DVFS (dynamic voltage and frequency scaling) also makes it possible to develop advanced management strategies. However, most existing solutions rely on open-loop search or optimization with the assumption that power can be estimated accurately, while others adopt oversimplified feedback control strategies to control power and temperature separately, without any theoretical guarantees. In this paper, we propose a chip-level power control algorithm that is systematically designed based on optimal control theory. Our algorithm can precisely control the power of a CMP chip to the desired set point while maintaining the temperature of each core below a specified threshold. Furthermore, an online model estimator is designed to achieve analytical assurance of control accuracy and system stability, even in the face of significant workload variations or unpredictable chip or core variations. Empirical results on a physical testbed show that our controller outperforms two state-of-the-art control algorithms by having better SPEC benchmark performance and more precise power control. In addition, extensive simulation results demonstrate the efficacy of our algorithm for various CMP configurations.
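The structure of the controller (online model estimation feeding a power-tracking loop with a per-core temperature clamp) can be sketched as below. This is a simplified proportional loop with assumed gains and limits; the paper's controller is formally designed from optimal control theory.

```python
# Simplified sketch of temperature-constrained chip power control with
# online model estimation. Gains, DVFS range, and the linear power model
# are assumptions; the paper derives its controller from control theory.

class PowerController:
    def __init__(self, budget_w, temp_limit_c):
        self.budget = budget_w
        self.temp_limit = temp_limit_c
        self.slope = 2.0        # est. watts per GHz (chip total), initial guess
        self.prev = None        # (sum of core freqs, measured power)

    def control(self, freqs, power_w, temps_c):
        # Online model estimation: refine the power/frequency slope from
        # the last interval's observed change (a smoothed secant update).
        total_f = sum(freqs)
        if self.prev and abs(total_f - self.prev[0]) > 1e-3:
            observed = (power_w - self.prev[1]) / (total_f - self.prev[0])
            self.slope += 0.5 * (observed - self.slope)
        self.prev = (total_f, power_w)
        # Feedback: shift total frequency toward the power set point.
        delta = (self.budget - power_w) / max(self.slope, 0.1)
        out = []
        for f, t in zip(freqs, temps_c):
            f = min(max(f + delta / len(freqs), 0.8), 3.0)  # DVFS range (assumed)
            if t > self.temp_limit:
                f = min(f, 1.0)   # throttle an overheated core
            out.append(f)
        return out
```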
Citations: 217
Reactive NUCA: near-optimal block placement and replication in distributed caches
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555779
N. Hardavellas, M. Ferdman, B. Falsafi, A. Ailamaki
Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.
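The class-to-placement mapping that R-NUCA reacts to can be sketched directly; the three classes follow the abstract's description, while the tile count, cluster size, and indexing below are simplified assumptions.

```python
# Sketch of class-based block placement in the spirit of R-NUCA. The three
# access classes follow the description above; the 16-tile layout, 4-tile
# replication cluster, and address interleaving are simplified assumptions.

N_TILES = 16  # 4x4 tiled CMP (assumed)

def placement(block_addr, access_class, requesting_tile):
    """Return the set of cache slices allowed to hold this block."""
    if access_class == "private_data":
        # Always placed at the requester: fast, and needs no coherence.
        return {requesting_tile}
    if access_class == "instructions":
        # Read-only and shared: replicate within a nearby fixed cluster so
        # fetches stay close without flooding every slice with copies.
        base = (requesting_tile // 4) * 4
        return set(range(base, base + 4))
    # Shared read-write data: a single address-interleaved home slice,
    # giving shared-cache capacity without an explicit coherence mechanism.
    return {block_addr % N_TILES}
```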
Citations: 419
Boosting single-thread performance in multi-core systems through fine-grain multi-threading
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555813
C. Madriles, P. López, J. M. Codina, E. Gibert, Fernando Latorre, Alejandro Martínez, Raúl Martínez, Antonio González
Industry has shifted towards multi-core designs as we have hit the memory and power walls. However, single-thread performance remains of paramount importance, since some applications have limited thread-level parallelism (TLP), and even a small portion with limited TLP imposes important constraints on global performance, as explained by Amdahl's law. In this paper we propose a novel approach for leveraging multiple cores to improve single-thread performance in a multi-core design. The proposed technique features a set of novel hardware mechanisms that support the execution of threads generated at compile time. These threads result from a fine-grain speculative decomposition of the original application and they are executed under a modified multi-core system that includes: (1) mechanisms to support multiple versions; (2) mechanisms to detect violations among threads; (3) mechanisms to reconstruct the original sequential order; and (4) mechanisms to checkpoint the architectural state and recover from misspeculations. The proposed scheme outperforms previous hardware-only schemes for combining cores to execute single-thread applications in a multi-core design by more than 10% on average on SPEC2006 across all configurations. Moreover, single-thread performance is improved by 41% on average when the proposed scheme is used on a Tiny Core, and by up to 2.6x for some selected applications.
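Mechanisms (2) and (3) in the list above, violation detection and sequential reconstruction, can be illustrated with a small software model: each speculative thread buffers writes and records reads, and commit proceeds in the reconstructed program order, squashing younger readers of newly committed locations. The data structures are illustrative, not the paper's hardware design.

```python
# Software illustration of mechanisms (2) and (3) above: buffered
# speculative state, cross-thread violation detection, and in-order
# commit. A simplified model, not the paper's hardware design.

class SpecThread:
    def __init__(self, order):
        self.order = order            # position in the sequential order
        self.reads, self.writes = set(), {}
        self.squashed = False

    def load(self, addr, memory):
        self.reads.add(addr)
        # Forward the thread's own buffered write if one exists (versioning).
        return self.writes.get(addr, memory.get(addr, 0))

    def store(self, addr, value):
        self.writes[addr] = value     # buffered until this thread commits

def commit_in_order(threads, memory):
    """Commit oldest-first; a younger thread that read a location an older
    thread now commits has used a possibly stale value and is squashed."""
    for t in sorted(threads, key=lambda th: th.order):
        if t.squashed:
            continue                  # a squashed thread would re-execute
        for addr, value in t.writes.items():
            for younger in threads:
                if younger.order > t.order and addr in younger.reads:
                    younger.squashed = True
            memory[addr] = value
```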
Citations: 28
Achieving predictable performance through better memory controller placement in many-core CMPs
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555810
D. Abts, Natalie D. Enright Jerger, John Kim, Dan Gibson, Mikko H. Lipasti
In the near term, Moore's law will continue to provide an increasing number of transistors and therefore an increasing number of on-chip cores. Limited pin bandwidth prevents the integration of a large number of memory controllers on-chip. With many cores, and few memory controllers, where to locate the memory controllers in the on-chip interconnection fabric becomes an important and as yet unexplored question. In this paper we show how the location of the memory controllers can reduce contention (hot spots) in the on-chip fabric and lower the variance in reference latency. This in turn provides predictable performance for memory-intensive applications regardless of the processing core on which a thread is scheduled. We explore the design space of on-chip fabrics to find optimal memory controller placement relative to different topologies (i.e. mesh and torus), routing algorithms, and workloads.
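The effect of placement is easy to reproduce in miniature: compare candidate layouts by the mean and variance of hop distance from every core to its nearest controller. The 8x8 mesh and the two layouts below are illustrative, not the paper's evaluated configurations.

```python
# Miniature version of the placement question: score memory-controller
# layouts on an 8x8 mesh by the mean and variance of Manhattan hops from
# each core to its nearest controller. Layouts here are illustrative.

from itertools import product
from statistics import mean, pvariance

N = 8
TOP_ROW = [(0, c) for c in range(N)]          # all controllers on one edge
SPREAD = [(1, 1), (1, 6), (6, 1), (6, 6),     # controllers spread through
          (3, 3), (3, 4), (4, 3), (4, 4)]     # the fabric

def hops_to_nearest(layout):
    return [min(abs(r - mr) + abs(c - mc) for mr, mc in layout)
            for r, c in product(range(N), repeat=2)]

for name, layout in (("top row", TOP_ROW), ("spread", SPREAD)):
    h = hops_to_nearest(layout)
    print(f"{name:7s} mean {mean(h):.2f} hops, variance {pvariance(h):.2f}")
```

Lower variance across cores is what makes reference latency, and hence performance, predictable regardless of where a thread is scheduled.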
Citations: 161
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches
Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555778
Yuejian Xie, G. Loh
Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.
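The single mechanism the abstract refers to can be sketched for one cache set: a block is inserted at a priority depth set by its core's target allocation, and a hit promotes it by only one position instead of jumping to the top. The target allocations would come from a separate monitoring mechanism; all names and values below are illustrative.

```python
# Sketch of PIPP-style promotion/insertion pseudo-partitioning for one
# shared-cache set. Target allocations (alloc) would be produced by a
# separate monitoring mechanism; values here are illustrative.

class PIPPSet:
    def __init__(self, ways, alloc):
        self.ways = ways
        self.alloc = alloc            # core id -> target partition, in ways
        self.stack = []               # tags; index 0 = highest priority

    def access(self, core, tag):
        """Return True on a hit; one priority stack yields partitioning,
        adaptive insertion, and capacity stealing together."""
        if tag in self.stack:
            i = self.stack.index(tag)
            if i > 0:                 # hit: promote by one position only
                self.stack[i - 1], self.stack[i] = self.stack[i], self.stack[i - 1]
            return True
        if len(self.stack) == self.ways:
            self.stack.pop()          # miss: evict the lowest-priority block
        # Insert alloc[core] positions above the eviction end: cores with a
        # larger allocation hold blocks longer, approximating a partition
        # without hard per-core way reservations.
        pos = min(max(self.ways - self.alloc[core], 0), len(self.stack))
        self.stack.insert(pos, tag)
        return False

# Example: core 0 is allocated 6 of 8 ways, core 1 only 2.
s = PIPPSet(ways=8, alloc={0: 6, 1: 2})
for t in range(6):
    s.access(0, ("core0", t))
s.access(1, ("core1", 0))   # enters near the eviction end of the stack
```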
Citations: 328