
Latest publications: 2011 IEEE 17th International Symposium on High Performance Computer Architecture

Fg-STP: Fine-Grain Single Thread Partitioning on Multicores
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749713
Rakesh Ranjan, Fernando Latorre, P. Marcuello, Antonio González
Power and complexity issues have led the microprocessor industry to shift to Chip Multiprocessors (CMPs) in order to better utilize the additional transistors afforded by Moore's law. While parallel programs can take full advantage of these CMPs, single-threaded applications cannot benefit from them. In this paper we propose Fine-Grain Single-Thread Partitioning (Fg-STP), a hardware-only scheme that takes advantage of CMP designs to speed up single-threaded applications. Our proposal improves single-thread performance by reconfiguring two cores to collaborate on fetching and executing the instruction stream. These cores are essentially conventional out-of-order cores whose execution is orchestrated by dedicated hardware with minimal, localized impact on the original core design. The approach partitions code at instruction granularity and differs from previous proposals in its extensive use of dependence speculation, replication, and communication. These features are combined with the ability to look for parallelism over large instruction windows without any software intervention (no recompilation or profiling hints are needed). As a result, Fg-STP speeds up single-threaded execution by 18% and 7% on average over similar hardware-only approaches such as Core Fusion, on medium-sized and small-sized 2-core CMPs respectively, for the SPEC 2006 benchmarks.
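To make the instruction-granularity idea concrete, here is a minimal Python sketch of one plausible partitioning heuristic: place each instruction on the core that already holds most of its producers, and count the cross-core value communications that result. The function name, the greedy tie-breaking rule, and the toy dependence graph are all illustrative assumptions, not the paper's actual algorithm.

```python
# Toy sketch of instruction-granularity partitioning in the spirit of
# Fg-STP: split one thread's dependence graph across two cores and count
# the cross-core value forwardings the hardware would have to perform.
# The greedy heuristic and all names here are illustrative assumptions.

def partition_greedy(instrs, deps):
    """instrs: instruction ids in program order.
    deps: dict mapping an instruction to the instructions it consumes."""
    core_of = {}
    for i in instrs:
        # Place i on the core that already holds more of its producers,
        # so fewer values must be forwarded between cores.
        votes = [0, 0]
        for p in deps.get(i, []):
            votes[core_of[p]] += 1
        if votes[0] != votes[1]:
            core_of[i] = 0 if votes[0] > votes[1] else 1
        else:
            # Break ties by balancing instruction count across the cores.
            load0 = sum(1 for c in core_of.values() if c == 0)
            core_of[i] = 0 if load0 <= len(core_of) - load0 else 1
    comms = sum(1 for i in instrs for p in deps.get(i, [])
                if core_of[p] != core_of[i])
    return core_of, comms

if __name__ == "__main__":
    instrs = [1, 2, 3, 4, 5, 6]
    deps = {3: [1, 2], 4: [2], 5: [3, 4], 6: [5]}
    placement, communications = partition_greedy(instrs, deps)
    print(placement, communications)
```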
Citations: 7
Offline symbolic analysis to infer Total Store Order
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749743
Dongyoon Lee, Mahmoud H. Said, S. Narayanasamy, Z. Yang
The ability to record and replay an execution can significantly help programmers debug their programs, especially parallel programs. Deterministically replaying a multiprocessor's execution under a relaxed memory model has remained a challenging problem. This is an important problem, as most modern processors support only a relaxed memory model in order to enable many performance-critical optimizations. The most common consistency model implemented in processors is Total Store Order (TSO). We present an efficient, low-complexity processor-based solution for recording and replaying under the TSO memory model. The processor provides support for logging data fetched on cache misses. Using this information, each thread can be deterministically replayed. A TSO-compliant causal order between the shared-memory accesses executed in different threads is then inferred using an offline algorithm based on a Satisfiability Modulo Theories (SMT) solver. We also discuss methods to bound the search space during offline analysis and several optimizations that reduce the offline analysis time.
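As a rough illustration of the offline step, the sketch below brute-forces a TSO-compliant order for a tiny invented two-thread log instead of encoding constraints for an SMT solver, and it models only a simplified slice of TSO's ordering rules. The log format and helper names are assumptions made for this example.

```python
# Minimal stand-in for the paper's SMT-based offline step: search for a
# global order of logged operations in which every load returns the value
# the log recorded. A real implementation would encode these constraints
# for an SMT solver; brute force keeps the sketch self-contained.
from itertools import permutations

# (thread, index-in-thread, kind, addr, value); an invented two-thread log.
ops = [
    ("T0", 0, "st", "x", 1), ("T0", 1, "ld", "y", 0),
    ("T1", 0, "st", "y", 1), ("T1", 1, "ld", "x", 1),
]

def tso_program_order_ok(order):
    # Simplified TSO check: stores of a thread keep their relative order,
    # as do loads, while a load may bypass an earlier store (the one
    # relaxation TSO permits). Full TSO has more rules than this.
    for t in ("T0", "T1"):
        for kind in ("st", "ld"):
            idx = [o[1] for o in order if o[0] == t and o[2] == kind]
            if idx != sorted(idx):
                return False
    return True

def loads_see_logged_values(order):
    mem = {"x": 0, "y": 0}
    for t, i, kind, addr, val in order:
        if kind == "st":
            mem[addr] = val
        elif mem[addr] != val:
            return False
    return True

for cand in permutations(ops):
    if tso_program_order_ok(cand) and loads_see_logged_values(cand):
        print("TSO-compliant order:", [f"{t}.{i}" for t, i, *_ in cand])
        break
```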
Citations: 22
Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749723
D. Vantrease, Mikko H. Lipasti, N. Binkert
This paper advocates Atomic Coherence, a framework that simplifies cache coherence protocol specification, design, and verification by decoupling races from the protocol's operation. Atomic Coherence requires that conflicting coherence requests to the same address be serialized with a mutex before they are issued. Once issued, requests follow a predictable, race-free path. Because requests are guaranteed not to race, coherence protocols are simpler and protocol extensions are straightforward. Our implementation of Atomic Coherence uses optical mutexes, because optics provides very low latency. We begin with a state-of-the-art non-atomic MOEFSI protocol and demonstrate that an atomic implementation is much simpler while imposing less than a 2% performance penalty. We then show how, in the absence of races, it is easy to add support for speculative coherence and improve performance by up to 70%. Similar performance gains may be possible in a non-atomic protocol, but not without considerable effort in race management.
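The contract the paper describes can be phrased in a few lines of software: acquire a per-line mutex before issuing any coherence request for that line. The Python below is only a stand-in for the idea; the real design uses nanophotonic mutexes, and the 64-byte line size and function names are assumptions.

```python
# Software analogue of the Atomic Coherence contract: a requester must
# hold a per-address mutex before issuing a coherence request, so
# conflicting requests to the same line never race. threading.Lock is a
# stand-in for the paper's optical mutexes.
import threading
from collections import defaultdict

line_mutex = defaultdict(threading.Lock)    # one mutex per cache line
# (Lazy lock creation is itself not race-safe here; a real system would
#  pre-allocate or arbitrate mutexes in hardware.)

def issue_coherence_request(addr, action):
    line = addr // 64                        # assume 64-byte cache lines
    with line_mutex[line]:                   # serialize conflicting requests
        # Once the mutex is held, the request follows a predictable,
        # race-free path; no transient-race protocol states are needed.
        print(f"issue {action} for line {line}")

issue_coherence_request(0x1000, "GetM")
```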
Citations: 43
Bloom Filter Guided Transaction Scheduling
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749718
G. Blake, R. Dreslinski, T. Mudge
Contention management is an important design component of a transactional memory system. Without effective contention management to ensure forward progress, a transactional memory system can experience livelock, which is difficult to debug in parallel programs. Early work in contention management focused on heuristic managers that reacted to conflicts between transactions by picking the most appropriate transaction to abort. Such reactive methods allow conflicts to happen repeatedly, as they do not try to prevent future conflicts. These shortcomings of reactive contention managers have led to proposals that approach contention management as a scheduling problem: proactive managers. Proactive techniques range from throttling execution in predicted periods of high contention to preventing groups of transactions that are predicted to conflict from running concurrently. We propose a novel transaction scheduling scheme called Bloom Filter Guided Transaction Scheduling (BFGTS), which uses a combination of simple hardware and Bloom filter heuristics to guide scheduling decisions and provide improved performance in high-contention situations. We compare against two state-of-the-art transaction schedulers, Adaptive Transaction Scheduling and Proactive Transaction Scheduling, and show that BFGTS attains up to 4.6× and 1.7× improvements on high-contention benchmarks respectively. Across all benchmarks it shows average performance improvements of 35% and 25% respectively.
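To show the flavor of the Bloom-filter test such a scheduler might run, here is a small self-contained sketch: each running transaction publishes its read/write set as a Bloom filter, and a new transaction is dispatched only if its filter intersects none of them. The filter size, hash construction, and sample address sets are invented for illustration.

```python
# Minimal Bloom-filter conflict test in the spirit of BFGTS: predict a
# conflict when the new transaction's filter intersects the filter of
# any running transaction. Sizes and hash choice are illustrative.
import hashlib

M = 256  # bits per filter

def bloom(addresses, k=3):
    bits = 0
    for a in addresses:
        for i in range(k):
            h = int(hashlib.sha256(f"{i}:{a}".encode()).hexdigest(), 16)
            bits |= 1 << (h % M)
    return bits

def likely_conflicts(new_filter, running_filters):
    # A nonzero intersection means a possible (not certain) conflict:
    # Bloom filters admit false positives but no false negatives.
    return any(new_filter & f for f in running_filters)

running = [bloom({0x40, 0x80}), bloom({0x200})]
print(likely_conflicts(bloom({0x80, 0x1c0}), running))  # True: 0x80 shared
print(likely_conflicts(bloom({0x999}), running))        # likely False
```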
Citations: 30
Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749715
H. Ghasemi, S. Draper, N. Kim
To date, dynamic voltage/frequency scaling (DVFS) has been one of the most successful power-reduction techniques. However, ever-increasing process variability reduces the reliability of static random access memory (SRAM) at low voltages, which limits voltage scaling to a minimum operating voltage (VDDMIN). Larger SRAM cells, which are less sensitive to process variability, allow a lower VDDMIN. However, large-scale memory structures, e.g., the last-level cache (LLC), which often determines the VDDMIN of the processor, cannot afford to use such large SRAM cells due to die-area constraints. In this paper we propose low-voltage LLC architectures that exploit 1) the DVFS characteristics of workloads running on high-performance processors, 2) the trade-off between SRAM cell size and VDDMIN, and 3) the fact that at lower voltage/frequency operating states the negative performance impact of a smaller LLC capacity is reduced. Our proposed LLC architectures provide the same maximum performance and VDDMIN as the conventional architecture, while reducing total LLC cell area by 15%–19% with a negligible increase in average runtime.
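A back-of-the-envelope model makes the trade-off concrete: ways built from large, variation-tolerant cells remain usable at low voltage, while dense small-cell ways must be powered off below their higher VDDMIN. All voltages and capacities below are invented numbers, not figures from the paper.

```python
# Toy model of a heterogeneous-cell LLC: each way has its own VDDMIN,
# and only ways whose VDDMIN is at or below the current supply voltage
# stay enabled. All numbers are invented for illustration.
WAYS = [
    {"cell": "small", "vddmin": 0.85, "kb": 512},  # dense, needs high V
    {"cell": "small", "vddmin": 0.85, "kb": 512},
    {"cell": "large", "vddmin": 0.65, "kb": 256},  # robust, works at low V
    {"cell": "large", "vddmin": 0.65, "kb": 256},
]

def usable_llc_kb(vdd):
    return sum(w["kb"] for w in WAYS if vdd >= w["vddmin"])

for v in (0.95, 0.70):
    print(f"VDD={v:.2f} V -> {usable_llc_kb(v)} KB of LLC enabled")
# At low-voltage DVFS states the core is slow anyway, so the smaller
# effective LLC costs little performance -- the trade-off exploited here.
```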
Citations: 31
SolarCore: Solar energy driven multi-core architecture power management
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749729
Chao Li, Wangyuan Zhang, Chang-Burm Cho, Tao Li
The global energy crisis and environmental concerns (e.g., global warming) have driven the IT community into the green computing era. Of the clean, renewable energy sources, solar power is the most promising. While efforts have been made to improve performance-per-watt, conventional architecture power management schemes incur significant solar energy loss, since they are largely workload-driven and unaware of supply-side attributes. Existing solar power harvesting techniques improve energy utilization but increase the environmental burden and capital investment because they rely on large-scale batteries. Moreover, solar power harvesting by itself cannot guarantee high performance without appropriate load adaptation. To this end, we propose SolarCore, a solar energy driven multi-core architecture power management scheme that combines maximal power provisioning control with workload run-time optimization. Using real-world meteorological data across different geographic sites and seasons, we show that SolarCore autonomously achieves the optimal operating condition of solar panels (i.e., the maximum power point) under various environmental conditions, with a high green-energy utilization of 82% on average. We propose efficient heuristics for allocating the time-varying solar power across multiple cores; our algorithm further improves workload performance by 10.8% compared with round-robin adaptation, and by at least 43% compared with conventional fixed-power-budget control. This paper takes a first step toward maximally reducing the carbon footprint of computing systems through the use of renewable energy sources. We expect that the joint optimization techniques proposed in this paper will contribute to building a truly sustainable, high-performance computing environment.
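Two ingredients of such a controller can be sketched briefly: perturb-and-observe maximum-power-point tracking on the supply side, and a greedy allocation of the harvested budget across per-core DVFS levels on the load side. The panel curve, DVFS tables, and function names below are invented; the paper's actual control algorithms may differ.

```python
# Two pieces of a SolarCore-style control loop, with invented numbers:
# (1) perturb-and-observe maximum-power-point tracking of the panel, and
# (2) greedy allocation of the harvested budget across per-core DVFS points.
def p_and_o_mppt(panel_power, v=20.0, step=0.5, iters=30):
    """panel_power(v): measured panel power at operating voltage v."""
    last = panel_power(v)
    for _ in range(iters):
        v_new = v + step
        p_new = panel_power(v_new)
        if p_new < last:            # we passed the peak: reverse direction
            step = -step
        v, last = v_new, p_new
    return v, last

def allocate(budget_w, levels):
    """levels: per-core list of (power_w, perf) DVFS points, ascending."""
    alloc = [0] * len(levels)       # start every core at its lowest point
    spent = sum(levels[c][0][0] for c in range(len(levels)))
    upgraded = True
    while upgraded:                 # greedy round-robin upgrades
        upgraded = False
        for c in range(len(levels)):
            nxt = alloc[c] + 1
            if nxt < len(levels[c]):
                delta = levels[c][nxt][0] - levels[c][alloc[c]][0]
                if spent + delta <= budget_w:
                    alloc[c], spent, upgraded = nxt, spent + delta, True
    return alloc

# Invented panel curve with its peak (~40 W) at 28 V.
peak_v, peak_p = p_and_o_mppt(lambda v: -0.05 * (v - 28) ** 2 + 40)
print(round(peak_v, 1), round(peak_p, 1))               # ~28.0 ~40.0
print(allocate(peak_p, [[(5, 1.0), (9, 1.6), (14, 2.0)]] * 4))
```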
Citations: 94
Dynamically Specialized Datapaths for energy efficient computing
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749755
Venkatraman Govindaraju, C. Ho, K. Sankaralingam
Due to limits in technology scaling, the energy efficiency of logic devices is decreasing with successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general-purpose programmable processors. The key insights of this work are the following. First, applications execute in phases, and these phases can be determined by creating a path-tree of basic blocks rooted at the innermost loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industrial 55nm technology library shows that a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and map code to DySER, and we evaluate the PARSEC, SPEC, and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide a geometric mean speedup of 2.1× (1.15× to 10×) and a geometric mean energy reduction of 40% (up to 70%), or a 60% energy reduction if no performance improvement is required.
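The compiler-side step, identifying the hot path through an inner loop from profile data, can be illustrated with a toy example. The CFG paths, profile counts, and coverage threshold below are made up, and real path-tree construction is more involved than this.

```python
# Toy version of the compiler step described above: profile the paths of
# basic blocks taken through an inner loop's body and pick the hottest
# path as the candidate to map onto a DySER-style block. All data here
# is invented for illustration.
from collections import Counter

# Each profiled iteration records the sequence of basic blocks executed.
profiled_paths = (
    [("B1", "B2", "B4")] * 90 +       # hot path through the loop body
    [("B1", "B3", "B4")] * 10         # cold alternative
)

def hot_path(paths, coverage=0.8):
    counts = Counter(paths)
    path, n = counts.most_common(1)[0]
    if n / len(paths) >= coverage:    # specialize only if one path dominates
        return path
    return None

print(hot_path(profiled_paths))       # ('B1', 'B2', 'B4')
```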
Citations: 218
HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749720
Sanghoon Lee, Devesh Tiwari, Yan Solihin, James Tuck
Queues are commonly used in multithreaded programs for synchronization and communication. However, because software queues tend to be too expensive to support fine-grained parallelism, hardware queues have been proposed to reduce the overhead of communication between cores. Hardware queues require modifications to the processor core and need a custom interconnect. They also pose difficulties for the operating system, because their state must be preserved across context switches. To solve these problems, we propose a hardware-accelerated queue, or HAQu. HAQu adds hardware to a CMP that accelerates operations on software queues. Our design implements fast queueing through an application's address space, with operations that are compatible with a fully software queue. Our design provides accelerated and OS-transparent performance in three general ways: (1) it provides single instructions for enqueueing and dequeueing, which significantly reduces overhead when used in fine-grained threading; (2) operations on the queue are designed to leverage low-level details of the coherence protocol; and (3) hardware ensures that the full state of the queue is stored in the application's address space, thereby ensuring virtualization. We have evaluated our design in two application domains: offloading fine-grained checks for improved software reliability, and automatic fine-grained parallelization using decoupled software pipelining.
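The kind of software queue HAQu accelerates can be shown directly: a single-producer/single-consumer ring buffer whose entire state (buffer, head, tail) lives in the application's address space, which is what lets it survive context switches. This pure-software version is only a sketch of the queue format; HAQu would replace the enqueue/dequeue bodies below with single instructions.

```python
# Sketch of an in-memory software queue of the sort HAQu is described as
# accelerating: an SPSC ring buffer held entirely in application memory.
class SPSCQueue:
    def __init__(self, capacity=1024):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0            # next slot to read  (consumer-owned)
        self.tail = 0            # next slot to write (producer-owned)

    def enqueue(self, item):
        nxt = (self.tail + 1) % self.capacity
        if nxt == self.head:     # queue full
            return False
        self.buf[self.tail] = item
        self.tail = nxt
        return True

    def dequeue(self):
        if self.head == self.tail:   # queue empty
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        return item

q = SPSCQueue(4)
for x in (1, 2, 3):
    q.enqueue(x)
print(q.dequeue(), q.dequeue())      # 1 2
```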
Citations: 30
Calvin: Deterministic or not? Free will to choose
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749741
Derek Hower, P. Dudnik, M. Hill, D. Wood
Most shared memory systems maximize performance by unpredictably resolving memory races. Unpredictable memory races can lead to nondeterminism in parallel programs, which can suffer from hard-to-reproduce heisenbugs. We introduce Calvin, a shared memory model capable of executing in a conventional nondeterministic mode when performance is paramount and in a deterministic mode when execution repeatability is important. Unlike prior hardware proposals for deterministic execution, Calvin exploits the flexibility of a memory consistency model weaker than sequential consistency. Specifically, Calvin logically orders memory operations into strata that are compatible with Total Store Order (TSO). Calvin is also designed with the needs of future power-aware processors in mind and does not require any speculation support. We develop a Calvin-MIST implementation that uses an unordered coalescing write cache, a multiple-writer coherence protocol, and delayed (timebomb) invalidations while maintaining TSO compatibility. Results show that Calvin-MIST can execute workloads in conventional mode at speeds comparable to a conventional system (providing compatibility), or execute deterministically with a modest average slowdown of less than 20% (when determinism is valued).
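One way to picture the deterministic mode is as stratum-by-stratum commit: within each stratum, every thread's buffered stores commit in a fixed thread-id order, so the global memory order is identical on every run. The sketch below is a loose reading of that idea with invented data, not Calvin's actual commit logic.

```python
# Sketch of deterministic stratum commit: execution proceeds in strata,
# and within a stratum each thread's buffered stores commit in a fixed
# (thread-id) order, making the memory order repeatable across runs.
# The stratum contents here are invented.
def commit_strata(strata, memory):
    """strata: list of {thread_id: [(addr, value), ...]} dicts."""
    for stratum in strata:
        for tid in sorted(stratum):          # deterministic thread order
            for addr, value in stratum[tid]: # program order within thread
                memory[addr] = value
    return memory

mem = {}
strata = [
    {1: [("x", 10)], 0: [("x", 5), ("y", 1)]},   # stratum 0
    {0: [("y", 2)]},                              # stratum 1
]
print(commit_strata(strata, mem))   # {'x': 10, 'y': 2} -- on every run
```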
Citations: 74
A case for guarded power gating for multi-core processors
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749737
Niti Madan, A. Buyuktosunoglu, P. Bose, M. Annavaram
Dynamic power management has become an essential part of multi-core processors and associated systems. Dedicated controllers with embedded power management firmware are now an integral part of the design of such multi-core server systems. Devising a robust power management policy that delivers the system's intended functionality across a diverse range of workloads remains a key challenge. One of the primary concerns in architecting a power management policy is performance degradation beyond a specified limit; a secondary concern is negative power savings. Guarding against such "holes" in the management policy is crucial to ensure successful deployment and use in real customer environments. It is also important to focus on developing new models, and on addressing the limitations of current modeling infrastructure, when analyzing alternative management policies during the design of modern multi-core systems. In this concept paper, we highlight these specific challenges faced today by the server chip and system design industry in the area of power management.
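One plausible reading of the "guards" the paper argues for can be sketched as two checks: gate a core only when the predicted idle interval exceeds the break-even time (preventing negative savings), and disable gating entirely when the observed slowdown crosses the allowed limit. The thresholds, class, and method names below are invented for illustration.

```python
# Hypothetical guard logic for power gating, with invented numbers: one
# guard blocks gating on short idle periods (where energy overhead would
# exceed savings), the other backs off when slowdown exceeds the cap.
BREAK_EVEN_US = 50      # gate/ungate energy overhead, expressed as time
SLOWDOWN_CAP = 0.05     # at most 5% performance loss permitted

class GuardedGating:
    def __init__(self):
        self.enabled = True

    def should_gate(self, predicted_idle_us):
        # Guard 1: avoid negative savings on short idle periods.
        return self.enabled and predicted_idle_us > BREAK_EVEN_US

    def observe(self, baseline_time, gated_time):
        # Guard 2: if gating hurts beyond the cap, stop gating.
        if gated_time > baseline_time * (1 + SLOWDOWN_CAP):
            self.enabled = False

g = GuardedGating()
print(g.should_gate(20), g.should_gate(200))    # False True
g.observe(baseline_time=1.00, gated_time=1.08)  # 8% slowdown observed
print(g.should_gate(200))                       # False: guard tripped
```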
Citations: 75