
Latest publications from the 2011 IEEE 17th International Symposium on High Performance Computer Architecture

Dynamically Specialized Datapaths for energy efficient computing
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749755
Venkatraman Govindaraju, C. Ho, K. Sankaralingam
Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general purpose programmable processors. The key insights of this work are the following. First, applications execute in phases and these phases can be determined by creating a path-tree of basic-blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and map code to DySER, and evaluate the PARSEC, SPEC and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide a geometric mean speedup of 2.1X (1.15X to 10X) and a geometric mean energy reduction of 40% (up to 70%), or a 60% energy reduction if no performance improvement is required.
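The path-tree idea above can be illustrated in a few lines: starting at an inner-most loop header, enumerate every acyclic path of basic blocks until a back edge closes the loop. A minimal sketch, assuming a toy CFG given as an adjacency dict of block names (the real analysis lives in a GCC pass; nothing here is the paper's actual code):

```python
def build_path_tree(cfg, root):
    """Return the path-tree rooted at `root`: every acyclic path of
    basic blocks starting at the loop header, as a nested dict."""
    def expand(block, visited):
        children = {}
        for succ in cfg.get(block, []):
            if succ in visited:          # back edge: the loop closes here
                continue
            children[succ] = expand(succ, visited | {succ})
        return children
    return {root: expand(root, {root})}

# Toy inner loop: header A branches to B or C, both rejoin at D,
# and D loops back to A (the back edge terminates each path).
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
tree = build_path_tree(cfg, "A")
assert tree == {"A": {"B": {"D": {}}, "C": {"D": {}}}}
```

Each root-to-leaf path in the resulting tree is one candidate phase that could be mapped onto a DySER block.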
Citations: 218
HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749720
Sanghoon Lee, Devesh Tiwari, Yan Solihin, James Tuck
Queues are commonly used in multithreaded programs for synchronization and communication. However, because software queues tend to be too expensive to support fine-grained parallelism, hardware queues have been proposed to reduce the overhead of communication between cores. Hardware queues require modifications to the processor core and need a custom interconnect. They also pose difficulties for the operating system because their state must be preserved across context switches. To solve these problems, we propose a hardware-accelerated queue, or HAQu. HAQu adds hardware to a CMP that accelerates operations on software queues. Our design implements fast queueing through an application's address space with operations that are compatible with a fully software queue. Our design provides accelerated and OS-transparent performance in three general ways: (1) it provides a single instruction for enqueueing and dequeueing, which significantly reduces the overhead when used in fine-grained threading; (2) operations on the queue are designed to leverage low-level details of the coherence protocol; and (3) hardware ensures that the full state of the queue is stored in the application's address space, thereby ensuring virtualization. We have evaluated our design in the context of two application domains: offloading fine-grained checks for improved software reliability, and automatic, fine-grained parallelization using decoupled software pipelining.
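For contrast, the software baseline HAQu accelerates looks roughly like this: a single-producer/single-consumer ring queue living entirely in the application's address space, where every enqueue and dequeue costs several instructions instead of one. A hedged sketch with illustrative names, not the paper's code:

```python
class SPSCQueue:
    """Single-producer/single-consumer ring queue in plain memory."""
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0   # next slot to dequeue (consumer-owned)
        self.tail = 0   # next slot to enqueue (producer-owned)

    def enqueue(self, item):
        if self.tail - self.head == self.capacity:
            return False                       # full: caller must retry
        self.buf[self.tail % self.capacity] = item
        self.tail += 1                         # publish after the write
        return True

    def dequeue(self):
        if self.head == self.tail:
            return None                        # empty
        item = self.buf[self.head % self.capacity]
        self.head += 1
        return item

q = SPSCQueue(4)
for i in range(4):
    q.enqueue(i)
assert not q.enqueue(99)       # full, enqueue rejected
assert q.dequeue() == 0        # FIFO order preserved
```

Because all queue state (buffer, head, tail) lives in ordinary application memory, it survives context switches for free, which is exactly the virtualization property HAQu preserves while collapsing each operation to one instruction.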
Citations: 30
Data-triggered threads: Eliminating redundant computation
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749727
Hung-Wei Tseng, D. Tullsen
This paper introduces the concept of data-triggered threads. Unlike threads in conventional parallel programming models, these threads are initiated on a change to a memory location. This enables increased parallelism and the elimination of redundant, unnecessary computation. This paper focuses primarily on the latter. It is shown that 78% of all loads fetch redundant data, leading to a high incidence of redundant computation. By expressing computation through data-triggered threads, that computation is executed once when the data changes, and is skipped whenever the data does not change. The C SPEC benchmarks show performance speedups of up to 5.9X, averaging 46%.
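The skip-on-unchanged-data behavior can be sketched in software: a computation is registered against a location and re-runs only when a store actually changes the value there, while silent stores skip it entirely. A minimal illustration (the class and method names, and the dict-based "memory", are assumptions for the sketch, not the paper's mechanism):

```python
class DataTriggered:
    def __init__(self):
        self.mem = {}
        self.threads = {}    # address -> registered thread function
        self.runs = 0        # how many times any thread actually ran

    def attach(self, addr, fn):
        self.threads[addr] = fn

    def store(self, addr, value):
        if self.mem.get(addr) == value:
            return                       # silent store: skip computation
        self.mem[addr] = value
        if addr in self.threads:
            self.runs += 1
            self.threads[addr](value)    # data changed: trigger thread

dt = DataTriggered()
results = []
dt.attach("x", lambda v: results.append(v * v))
dt.store("x", 3)   # value changed: thread runs
dt.store("x", 3)   # same value: redundant, skipped
dt.store("x", 4)   # value changed: thread runs again
assert results == [9, 16] and dt.runs == 2
```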
Citations: 34
CHIPPER: A low-complexity bufferless deflection router
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749724
Chris Fallin, Chris Craik, O. Mutlu
As Chip Multiprocessors (CMPs) scale to tens or hundreds of nodes, the interconnect becomes a significant factor in cost, energy consumption and performance. Recent work has explored many design tradeoffs for networks-on-chip (NoCs) with novel router architectures to reduce hardware cost. In particular, recent work proposes bufferless deflection routing to eliminate router buffers. The high cost of buffers makes this choice potentially appealing, especially for low-to-medium network loads. However, current bufferless designs usually add complexity to control logic. Deflection routing introduces a sequential dependence in port allocation, yielding a slow critical path. Explicit mechanisms are required for livelock freedom due to the non-minimal nature of deflection. Finally, deflection routing can fragment packets, and the reassembly buffers require large worst-case sizing to avoid deadlock, due to the lack of network backpressure. The complexity that arises out of these three problems has discouraged practical adoption of bufferless routing. To counter this, we propose CHIPPER (Cheap-Interconnect Partially Permuting Router), a simplified router microarchitecture that eliminates in-router buffers and the crossbar. We introduce three key insights: first, that deflection routing port allocation maps naturally to a permutation network within the router; second, that livelock freedom requires only an implicit token-passing scheme, eliminating expensive age-based priorities; and finally, that flow control can provide correctness in the absence of network backpressure, avoiding deadlock and allowing cache miss buffers (MSHRs) to be used as reassembly buffers. 
Using multiprogrammed SPEC CPU2006, server, and desktop application workloads and SPLASH-2 multithreaded workloads, we achieve an average 54.9% network power reduction for 13.6% average performance degradation (multiprogrammed) and 73.4% power reduction for 1.9% slowdown (multithreaded), with minimal degradation and large power savings at low-to-medium load. Finally, we show 36.2% router area reduction relative to buffered routing, with comparable timing.
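The first insight, that deflection port allocation maps onto a permutation network, can be sketched with a single 2x2 arbiter stage: each flit is steered toward the port it wants, and the loser of a conflict is deflected to the other port rather than buffered or dropped. A toy model under assumed conventions (a flit is `(id, wanted_port)`; the priority rule here is a placeholder for CHIPPER's implicit token passing):

```python
def arbiter_2x2(flit_a, flit_b):
    """One permutation-network stage: return the (port0, port1)
    assignment. On a conflict, flit_a wins its wanted port and
    flit_b is deflected to the other one (bufferless: every
    incoming flit always leaves on some port)."""
    out = [None, None]
    if flit_a is None and flit_b is None:
        return (None, None)
    if flit_b is None:
        out[flit_a[1]] = flit_a
        return tuple(out)
    if flit_a is None:
        out[flit_b[1]] = flit_b
        return tuple(out)
    if flit_a[1] != flit_b[1]:
        out[flit_a[1]] = flit_a          # no conflict: both satisfied
        out[flit_b[1]] = flit_b
        return tuple(out)
    out[flit_a[1]] = flit_a              # conflict: a wins...
    out[1 - flit_a[1]] = flit_b          # ...b is deflected
    return tuple(out)

p0, p1 = arbiter_2x2(("A", 0), ("B", 0))   # both want port 0
assert p0 == ("A", 0) and p1 == ("B", 0)   # B deflected, nothing buffered
```

Because each stage always emits exactly as many flits as it receives, port allocation needs no sequential loop over ports, which is what removes the slow critical path of prior bufferless designs.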
Citations: 224
CloudCache: Expanding and shrinking private caches
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749731
Hyunjin Lee, Sangyeun Cho, B. Childers
The number of cores in a single chip multiprocessor is expected to grow in coming years. Likewise, aggregate on-chip cache capacity is increasing fast and its effective utilization is becoming ever more important. Furthermore, available cores are expected to be underutilized due to the power wall and highly heterogeneous future workloads. This trend makes existing L2 cache management techniques less effective for two problems: increased capacity interference between working cores and longer L2 access latency. We propose a novel scalable cache management framework called CloudCache that creates dynamically expanding and shrinking L2 caches for working threads with fine-grained hardware monitoring and control. The key architectural components of CloudCache are L2 cache chaining, inter- and intra-bank cache partitioning, and a performance-optimized coherence protocol. Our extensive experimental evaluation demonstrates that CloudCache significantly improves performance of a wide range of workloads when all or a subset of cores are occupied.
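One possible flavor of the expand/shrink step, sketched as a rebalancing pass over per-thread L2 bank allocations: over-provisioned threads shrink to their measured demand, and the freed banks grow the neediest threads within the chip's bank budget. The policy below is an illustrative guess for exposition, not CloudCache's actual control algorithm:

```python
def rebalance(alloc, demand, total_banks):
    """alloc/demand: thread -> currently held banks / measured demand.
    Shrink over-provisioned threads first, then grow under-provisioned
    ones with the freed budget, never exceeding total_banks."""
    for t in alloc:
        if alloc[t] > demand[t]:
            alloc[t] = demand[t]                 # shrink: release banks
    spare = total_banks - sum(alloc.values())
    # Grow threads in order of largest unmet demand first.
    for t in sorted(alloc, key=lambda t: demand[t] - alloc[t], reverse=True):
        give = min(demand[t] - alloc[t], spare)
        alloc[t] += give
        spare -= give
    return alloc

# Thread t0 stopped using its capacity; t1's working set grew.
alloc = rebalance({"t0": 4, "t1": 4}, {"t0": 2, "t1": 6}, total_banks=8)
assert alloc == {"t0": 2, "t1": 6}
```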
Citations: 80
Relaxing non-volatility for fast and energy-efficient STT-RAM caches
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749716
Clinton Wills Smullen IV, Vidyabhushan Mohan, Anurag Nigam, S. Gurumurthi, M. Stan
Spin-Transfer Torque RAM (STT-RAM) is an emerging non-volatile memory technology and a potential universal memory that could replace SRAM in processor caches. This paper presents a novel approach for redesigning STT-RAM memory cells to reduce the high dynamic energy and slow write latencies. We lower the retention time by reducing the planar area of the cell, thereby reducing the write current, which we then use with CACTI to design caches and memories. We simulate quad-core processor designs using a combination of SRAM- and STT-RAM-based caches. Since ultra-low-retention STT-RAM may lose data, we also provide a preliminary evaluation of a simple, DRAM-style refresh policy. We found that a pure STT-RAM cache hierarchy provides the best energy efficiency, though a hybrid design of SRAM-based L1 caches with reduced-retention STT-RAM L2 and L3 caches eliminates performance loss while still reducing the energy-delay product by more than 70%.
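The retention trade-off admits a quick back-of-the-envelope check: once a cell's retention time drops below the lifetime of the data it holds, a conservative DRAM-style policy must sweep the cache once per retention window. A sketch with illustrative numbers (not taken from the paper):

```python
import math

def refreshes_needed(runtime_s, retention_s):
    """How many full-cache refresh passes a conservative DRAM-style
    policy performs over a run: one pass per retention window."""
    if retention_s >= runtime_s:
        return 0                 # data never outlives its retention
    return math.ceil(runtime_s / retention_s)

# A 10 ms retention cell over a 1 s window needs 100 refresh passes;
# an effectively non-volatile cell needs none.
assert refreshes_needed(1.0, 0.010) == 100
assert refreshes_needed(1.0, 10.0) == 0
```

The design question the paper evaluates is whether this refresh energy stays small enough that the faster, lower-current writes of a reduced-retention cell still win overall.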
Citations: 411
Power shifting in Thrifty Interconnection Network
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749725
Jian Li, Wei Huang, C. Lefurgy, Lixin Zhang, W. Denzel, Richard R. Treumann, Kun Wang
This paper presents two complementary techniques to manage the power consumption of large-scale systems with a packet-switched interconnection network. First, we propose the Thrifty Interconnection Network (TIN), where network links are activated and de-activated dynamically with little or no overhead, using inherent system events to trigger link activation or de-activation in a timely manner. Second, we propose Network Power Shifting (NPS), which dynamically shifts the power budget between the compute nodes and their corresponding network components. TIN activates and trains the links in the interconnection network just in time, before network communication is about to happen, and thriftily puts them into a low-power mode when communication is finished, hence reducing unnecessary network power consumption. Furthermore, the compute nodes can absorb the extra power budget shifted from their attached network components and increase their processor frequency for higher performance with NPS. Our simulation results on a set of real-world workload traces show that TIN can achieve on average a 60% network power reduction with the support of only one low-power mode. When NPS is enabled, the two together can achieve a 12% application performance improvement and a 13% overall system energy reduction. Further performance improvement is possible if the compute nodes can speed up more and fully utilize the extra power budget reinvested from the thrifty network with more aggressive cooling support.
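The NPS idea reduces to a small budget calculation: the watts an idled network no longer draws are reclaimed into the node's core budget and converted into extra frequency. A sketch assuming a linear power/frequency model; the function name and all numbers are illustrative assumptions, not the paper's model:

```python
def shift_budget(net_w, net_active_w, watts_per_ghz, base_ghz):
    """Return the boosted core frequency after reclaiming idle network
    power. net_w is the provisioned network power budget; net_active_w
    is what the (mostly idle) network actually needs right now."""
    reclaimed = max(0.0, net_w - net_active_w)   # watts freed by TIN
    return base_ghz + reclaimed / watts_per_ghz  # linear boost model

# 10 W network budget with links idled down to 4 W frees 6 W; at an
# assumed 3 W/GHz that is a 2 GHz boost over the 2 GHz base clock.
assert shift_budget(10.0, 4.0, 3.0, 2.0) == 4.0
```

In the real system the boost would also be capped by the node's thermal and voltage limits, which is why the paper notes that more aggressive cooling unlocks further gains.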
Citations: 34
MOPED: Orchestrating interprocess message data on CMPs
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749721
Junli Gu, S. Lumetta, Rakesh Kumar, Yihe Sun
Future CMPs will combine many simple cores with deep cache hierarchies. With more cores, cache resources per core are fewer, and must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit motion of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency and improve cache behavior. However, to make these models attractive, synchronization overhead and data copying must also be offloaded from the processors. In this paper, we describe a Message Orchestration and Performance Enhancement Device (MOPED) that provides hardware mechanisms to support state-of-the-art message passing protocols such as MPI. MOPED extends the per-processor cache controllers and coherence protocol to support message synchronization and management in hardware, to transfer message data efficiently without intermediate buffer copies, and to place useful data in caches in a timely manner. MOPED thus allows full overlap between communication and computation on the cores. We extended a 16-core full-system simulator based on Simics and FeS2. MOPED interacts with the directory controllers to orchestrate message data. We evaluated benefits to performance and coherence traffic by integrating MOPED into the MPICH runtime. Relative to unmodified MPI execution, MOPED reduces execution time of real applications (NAS Parallel Benchmarks) by 17–45% and of communication microbenchmarks (Intel's IMB) by 76–94%. Off-chip memory misses are reduced by 43–88% for applications and by 75–100% for microbenchmarks.
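The copy avoidance MOPED provides in hardware can be contrasted in plain software terms: a copying channel pays one copy on send and another on receive, while a zero-copy channel hands the receiver a view of the sender's buffer in the shared address space. A pure-software illustration only; MOPED does this in the cache hierarchy, not with Python objects:

```python
def send_with_copy(payload, channel):
    channel.append(bytes(payload))         # one copy into the channel

def recv_with_copy(channel):
    return bytearray(channel.pop(0))       # and another copy out

def send_zero_copy(payload, channel):
    channel.append(memoryview(payload))    # hand over a view, no copy

def recv_zero_copy(channel):
    return channel.pop(0)

buf = bytearray(b"hello")
ch = []
send_zero_copy(buf, ch)
view = recv_zero_copy(ch)
buf[0] = ord("H")                          # sender's buffer mutated...
assert bytes(view) == b"Hello"             # ...receiver sees same storage
```

The hardware analogue is stronger still: MOPED also moves the data into the receiver's cache ahead of use, so the consumer avoids both the copy and the miss.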
Citations: 3
Keynote address II: How's the parallel computing revolution going? 主题演讲二:并行计算革命进展如何?
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749730
K. McKinley
Two trends changed the computing landscape over the past decade: (1) hardware vendors started delivering chip multiprocessors (CMPs) instead of uniprocessors, and (2) software developers increasingly chose managed languages instead of native languages. Unfortunately, the former change is disrupting the virtuous-cycle between performance improvements and software innovation. Establishing a new parallel performance virtuous cycle for managed languages will require scalable applications executing on scalable Virtual Machine (VM) services, since the VM schedules, monitors, compiles, optimizes, garbage collects, and executes together with the application. This talk describes current progress, opportunities, and challenges for scalable VM services. The parallel computing revolution urgently needs more innovations.
在过去的十年中,有两个趋势改变了计算领域:(1)硬件供应商开始提供芯片多处理器(cmp)而不是单处理器,(2)软件开发人员越来越多地选择托管语言而不是本地语言。不幸的是,前一种变化正在破坏性能改进和软件创新之间的良性循环。为托管语言建立新的并行性能良性循环将需要在可伸缩的虚拟机(VM)服务上执行可伸缩的应用程序,因为VM与应用程序一起调度、监视、编译、优化、垃圾收集和执行。本次演讲描述了可扩展虚拟机服务的当前进展、机遇和挑战。并行计算革命迫切需要更多的创新。
{"title":"Keynote address II: How's the parallel computing revolution going?","authors":"K. McKinley","doi":"10.1109/HPCA.2011.5749730","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749730","url":null,"abstract":"Two trends changed the computing landscape over the past decade: (1) hardware vendors started delivering chip multiprocessors (CMPs) instead of uniprocessors, and (2) software developers increasingly chose managed languages instead of native languages. Unfortunately, the former change is disrupting the virtuous-cycle between performance improvements and software innovation. Establishing a new parallel performance virtuous cycle for managed languages will require scalable applications executing on scalable Virtual Machine (VM) services, since the VM schedules, monitors, compiles, optimizes, garbage collects, and executes together with the application. This talk describes current progress, opportunities, and challenges for scalable VM services. The parallel computing revolution urgently needs more innovations.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114538613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Abstraction and microarchitecture scaling in early-stage power modeling 早期功率建模中的抽象和微架构缩放
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749746
H. Jacobson, A. Buyuktosunoglu, P. Bose, E. Acar, R. Eickemeyer
Early-stage, microarchitecture-level power modeling methodologies have been used in industry and academic research for a decade (or more). Such methods use cycle-accurate performance simulators and deduce active power based on utilization markers. A key question faced in this context is: what key utilization metrics to monitor, and how many are needed for accuracy? Is there a systematic way to select the “best” markers? We also pose a key follow-on question: is it possible to perform accurate scaling of an abstracted model to enable exploration of new microarchitecture features? In this paper, we address these particular questions and examine the results for a range of abstraction levels. We highlight innovative insights for intelligent abstraction and microarchitecture scaling, and point out the pitfalls of abstractions that are not based on a systematic methodology or sound theory.
早期阶段的微架构级功率建模方法已经在工业和学术研究中使用了十年(或更长时间)。这种方法使用周期精确的性能模拟器,并根据利用率标记推断有功功率。在此上下文中面临的一个关键问题是:要监视哪些关键利用率指标,以及需要多少个指标才能达到准确性?是否有一个系统的方法来选择“最好”的标记?我们还提出了一个关键的后续问题:是否有可能对抽象模型进行精确缩放,从而探索新的微架构特性?在本文中,我们解决了这些特定的问题,并检查了一系列抽象级别的结果。我们强调了智能抽象和微架构扩展的创新见解,并指出了不基于系统方法或可靠理论的抽象的陷阱。
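Utilization-marker power models of the kind this paper abstracts are commonly cast as a linear form, P ≈ C₀ + Σᵢ wᵢ·uᵢ, where C₀ is the base (idle) power and each wᵢ is a per-marker weight fitted from training runs. A minimal sketch under that assumption follows; the markers, weights, and data are synthetic illustrations, not taken from the paper:

```python
import numpy as np

# Hypothetical linear power model: P = base + sum(w_i * utilization_i).
# Synthetic "training runs": each row is one run, each column a candidate
# utilization marker (e.g. issue-slot or cache-port activity counts).
rng = np.random.default_rng(0)
util = rng.random((50, 3))                  # 50 runs, 3 candidate markers
true_w = np.array([5.0, 2.0, 0.1])          # watts per unit utilization
base = 10.0                                 # idle power in watts
power = base + util @ true_w                # "measured" power (noise-free)

# Fit the weights with least squares; X gets a constant column so the
# base power is estimated along with the marker weights.
X = np.hstack([np.ones((50, 1)), util])
coef, *_ = np.linalg.lstsq(X, power, rcond=None)
print(coef)   # ~ [10.0, 5.0, 2.0, 0.1]
```

Markers whose fitted weight is near zero (the third one here) add little accuracy, which is one systematic way to decide which markers an abstracted model can drop.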
{"title":"Abstraction and microarchitecture scaling in early-stage power modeling","authors":"H. Jacobson, A. Buyuktosunoglu, P. Bose, E. Acar, R. Eickemeyer","doi":"10.1109/HPCA.2011.5749746","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749746","url":null,"abstract":"Early-stage, microarchitecture-level power modeling methodologies have been used in industry and academic research for a decade (or more). Such methods use cycle-accurate performance simulators and deduce active power based on utilization markers. A key question faced in this context is: what key utilization metrics to monitor, and how many are needed for accuracy? Is there a systematic way to select the “best” markers? We also pose a key follow-on question: is it possible to perform accurate scaling of an abstracted model to enable exploration of new microarchitecture features? In this paper, we address these particular questions and examine the results for a range of abstraction levels. We highlight innovative insights for intelligent abstraction and microarchitecture scaling, and point out the pitfalls of abstractions that are not based on a systematic methodology or sound theory.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"343 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134158605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 39
Journal
2011 IEEE 17th International Symposium on High Performance Computer Architecture