
Latest publications from the 2011 IEEE 17th International Symposium on High Performance Computer Architecture

Dynamically Specialized Datapaths for energy efficient computing
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749755
Venkatraman Govindaraju, C. Ho, K. Sankaralingam
Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general purpose programmable processors. The key insights of this work are the following. First, applications execute in phases and these phases can be determined by creating a path-tree of basic-blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and map code to DySER, and evaluate the PARSEC, SPEC and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide a geometric mean speedup of 2.1X (1.15X to 10X) and a geometric mean energy reduction of 40% (up to 70%), or a 60% energy reduction if no performance improvement is required.
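The path-tree idea above can be illustrated in a few lines: starting at an inner-most loop header, enumerate every acyclic path of basic blocks until a back edge closes the loop. A minimal sketch, assuming a toy CFG given as an adjacency dict of block names (the real analysis lives in a GCC pass; nothing here is the paper's actual code):

```python
def build_path_tree(cfg, root):
    """Return the path-tree rooted at `root`: every acyclic path of
    basic blocks starting at the loop header, as a nested dict."""
    def expand(block, visited):
        children = {}
        for succ in cfg.get(block, []):
            if succ in visited:          # back edge: the loop closes here
                continue
            children[succ] = expand(succ, visited | {succ})
        return children
    return {root: expand(root, {root})}

# Toy inner loop: header A branches to B or C, both rejoin at D,
# and D loops back to A (the back edge terminates each path).
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
tree = build_path_tree(cfg, "A")
assert tree == {"A": {"B": {"D": {}}, "C": {"D": {}}}}
```

Each root-to-leaf path in the resulting tree is one candidate phase that could be mapped onto a DySER block.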
Citations: 218
HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749720
Sanghoon Lee, Devesh Tiwari, Yan Solihin, James Tuck
Queues are commonly used in multithreaded programs for synchronization and communication. However, because software queues tend to be too expensive to support fine-grained parallelism, hardware queues have been proposed to reduce the overhead of communication between cores. Hardware queues require modifications to the processor core and need a custom interconnect. They also pose difficulties for the operating system because their state must be preserved across context switches. To solve these problems, we propose a hardware-accelerated queue, or HAQu. HAQu adds hardware to a CMP that accelerates operations on software queues. Our design implements fast queueing through an application's address space with operations that are compatible with a fully software queue. Our design provides accelerated and OS-transparent performance in three general ways: (1) it provides a single instruction for enqueueing and dequeueing, which significantly reduces the overhead when used in fine-grained threading; (2) operations on the queue are designed to leverage low-level details of the coherence protocol; and (3) hardware ensures that the full state of the queue is stored in the application's address space, thereby ensuring virtualization. We have evaluated our design in the context of two application domains: offloading fine-grained checks for improved software reliability, and automatic, fine-grained parallelization using decoupled software pipelining.
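For contrast, the software baseline HAQu accelerates looks roughly like this: a single-producer/single-consumer ring queue living entirely in the application's address space, where every enqueue and dequeue costs several instructions instead of one. A hedged sketch with illustrative names, not the paper's code:

```python
class SPSCQueue:
    """Single-producer/single-consumer ring queue in plain memory."""
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0   # next slot to dequeue (consumer-owned)
        self.tail = 0   # next slot to enqueue (producer-owned)

    def enqueue(self, item):
        if self.tail - self.head == self.capacity:
            return False                       # full: caller must retry
        self.buf[self.tail % self.capacity] = item
        self.tail += 1                         # publish after the write
        return True

    def dequeue(self):
        if self.head == self.tail:
            return None                        # empty
        item = self.buf[self.head % self.capacity]
        self.head += 1
        return item

q = SPSCQueue(4)
for i in range(4):
    q.enqueue(i)
assert not q.enqueue(99)       # full, enqueue rejected
assert q.dequeue() == 0        # FIFO order preserved
```

Because all queue state (buffer, head, tail) lives in ordinary application memory, it survives context switches for free, which is exactly the virtualization property HAQu preserves while collapsing each operation to one instruction.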
Citations: 30
Data-triggered threads: Eliminating redundant computation
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749727
Hung-Wei Tseng, D. Tullsen
This paper introduces the concept of data-triggered threads. Unlike threads in conventional parallel programming models, these threads are initiated on a change to a memory location. This enables increased parallelism and the elimination of redundant, unnecessary computation. This paper focuses primarily on the latter. It is shown that 78% of all loads fetch redundant data, leading to a high incidence of redundant computation. By expressing computation through data-triggered threads, that computation is executed once when the data changes, and is skipped whenever the data does not change. The C SPEC benchmarks show performance speedups of up to 5.9X, averaging 46%.
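The skip-on-unchanged-data behavior can be sketched in software: a computation is registered against a location and re-runs only when a store actually changes the value there, while silent stores skip it entirely. A minimal illustration (the class and method names, and the dict-based "memory", are assumptions for the sketch, not the paper's mechanism):

```python
class DataTriggered:
    def __init__(self):
        self.mem = {}
        self.threads = {}    # address -> registered thread function
        self.runs = 0        # how many times any thread actually ran

    def attach(self, addr, fn):
        self.threads[addr] = fn

    def store(self, addr, value):
        if self.mem.get(addr) == value:
            return                       # silent store: skip computation
        self.mem[addr] = value
        if addr in self.threads:
            self.runs += 1
            self.threads[addr](value)    # data changed: trigger thread

dt = DataTriggered()
results = []
dt.attach("x", lambda v: results.append(v * v))
dt.store("x", 3)   # value changed: thread runs
dt.store("x", 3)   # same value: redundant, skipped
dt.store("x", 4)   # value changed: thread runs again
assert results == [9, 16] and dt.runs == 2
```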
Citations: 34
CHIPPER: A low-complexity bufferless deflection router
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749724
Chris Fallin, Chris Craik, O. Mutlu
As Chip Multiprocessors (CMPs) scale to tens or hundreds of nodes, the interconnect becomes a significant factor in cost, energy consumption and performance. Recent work has explored many design tradeoffs for networks-on-chip (NoCs) with novel router architectures to reduce hardware cost. In particular, recent work proposes bufferless deflection routing to eliminate router buffers. The high cost of buffers makes this choice potentially appealing, especially for low-to-medium network loads. However, current bufferless designs usually add complexity to control logic. Deflection routing introduces a sequential dependence in port allocation, yielding a slow critical path. Explicit mechanisms are required for livelock freedom due to the non-minimal nature of deflection. Finally, deflection routing can fragment packets, and the reassembly buffers require large worst-case sizing to avoid deadlock, due to the lack of network backpressure. The complexity that arises out of these three problems has discouraged practical adoption of bufferless routing. To counter this, we propose CHIPPER (Cheap-Interconnect Partially Permuting Router), a simplified router microarchitecture that eliminates in-router buffers and the crossbar. We introduce three key insights: first, that deflection routing port allocation maps naturally to a permutation network within the router; second, that livelock freedom requires only an implicit token-passing scheme, eliminating expensive age-based priorities; and finally, that flow control can provide correctness in the absence of network backpressure, avoiding deadlock and allowing cache miss buffers (MSHRs) to be used as reassembly buffers. 
Using multiprogrammed SPEC CPU2006, server, and desktop application workloads and SPLASH-2 multithreaded workloads, we achieve an average 54.9% network power reduction for 13.6% average performance degradation (multiprogrammed) and 73.4% power reduction for 1.9% slowdown (multithreaded), with minimal degradation and large power savings at low-to-medium load. Finally, we show 36.2% router area reduction relative to buffered routing, with comparable timing.
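The first insight, that deflection port allocation maps onto a permutation network, can be sketched with a single 2x2 arbiter stage: each flit is steered toward the port it wants, and the loser of a conflict is deflected to the other port rather than buffered or dropped. A toy model under assumed conventions (a flit is `(id, wanted_port)`; the priority rule here is a placeholder for CHIPPER's implicit token passing):

```python
def arbiter_2x2(flit_a, flit_b):
    """One permutation-network stage: return the (port0, port1)
    assignment. On a conflict, flit_a wins its wanted port and
    flit_b is deflected to the other one (bufferless: every
    incoming flit always leaves on some port)."""
    out = [None, None]
    if flit_a is None and flit_b is None:
        return (None, None)
    if flit_b is None:
        out[flit_a[1]] = flit_a
        return tuple(out)
    if flit_a is None:
        out[flit_b[1]] = flit_b
        return tuple(out)
    if flit_a[1] != flit_b[1]:
        out[flit_a[1]] = flit_a          # no conflict: both satisfied
        out[flit_b[1]] = flit_b
        return tuple(out)
    out[flit_a[1]] = flit_a              # conflict: a wins...
    out[1 - flit_a[1]] = flit_b          # ...b is deflected
    return tuple(out)

p0, p1 = arbiter_2x2(("A", 0), ("B", 0))   # both want port 0
assert p0 == ("A", 0) and p1 == ("B", 0)   # B deflected, nothing buffered
```

Because each stage always emits exactly as many flits as it receives, port allocation needs no sequential loop over ports, which is what removes the slow critical path of prior bufferless designs.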
Citations: 224
CloudCache: Expanding and shrinking private caches
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749731
Hyunjin Lee, Sangyeun Cho, B. Childers
The number of cores in a single chip multiprocessor is expected to grow in coming years. Likewise, aggregate on-chip cache capacity is increasing fast and its effective utilization is becoming ever more important. Furthermore, available cores are expected to be underutilized due to the power wall and highly heterogeneous future workloads. This trend makes existing L2 cache management techniques less effective for two problems: increased capacity interference between working cores and longer L2 access latency. We propose a novel scalable cache management framework called CloudCache that creates dynamically expanding and shrinking L2 caches for working threads with fine-grained hardware monitoring and control. The key architectural components of CloudCache are L2 cache chaining, inter- and intra-bank cache partitioning, and a performance-optimized coherence protocol. Our extensive experimental evaluation demonstrates that CloudCache significantly improves performance of a wide range of workloads when all or a subset of cores are occupied.
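One possible flavor of the expand/shrink step, sketched as a rebalancing pass over per-thread L2 bank allocations: over-provisioned threads shrink to their measured demand, and the freed banks grow the neediest threads within the chip's bank budget. The policy below is an illustrative guess for exposition, not CloudCache's actual control algorithm:

```python
def rebalance(alloc, demand, total_banks):
    """alloc/demand: thread -> currently held banks / measured demand.
    Shrink over-provisioned threads first, then grow under-provisioned
    ones with the freed budget, never exceeding total_banks."""
    for t in alloc:
        if alloc[t] > demand[t]:
            alloc[t] = demand[t]                 # shrink: release banks
    spare = total_banks - sum(alloc.values())
    # Grow threads in order of largest unmet demand first.
    for t in sorted(alloc, key=lambda t: demand[t] - alloc[t], reverse=True):
        give = min(demand[t] - alloc[t], spare)
        alloc[t] += give
        spare -= give
    return alloc

# Thread t0 stopped using its capacity; t1's working set grew.
alloc = rebalance({"t0": 4, "t1": 4}, {"t0": 2, "t1": 6}, total_banks=8)
assert alloc == {"t0": 2, "t1": 6}
```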
Citations: 80
Relaxing non-volatility for fast and energy-efficient STT-RAM caches
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749716
Clinton Wills Smullen IV, Vidyabhushan Mohan, Anurag Nigam, S. Gurumurthi, M. Stan
Spin-Transfer Torque RAM (STT-RAM) is an emerging non-volatile memory technology and a potential universal memory that could replace SRAM in processor caches. This paper presents a novel approach for redesigning STT-RAM memory cells to reduce the high dynamic energy and slow write latencies. We lower the retention time by reducing the planar area of the cell, thereby reducing the write current, which we then use with CACTI to design caches and memories. We simulate quad-core processor designs using a combination of SRAM- and STT-RAM-based caches. Since ultra-low-retention STT-RAM may lose data, we also provide a preliminary evaluation of a simple, DRAM-style refresh policy. We found that a pure STT-RAM cache hierarchy provides the best energy efficiency, though a hybrid design of SRAM-based L1 caches with reduced-retention STT-RAM L2 and L3 caches eliminates performance loss while still reducing the energy-delay product by more than 70%.
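The retention trade-off admits a quick back-of-the-envelope check: once a cell's retention time drops below the lifetime of the data it holds, a conservative DRAM-style policy must sweep the cache once per retention window. A sketch with illustrative numbers (not taken from the paper):

```python
import math

def refreshes_needed(runtime_s, retention_s):
    """How many full-cache refresh passes a conservative DRAM-style
    policy performs over a run: one pass per retention window."""
    if retention_s >= runtime_s:
        return 0                 # data never outlives its retention
    return math.ceil(runtime_s / retention_s)

# A 10 ms retention cell over a 1 s window needs 100 refresh passes;
# an effectively non-volatile cell needs none.
assert refreshes_needed(1.0, 0.010) == 100
assert refreshes_needed(1.0, 10.0) == 0
```

The design question the paper evaluates is whether this refresh energy stays small enough that the faster, lower-current writes of a reduced-retention cell still win overall.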
Citations: 411
Power shifting in Thrifty Interconnection Network
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749725
Jian Li, Wei Huang, C. Lefurgy, Lixin Zhang, W. Denzel, Richard R. Treumann, Kun Wang
This paper presents two complementary techniques to manage the power consumption of large-scale systems with a packet-switched interconnection network. First, we propose the Thrifty Interconnection Network (TIN), where network links are activated and de-activated dynamically with little or no overhead, using inherent system events to trigger link activation or de-activation in a timely manner. Second, we propose Network Power Shifting (NPS), which dynamically shifts the power budget between the compute nodes and their corresponding network components. TIN activates and trains the links in the interconnection network just in time, before network communication is about to happen, and thriftily puts them into a low-power mode when communication is finished, hence reducing unnecessary network power consumption. Furthermore, the compute nodes can absorb the extra power budget shifted from their attached network components and increase their processor frequency for higher performance with NPS. Our simulation results on a set of real-world workload traces show that TIN can achieve on average a 60% network power reduction with the support of only one low-power mode. When NPS is enabled, the two together can achieve a 12% application performance improvement and a 13% overall system energy reduction. Further performance improvement is possible if the compute nodes can speed up more and fully utilize the extra power budget reinvested from the thrifty network with more aggressive cooling support.
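The NPS idea reduces to a small budget calculation: the watts an idled network no longer draws are reclaimed into the node's core budget and converted into extra frequency. A sketch assuming a linear power/frequency model; the function name and all numbers are illustrative assumptions, not the paper's model:

```python
def shift_budget(net_w, net_active_w, watts_per_ghz, base_ghz):
    """Return the boosted core frequency after reclaiming idle network
    power. net_w is the provisioned network power budget; net_active_w
    is what the (mostly idle) network actually needs right now."""
    reclaimed = max(0.0, net_w - net_active_w)   # watts freed by TIN
    return base_ghz + reclaimed / watts_per_ghz  # linear boost model

# 10 W network budget with links idled down to 4 W frees 6 W; at an
# assumed 3 W/GHz that is a 2 GHz boost over the 2 GHz base clock.
assert shift_budget(10.0, 4.0, 3.0, 2.0) == 4.0
```

In the real system the boost would also be capped by the node's thermal and voltage limits, which is why the paper notes that more aggressive cooling unlocks further gains.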
Citations: 34
MOPED: Orchestrating interprocess message data on CMPs
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749721
Junli Gu, S. Lumetta, Rakesh Kumar, Yihe Sun
Future CMPs will combine many simple cores with deep cache hierarchies. With more cores, cache resources per core are fewer, and must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit motion of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency and improve cache behavior. However, to make these models attractive, synchronization overhead and data copying must also be offloaded from the processors. In this paper, we describe a Message Orchestration and Performance Enhancement Device (MOPED) that provides hardware mechanisms to support state-of-the-art message passing protocols such as MPI. MOPED extends the per-processor cache controllers and coherence protocol to support message synchronization and management in hardware, to transfer message data efficiently without intermediate buffer copies, and to place useful data in caches in a timely manner. MOPED thus allows full overlap between communication and computation on the cores. We extended a 16-core full-system simulator based on Simics and FeS2. MOPED interacts with the directory controllers to orchestrate message data. We evaluated benefits to performance and coherence traffic by integrating MOPED into the MPICH runtime. Relative to unmodified MPI execution, MOPED reduces execution time of real applications (NAS Parallel Benchmarks) by 17–45% and of communication microbenchmarks (Intel's IMB) by 76–94%. Off-chip memory misses are reduced by 43–88% for applications and by 75–100% for microbenchmarks.
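The copy avoidance MOPED provides in hardware can be contrasted in plain software terms: a copying channel pays one copy on send and another on receive, while a zero-copy channel hands the receiver a view of the sender's buffer in the shared address space. A pure-software illustration only; MOPED does this in the cache hierarchy, not with Python objects:

```python
def send_with_copy(payload, channel):
    channel.append(bytes(payload))         # one copy into the channel

def recv_with_copy(channel):
    return bytearray(channel.pop(0))       # and another copy out

def send_zero_copy(payload, channel):
    channel.append(memoryview(payload))    # hand over a view, no copy

def recv_zero_copy(channel):
    return channel.pop(0)

buf = bytearray(b"hello")
ch = []
send_zero_copy(buf, ch)
view = recv_zero_copy(ch)
buf[0] = ord("H")                          # sender's buffer mutated...
assert bytes(view) == b"Hello"             # ...receiver sees same storage
```

The hardware analogue is stronger still: MOPED also moves the data into the receiver's cache ahead of use, so the consumer avoids both the copy and the miss.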
Citations: 3
Keynote address II: How's the parallel computing revolution going? 主题演讲二:并行计算革命进展如何?
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749730
K. McKinley
Two trends changed the computing landscape over the past decade: (1) hardware vendors started delivering chip multiprocessors (CMPs) instead of uniprocessors, and (2) software developers increasingly chose managed languages instead of native languages. Unfortunately, the former change is disrupting the virtuous-cycle between performance improvements and software innovation. Establishing a new parallel performance virtuous cycle for managed languages will require scalable applications executing on scalable Virtual Machine (VM) services, since the VM schedules, monitors, compiles, optimizes, garbage collects, and executes together with the application. This talk describes current progress, opportunities, and challenges for scalable VM services. The parallel computing revolution urgently needs more innovations.
在过去的十年中,有两个趋势改变了计算领域:(1)硬件供应商开始提供芯片多处理器(cmp)而不是单处理器,(2)软件开发人员越来越多地选择托管语言而不是本地语言。不幸的是,前一种变化正在破坏性能改进和软件创新之间的良性循环。为托管语言建立新的并行性能良性循环将需要在可伸缩的虚拟机(VM)服务上执行可伸缩的应用程序,因为VM与应用程序一起调度、监视、编译、优化、垃圾收集和执行。本次演讲描述了可扩展虚拟机服务的当前进展、机遇和挑战。并行计算革命迫切需要更多的创新。
{"title":"Keynote address II: How's the parallel computing revolution going?","authors":"K. McKinley","doi":"10.1109/HPCA.2011.5749730","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749730","url":null,"abstract":"Two trends changed the computing landscape over the past decade: (1) hardware vendors started delivering chip multiprocessors (CMPs) instead of uniprocessors, and (2) software developers increasingly chose managed languages instead of native languages. Unfortunately, the former change is disrupting the virtuous-cycle between performance improvements and software innovation. Establishing a new parallel performance virtuous cycle for managed languages will require scalable applications executing on scalable Virtual Machine (VM) services, since the VM schedules, monitors, compiles, optimizes, garbage collects, and executes together with the application. This talk describes current progress, opportunities, and challenges for scalable VM services. The parallel computing revolution urgently needs more innovations.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114538613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Abstraction and microarchitecture scaling in early-stage power modeling 早期功率建模中的抽象和微架构缩放
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749746
H. Jacobson, A. Buyuktosunoglu, P. Bose, E. Acar, R. Eickemeyer
Early-stage, microarchitecture-level power modeling methodologies have been used in industry and academic research for a decade (or more). Such methods use cycle-accurate performance simulators and deduce active power based on utilization markers. A key question faced in this context is: what key utilization metrics to monitor, and how many are needed for accuracy? Is there a systematic way to select the “best” markers? We also pose a key follow-on question: is it possible to perform accurate scaling of an abstracted model to enable exploration of new microarchitecture features? In this paper, we address these particular questions and examine the results for a range of abstraction levels. We highlight innovative insights for intelligent abstraction and microarchitecture scaling, and point out the pitfalls of abstractions that are not based on a systematic methodology or sound theory.
早期阶段的微架构级功率建模方法已经在工业和学术研究中使用了十年(或更长时间)。这种方法使用周期精确的性能模拟器,并根据利用率标记推断有功功率。在此上下文中面临的一个关键问题是:要监视哪些关键利用率指标,以及需要多少个指标才能达到准确性?是否有一个系统的方法来选择“最好”的标记?我们还提出了一个关键的后续问题:是否有可能对抽象模型进行精确缩放,从而探索新的微架构特性?在本文中,我们解决了这些特定的问题,并检查了一系列抽象级别的结果。我们强调了智能抽象和微架构扩展的创新见解,并指出了不基于系统方法或可靠理论的抽象的陷阱。
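Utilization-marker power models of the kind this paper abstracts are commonly cast as a linear form, P ≈ C₀ + Σᵢ wᵢ·uᵢ, where C₀ is the base (idle) power and each wᵢ is a per-marker weight fitted from training runs. A minimal sketch under that assumption follows; the markers, weights, and data are synthetic illustrations, not taken from the paper:

```python
import numpy as np

# Hypothetical linear power model: P = base + sum(w_i * utilization_i).
# Synthetic "training runs": each row is one run, each column a candidate
# utilization marker (e.g. issue-slot or cache-port activity counts).
rng = np.random.default_rng(0)
util = rng.random((50, 3))                  # 50 runs, 3 candidate markers
true_w = np.array([5.0, 2.0, 0.1])          # watts per unit utilization
base = 10.0                                 # idle power in watts
power = base + util @ true_w                # "measured" power (noise-free)

# Fit the weights with least squares; X gets a constant column so the
# base power is estimated along with the marker weights.
X = np.hstack([np.ones((50, 1)), util])
coef, *_ = np.linalg.lstsq(X, power, rcond=None)
print(coef)   # ~ [10.0, 5.0, 2.0, 0.1]
```

Markers whose fitted weight is near zero (the third one here) add little accuracy, which is one systematic way to decide which markers an abstracted model can drop.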
{"title":"Abstraction and microarchitecture scaling in early-stage power modeling","authors":"H. Jacobson, A. Buyuktosunoglu, P. Bose, E. Acar, R. Eickemeyer","doi":"10.1109/HPCA.2011.5749746","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749746","url":null,"abstract":"Early-stage, microarchitecture-level power modeling methodologies have been used in industry and academic research for a decade (or more). Such methods use cycle-accurate performance simulators and deduce active power based on utilization markers. A key question faced in this context is: what key utilization metrics to monitor, and how many are needed for accuracy? Is there a systematic way to select the “best” markers? We also pose a key follow-on question: is it possible to perform accurate scaling of an abstracted model to enable exploration of new microarchitecture features? In this paper, we address these particular questions and examine the results for a range of abstraction levels. We highlight innovative insights for intelligent abstraction and microarchitecture scaling, and point out the pitfalls of abstractions that are not based on a systematic methodology or sound theory.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"343 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134158605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 39
Journal
2011 IEEE 17th International Symposium on High Performance Computer Architecture