
Proceedings of the 40th Annual International Symposium on Computer Architecture: latest publications

Agile, efficient virtualization power management with low-latency server power states
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485931
C. Isci, S. McIntosh, J. Kephart, R. Das, James E. Hanson, Scott Piper, Robert R. Wolford, Tom Brey, Robert Kantner, Allen Ng, J. Norris, Abdoulaye Traore, M. Frissora
One of the main driving forces of the growing adoption of virtualization is its dramatic simplification of the provisioning and dynamic management of IT resources. By decoupling running entities from the underlying physical resources, and by providing easy-to-use controls to allocate, deallocate and migrate virtual machines (VMs) across physical boundaries, virtualization opens up new opportunities for improving overall system resource use and power efficiency. While a range of techniques for dynamic, distributed resource management of virtualized systems have been proposed and have seen widespread adoption in enterprise systems, similar techniques for dynamic power management have seen limited acceptance. The main barrier to dynamic, power-aware virtualization management stems not from the limitations of virtualization, but rather from the underlying physical systems; in particular, the high latency and energy cost of the power-state change actions suited to virtualization power management. In this work, we first explore the feasibility of low-latency power states for enterprise server systems and demonstrate, with real prototypes, their quantitative energy-performance trade-offs compared to traditional server power states. Then, we demonstrate an end-to-end power-aware virtualization management solution leveraging these states, and evaluate the dramatically favorable power-performance characteristics achievable with such systems. We show, via both real-system implementations and scale-out simulations, that virtualization power management with low-latency server power states can achieve overheads comparable to those of baseline distributed resource management in virtualized systems, and thus can benefit from the same level of adoption, while delivering close to energy-proportional power efficiency.
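To make the combination of consolidation and low-latency power states concrete, the following minimal Python sketch shows the kind of policy such a manager could apply: pack VMs onto as few hosts as possible and put idle hosts into a fast sleep state. The host capacities, power figures, wake latency, and state names are illustrative assumptions, not values from the paper.

# Minimal sketch of a power-aware VM consolidation pass: pack VMs onto as few
# hosts as possible, then put idle hosts into a low-latency sleep state.
# Capacities, power figures, and latencies below are illustrative assumptions.

ACTIVE_W, SLEEP_W = 200.0, 10.0      # assumed host power in each state (watts)
WAKE_LATENCY_MS = 50.0               # assumed low-latency power-state exit time

def consolidate(vm_loads, host_capacity, num_hosts):
    """Greedy first-fit-decreasing packing of VM loads onto hosts."""
    hosts = [0.0] * num_hosts        # utilization currently placed on each host
    placement = {}
    for vm, load in sorted(vm_loads.items(), key=lambda kv: -kv[1]):
        for h in range(num_hosts):
            if hosts[h] + load <= host_capacity:
                hosts[h] += load
                placement[vm] = h
                break
        else:
            raise RuntimeError(f"no host can fit VM {vm}")
    states = ["active" if u > 0 else "sleep" for u in hosts]
    return placement, states

def cluster_power(states):
    return sum(ACTIVE_W if s == "active" else SLEEP_W for s in states)

if __name__ == "__main__":
    vms = {"vm0": 0.4, "vm1": 0.3, "vm2": 0.2, "vm3": 0.1}
    placement, states = consolidate(vms, host_capacity=1.0, num_hosts=4)
    print(placement, states, f"{cluster_power(states):.0f} W",
          f"worst-case wake penalty {WAKE_LATENCY_MS} ms")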
Citations: 49
STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485946
Islam Atta, Pınar Tözün, Xin Tong, A. Ailamaki, Andreas Moshovos
Online transaction processing (OLTP) workload performance suffers from instruction stalls; the instruction footprint of a typical transaction exceeds by far the capacity of an L1 cache, leading to ongoing cache thrashing. Several proposed techniques remove some instruction stalls in exchange for error-prone instrumentation to the code base, or a sharp increase in the L1-I cache unit area and power. Others reduce instruction miss latency by better utilizing a shared L2 cache. SLICC [2], a recently proposed thread migration technique that exploits transaction instruction locality, is promising for high core counts but performs sub-optimally or may hurt performance when running on few cores. This paper corroborates that OLTP transactions exhibit significant intra- and inter-thread overlap in their instruction footprint, and analyzes the instruction stall reduction benefits. This paper presents STREX, a hardware, programmer-transparent technique that exploits typical transaction behavior to improve instruction reuse in first level caches. STREX time-multiplexes the execution of similar transactions dynamically on a single core so that instructions fetched by one transaction are reused by all other transactions executing in the system as much as possible. STREX dynamically slices the execution of each transaction into cache-sized segments simply by observing when blocks are brought in the cache and when they are evicted. Experiments show that, when compared to baseline execution on 2--16 cores, STREX consistently improves performance while reducing the number of L1 instruction and data misses by 37% and 14% on average, respectively. Finally, this paper proposes a practical hybrid technique that combines STREX and SLICC, thereby guaranteeing performance benefits regardless of the number of available cores and the workload's footprint.
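As a rough illustration of the time-multiplexing idea, here is a toy Python sketch that runs similar transactions in lock-step, cache-sized segments so blocks fetched by one transaction can be reused by the others before eviction. Sizing segments by counting distinct instruction blocks is a simplifying assumption; the paper infers segment boundaries from actual cache fills and evictions.

# Toy illustration of stratified execution across similar transactions.
SEGMENT_BLOCKS = 4   # assumed number of distinct I-cache blocks per segment

def run_stratified(transactions):
    """Each transaction is a list of instruction-block ids, in program order."""
    cursors = [0] * len(transactions)
    schedule = []                       # (txn id, blocks executed) per segment
    while any(c < len(t) for c, t in zip(cursors, transactions)):
        for tid, txn in enumerate(transactions):
            seen, start = set(), cursors[tid]
            i = start
            while i < len(txn) and len(seen | {txn[i]}) <= SEGMENT_BLOCKS:
                seen.add(txn[i])
                i += 1
            if i > start:
                schedule.append((tid, txn[start:i]))
                cursors[tid] = i
    return schedule

if __name__ == "__main__":
    # Two similar transactions: interleaving them per segment lets the second
    # reuse the blocks the first just brought into the L1-I.
    t0 = ["A", "B", "C", "D", "E", "F", "G", "H"]
    t1 = ["A", "B", "C", "D", "E", "F", "G", "H"]
    for tid, blocks in run_stratified([t0, t1]):
        print(f"txn{tid}: {blocks}")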
Citations: 23
Improving memory scheduling via processor-side load criticality information
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485930
Saugata Ghose, Hyo-Gun Lee, José F. Martínez
We hypothesize that performing processor-side analysis of load instructions, and providing this pre-digested information to memory schedulers judiciously, can increase the sophistication of memory decisions while maintaining a lean memory controller that can take scheduling actions quickly. This is increasingly important as DRAM frequencies continue to increase relative to processor speed. In this paper we propose one such mechanism, pairing up a processor-side load criticality predictor with a lean memory controller that prioritizes load requests based on ranking information supplied from the processor side. Using a sophisticated multi-core simulator that includes a detailed quad-channel DDR3 DRAM model, we demonstrate that this mechanism can improve performance significantly on a CMP, with minimal overhead and virtually no changes to the processor itself. We show that our design compares favorably to several state-of-the-art schedulers.
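The paper's predictor and scheduler details are not reproduced here, but the following small Python sketch shows the general shape of a lean controller that consumes a processor-side criticality rank: prefer requests tagged critical, then row-buffer hits, then the oldest request. The tie-break order and field names are assumptions for illustration only.

# Minimal sketch of a memory controller consuming a processor-side
# load-criticality rank when picking the next request for a bank.
from dataclasses import dataclass, field
from itertools import count

_arrival = count()

@dataclass
class Request:
    addr_row: int
    criticality: int                 # higher = more critical (from the core)
    arrival: int = field(default_factory=lambda: next(_arrival))

def pick_next(queue, open_row):
    """Select the next request for a bank whose currently open row is open_row."""
    def key(req):
        return (-req.criticality,                        # critical loads first
                0 if req.addr_row == open_row else 1,    # then row-buffer hits
                req.arrival)                             # then oldest request
    return min(queue, key=key) if queue else None

if __name__ == "__main__":
    q = [Request(addr_row=7, criticality=0),
         Request(addr_row=3, criticality=2),
         Request(addr_row=7, criticality=2)]
    nxt = pick_next(q, open_row=7)
    print("service row", nxt.addr_row, "criticality", nxt.criticality)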
Citations: 81
Robust architectural support for transactional memory in the power architecture
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485942
Harold W. Cain, Maged M. Michael, Brad Frey, C. May, Derek Williams, Hung Q. Le
On the twentieth anniversary of the original publication [10], following ten years of intense activity in the research literature, hardware support for transactional memory (TM) has finally become a commercial reality, with HTM-enabled chips currently or soon-to-be available from many hardware vendors. In this paper we describe architectural support for TM added to a future version of the Power ISA™. Two imperatives drove the development: the desire to complement our weakly-consistent memory model with a more friendly interface to simplify the development and porting of multithreaded applications, and the need for robustness beyond that of some early implementations. In the process of commercializing the feature, we had to resolve some previously unexplored interactions between TM and existing features of the ISA, for example translation shootdown, interrupt handling, atomic read-modify-write primitives, and our weakly consistent memory model. We describe these interactions, the overall architecture, and discuss the motivation and rationale for our choices of architectural semantics, beyond what is typically found in reference manuals.
Citations: 127
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485953
Minsoo Rhu, M. Erez
Current GPUs maintain high programmability by abstracting the SIMD nature of the hardware as independent concurrent threads of control with hardware responsible for generating predicate masks to utilize the SIMD hardware for different flows of control. This dynamic masking leads to poor utilization of SIMD resources when the control of different threads in the same SIMD group diverges. Prior research suggests that SIMD groups be formed dynamically by compacting a large number of threads into groups, mitigating the impact of divergence. To maintain hardware efficiency, however, the alignment of a thread to a SIMD lane is fixed, limiting the potential for compaction. We observe that control frequently diverges in a manner that prevents compaction because of the way in which the fixed alignment of threads to lanes is done. This paper presents an in-depth analysis on the causes for ineffective compaction. An important observation is that in many cases, control diverges because of programmatic branches, which do not depend on input data. This behavior, when combined with the default mapping of threads to lanes, severely restricts compaction. We then propose SIMD lane permutation (SLP) as an optimization to expand the applicability of compaction in such cases of lane alignment. SLP seeks to rearrange how threads are mapped to lanes to allow even programmatic branches to be compacted effectively, improving SIMD utilization up to 34% accompanied by a maximum 25% performance boost.
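The lane-alignment problem can be seen in a tiny Python model: two divergent warps can be merged only if no lane holds an active thread in both, and a programmatic branch (e.g., taken by even-numbered threads) makes fixed-aligned warps collide on exactly the same lanes. The XOR-based permutation below is an illustrative stand-in for the lane-permutation schemes the paper discusses, not their exact construction.

# Toy model of divergence compaction with fixed versus permuted lane mapping.
LANES = 4

def lanes_used(active_threads, permute_key=0):
    # Thread t normally occupies lane t % LANES; the permutation XORs in a
    # per-warp key so that identically aligned branches stop colliding.
    return {(t % LANES) ^ permute_key for t in active_threads}

def can_compact(warp_a, warp_b, key_a=0, key_b=0):
    return not (lanes_used(warp_a, key_a) & lanes_used(warp_b, key_b))

if __name__ == "__main__":
    # A "programmatic" branch taken by even threads in both warps: with fixed
    # alignment both warps need the same lanes, so compaction fails.
    warp0_active = [0, 2]      # threads 0..3, even ones took the branch
    warp1_active = [4, 6]      # threads 4..7, even ones took the branch
    print("fixed alignment:", can_compact(warp0_active, warp1_active))
    # Permuting warp 1's lane mapping shifts its threads onto the idle lanes.
    print("permuted:       ", can_compact(warp0_active, warp1_active,
                                           key_a=0, key_b=1))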
Citations: 53
Bit mapping for balanced PCM cell programming
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485959
Yu Du, Miao Zhou, B. Childers, D. Mossé, R. Melhem
Write bandwidth is an inherent performance bottleneck for Phase Change Memory (PCM) for two reasons. First, PCM cells have long programming times, and second, only a limited number of PCM cells can be programmed concurrently due to programming current and write circuit constraints. For each PCM write, the data bits of the write request are typically mapped to multiple cell groups and processed in parallel. We observed that an unbalanced distribution of modified data bits among cell groups significantly increases PCM write time and hurts effective write bandwidth. To address this issue, we first uncover the cyclical and cluster patterns of modified data bits. Next, we propose double XOR mapping (D-XOR) to distribute modified data bits among cell groups in a balanced way. D-XOR can reduce PCM write service time by 45% on average, which increases PCM write throughput by 1.8x. As error correction (redundant bits) is critical for PCM, we also consider the impact of redundancy information when mapping data and error-correction bits to cell groups. Our techniques lead to a 51% average reduction in write service time for a PCM main memory with ECC, which increases IPC by 12%.
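The write-time argument can be illustrated with a short Python sketch: each cell group can program only a few bits per round, so the busiest group determines service time, and clustered flips (e.g., one hot byte) are what a balanced mapping must break up. The XOR-based shuffle used here is a simplified stand-in for the paper's D-XOR construction; the group count and per-round limit are assumptions.

# Why the distribution of modified bits across cell groups bounds write time.
GROUPS = 8
BITS = 64
BITS_PER_ROUND = 2     # assumed concurrent-programming limit per cell group

def modified_bits(old, new):
    diff = old ^ new
    return [i for i in range(BITS) if (diff >> i) & 1]

def write_rounds(modified, mapping):
    per_group = [0] * GROUPS
    for bit in modified:
        per_group[mapping(bit)] += 1
    busiest = max(per_group)
    return -(-busiest // BITS_PER_ROUND)     # ceiling division

naive   = lambda bit: bit // (BITS // GROUPS)        # contiguous byte-per-group
shuffle = lambda bit: (bit ^ (bit >> 3)) % GROUPS    # assumed XOR-style spread

if __name__ == "__main__":
    old, new = 0x00000000000000FF, 0x0000000000000000   # one hot byte flips
    bits = modified_bits(old, new)
    print("naive mapping  :", write_rounds(bits, naive), "rounds")
    print("XOR-style map  :", write_rounds(bits, shuffle), "rounds")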
Citations: 38
Catnap: energy proportional multiple network-on-chip
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485950
R. Das, S. Narayanasamy, Sudhir K. Satpathy, R. Dreslinski
Multiple networks have been used in several processor implementations to scale bandwidth and ensure protocol-level deadlock freedom for different message classes. In this paper, we observe that a multiple-network design is also attractive from a power perspective and can be leveraged to achieve energy proportionality by effective power gating. Unlike a single-network design, a multiple-network design is more amenable to power gating, as its subnetworks (subnets) can be power gated without compromising the connectivity of the network. To exploit this opportunity, we propose the Catnap architecture which consists of synergistic subnet selection and power-gating policies. Catnap maximizes the number of consecutive idle cycles in a router, while avoiding performance loss due to overloading a subnet. We evaluate a 256-core processor with a concentrated mesh topology using synthetic traffic and 35 applications. We show that the average network power of a power-gating optimized multiple-network design with four subnets could be 44% lower than a bandwidth equivalent single-network design for an average performance cost of about 5%.
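A simplified Python sketch of a Catnap-style policy follows: keep only as many subnets awake as the offered load needs, steer injected traffic to the lowest-index awake subnet, and gate the rest. The utilization thresholds and the exact wake/gate rule are illustrative assumptions rather than the paper's policies.

# Simplified subnet selection and power gating for a multiple network-on-chip.
SUBNETS = 4
WAKE_THRESHOLD = 0.7     # per-subnet utilization that triggers waking another
SLEEP_THRESHOLD = 0.2    # utilization below which the top awake subnet gates

def select_subnet(util, awake):
    """Pick the lowest-index awake subnet that still has headroom."""
    for s in range(SUBNETS):
        if awake[s] and util[s] < WAKE_THRESHOLD:
            return s
    return next(s for s in range(SUBNETS) if awake[s])   # all busy: first awake

def adjust_power_state(util, awake):
    top = max(s for s in range(SUBNETS) if awake[s])
    if util[top] > WAKE_THRESHOLD and top + 1 < SUBNETS:
        awake[top + 1] = True                    # demand rising: wake one more
    elif util[top] < SLEEP_THRESHOLD and top > 0:
        awake[top] = False                       # demand falling: gate the top

if __name__ == "__main__":
    awake = [True, False, False, False]
    util = [0.9, 0.0, 0.0, 0.0]
    adjust_power_state(util, awake)
    print("awake subnets:", awake, "-> inject on", select_subnet(util, awake))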
Citations: 156
Convolution engine: balancing efficiency & flexibility in specialized computing
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485925
W. Qadeer, R. Hameed, Ofer Shacham, P. Venkatesan, C. Kozyrakis, M. Horowitz
This paper focuses on the trade-off between flexibility and efficiency in specialized computing. We observe that specialized units achieve most of their efficiency gains by tuning data storage and compute structures and their connectivity to the data-flow and data-locality patterns in the kernels. Hence, by identifying key data-flow patterns used in a domain, we can create efficient engines that can be programmed and reused across a wide range of applications. We present an example, the Convolution Engine (CE), specialized for the convolution-like data-flow that is common in computational photography, image processing, and video processing applications. CE achieves energy efficiency by capturing data reuse patterns, eliminating data transfer overheads, and enabling a large number of operations per memory access. We quantify the tradeoffs in efficiency and flexibility and demonstrate that CE is within a factor of 2-3x of the energy and area efficiency of custom units optimized for a single kernel. CE improves energy and area efficiency by 8-15x over a SIMD engine for most applications.
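The data-flow pattern the engine exploits can be sketched in a few lines of Python: a sliding window keeps recently fetched pixels in local storage, so each memory access feeds many multiply-accumulates. The 1D form, the load counting, and the filter used are illustrative assumptions; the actual engine is a hardware unit over 2D image data.

# Sliding-window convolution showing data reuse per memory access.
def convolve_1d(signal, taps):
    k = len(taps)
    window = signal[:k - 1][::-1]          # local "shift register" of pixels
    out, loads = [], k - 1
    for x in signal[k - 1:]:
        window.insert(0, x)                # one new pixel loaded per output...
        loads += 1
        out.append(sum(w * t for w, t in zip(window, taps)))  # ...k MACs reuse it
        window.pop()
    return out, loads

if __name__ == "__main__":
    y, loads = convolve_1d([1, 2, 3, 4, 5, 6], [0.25, 0.5, 0.25])
    print(y, f"{loads} loads for {len(y) * 3} multiply-accumulates")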
Citations: 187
Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485974
Hailong Yang, Alex D. Breslow, Jason Mars, Lingjia Tang
Ensuring the quality of service (QoS) for latency-sensitive applications while allowing co-locations of multiple applications on servers is critical for improving server utilization and reducing cost in modern warehouse-scale computers (WSCs). Recent work relies on static profiling to precisely predict the QoS degradation that results from performance interference among co-running applications to increase the number of "safe" co-locations. However, these static profiling techniques have several critical limitations: 1) a priori knowledge of all workloads is required for profiling, 2) it is difficult for the prediction to capture or adapt to phase or load changes of applications, and 3) the prediction technique is limited to only two co-running applications. To address all of these limitations, we present Bubble-Flux, an integrated dynamic interference measurement and online QoS management mechanism to provide accurate QoS control and maximize server utilization. Bubble-Flux uses a Dynamic Bubble to probe servers in real time to measure the instantaneous pressure on the shared hardware resources and precisely predict how the QoS of a latency-sensitive job will be affected by potential co-runners. Once "safe" batch jobs are selected and mapped to a server, Bubble-Flux uses an Online Flux Engine to continuously monitor the QoS of the latency-sensitive application and control the execution of batch jobs to adapt to dynamic input, phase, and load changes to deliver satisfactory QoS. Batch applications remain in a state of flux throughout execution. Our results show that the utilization improvement achieved by Bubble-Flux is up to 2.2x better than the prior static approach.
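A minimal Python sketch in the spirit of the Online Flux Engine is shown below: each epoch the controller compares the measured QoS of the latency-sensitive job against a target and adjusts the duty cycle during which co-located batch jobs may run. The target, step size, and control rule are illustrative assumptions, not the paper's controller.

# Epoch-based duty-cycle controller for co-located batch jobs.
QOS_TARGET = 0.95      # fraction of baseline performance we must preserve
STEP = 0.10            # duty-cycle adjustment per epoch

def next_duty_cycle(current, measured_qos):
    """Phase batch work out when QoS slips, back in when there is slack."""
    if measured_qos < QOS_TARGET:
        return max(0.0, current - STEP)
    return min(1.0, current + STEP)

if __name__ == "__main__":
    duty = 1.0
    for qos in [0.99, 0.93, 0.90, 0.96, 0.97]:   # per-epoch QoS measurements
        duty = next_duty_cycle(duty, qos)
        print(f"measured QoS {qos:.2f} -> batch duty cycle {duty:.1f}")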
Citations: 370
Resilient die-stacked DRAM caches
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485958
Jaewoong Sim, G. Loh, Vilas Sridharan, Mike O'Connor
Die-stacked DRAM can provide large amounts of in-package, high-bandwidth cache storage. For server and high-performance computing markets, however, such DRAM caches must also provide sufficient support for reliability and fault tolerance. While conventional off-chip memory provides ECC support by adding one or more extra chips, this may not be practical in a 3D stack. In this paper, we present a DRAM cache organization that uses error-correcting codes (ECCs), strong checksums (CRCs), and dirty data duplication to detect and correct a wide range of stacked DRAM failures, from traditional bit errors to large-scale row, column, bank, and channel failures. With only a modest performance degradation compared to a DRAM cache with no ECC support, our proposal can correct all single-bit failures, and 99.9993% of all row, column, and bank failures, providing more than a 54,000x improvement in the FIT rate of silent-data corruptions compared to basic SECDED ECC protection.
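The read-path control flow can be sketched as follows in Python: a strong checksum detects corruption, single-bit errors are corrected, dirty lines fall back to a duplicate copy, and clean lines can simply be refetched from memory. The brute-force single-bit correction is a stand-in for real SECDED ECC, and the organization is simplified from the paper's; line width and field names are assumptions.

# Control-flow sketch of a resilient DRAM-cache line read.
import zlib

LINE_BITS = 64

def crc(value):
    return zlib.crc32(value.to_bytes(LINE_BITS // 8, "little"))

def read_line(stored, stored_crc, dirty, duplicate=None, memory_copy=None):
    if crc(stored) == stored_crc:
        return stored                                   # common case: no error
    for bit in range(LINE_BITS):                        # "SECDED" stand-in:
        candidate = stored ^ (1 << bit)                 # try every single-bit flip
        if crc(candidate) == stored_crc:
            return candidate                            # single-bit error corrected
    if dirty and duplicate is not None:
        return duplicate                                # large failure, dirty line
    return memory_copy                                  # clean line: refetch

if __name__ == "__main__":
    data = 0xDEADBEEFCAFEF00D
    good_crc = crc(data)
    print(hex(read_line(data ^ (1 << 17), good_crc, dirty=False)))            # 1-bit error
    print(hex(read_line(data ^ 0xFF, good_crc, dirty=True, duplicate=data)))  # multi-bit error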
Citations: 48