2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA): Latest Papers

Redundant Memory Mappings for fast access to large memories

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2015-06-13 DOI: 10.1145/2749469.2749471

Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, A. Cristal, M. Hill, K. McKinley, M. Nemirovsky, M. Swift, O. Unsal

Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach 50% for modern workloads that access increasingly vast memory with stagnating TLB sizes. To reduce the overhead of virtual memory, this paper proposes Redundant Memory Mappings (RMM), which leverages ranges of pages and provides an efficient, alternative representation of many virtual-to-physical mappings. We define a range to be a subset of a process's pages that are virtually and physically contiguous. RMM translates each range with a single range table entry, enabling a modest number of entries to translate most of the process's address space. RMM operates in parallel with standard paging and uses a software range table and hardware range TLB with arbitrarily large reach. We modify the operating system to automatically detect ranges and to increase their likelihood with eager page allocation. RMM is thus transparent to applications. We prototype RMM software in Linux and emulate the hardware. RMM performs substantially better than paging alone and huge pages, and improves a wider variety of workloads than direct segments (one range per program), reducing the overhead of virtual memory to less than 1% on average.
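
The range translation described in the abstract can be illustrated with a minimal software sketch. It assumes 4 KiB base pages, a sorted list standing in for the software range table, and a plain dictionary standing in for the page table; the names `Range`, `RangeTable`, and `translate` are hypothetical and do not come from the paper. In the proposed design the range TLB is hardware that operates in parallel with paging, whereas this sketch simply tries the range lookup first.

```python
import bisect
from typing import Dict, List, NamedTuple, Optional

PAGE_SIZE = 4096  # assumption: 4 KiB base pages


class Range(NamedTuple):
    base_vpn: int   # first virtual page number covered by the range
    limit_vpn: int  # last virtual page number covered (inclusive)
    offset: int     # physical page number minus virtual page number


class RangeTable:
    """Hypothetical stand-in for the software range table."""

    def __init__(self, ranges: List[Range]) -> None:
        self._ranges = sorted(ranges)
        self._bases = [r.base_vpn for r in self._ranges]

    def lookup(self, vpn: int) -> Optional[Range]:
        # Find the last range whose base is <= vpn, then check its limit.
        i = bisect.bisect_right(self._bases, vpn) - 1
        if i >= 0 and vpn <= self._ranges[i].limit_vpn:
            return self._ranges[i]
        return None


def translate(vaddr: int, ranges: RangeTable, page_table: Dict[int, int]) -> int:
    """Translate a virtual address, preferring a range entry over a page entry."""
    vpn, page_off = divmod(vaddr, PAGE_SIZE)
    hit = ranges.lookup(vpn)
    if hit is not None:
        # One range entry covers every page between base and limit.
        return (vpn + hit.offset) * PAGE_SIZE + page_off
    # Fall back to the per-page mapping (raises KeyError if unmapped).
    return page_table[vpn] * PAGE_SIZE + page_off


# Illustrative data: one 256-page range plus one ordinary page mapping.
range_table = RangeTable([Range(base_vpn=0x100, limit_vpn=0x1FF, offset=0x400)])
page_table = {0x300: 0x7}
```

A single `Range` entry here replaces 256 individual page mappings, which is the effect the abstract attributes to range table entries.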

Citations: 151

Multiple Clone Row DRAM: A low latency and area optimized DRAM

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2015-06-13 DOI: 10.1145/2749469.2750402

Jungwhan Choi, Wongyu Shin, Jaemin Jang, Jinwoong Suh, Yongkee Kwon, Youngsuk Moon, L. Kim

Several previous works have changed the DRAM bank structure to reduce memory access latency and have shown performance improvements. However, changes to the area-optimized DRAM bank can incur a large area overhead. To solve this problem, we propose Multiple Clone Row DRAM (MCR-DRAM), which uses the existing DRAM bank structure without any modification.

Citations: 49

Harmonia: Balancing compute and memory power in high-performance GPUs

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2015-06-13 DOI: 10.1145/2749469.2750404

Indrani Paul, Wei Huang, Manish Arora, S. Yalamanchili

In this paper, we address the problem of efficiently managing the relative power demands of a high-performance GPU and its memory subsystem. We develop a management approach that dynamically tunes the hardware operating configurations to maintain balance between the power dissipated in compute versus memory access across GPGPU application phases. Our goal is to reduce power with minimal performance degradation. Accordingly, we construct predictors that assess the online sensitivity of applications to three hardware tunables: compute frequency, number of active compute units, and memory bandwidth. Using these sensitivity predictors, we propose a two-level coordinated power management scheme, Harmonia, which coordinates the hardware power states of the GPU and the memory system. Through hardware measurements on a commodity GPU, we evaluate Harmonia against a state-of-the-practice commodity GPU power management scheme, as well as an oracle scheme. Results show that Harmonia improves measured energy-delay squared (ED²) by up to 36% (12% on average) with negligible performance loss across representative GPGPU workloads, and on average is within 3% of the oracle scheme.
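
For readers unfamiliar with the energy-delay squared metric quoted above, the following sketch shows how it is conventionally computed (energy multiplied by the square of execution time) and how a relative improvement could be derived. It assumes that "improvement" means a fractional reduction relative to a baseline; the function names and the sample measurements are hypothetical and are not results from the paper.

```python
def ed2(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay squared: energy times the square of execution time."""
    return energy_joules * delay_seconds ** 2


def ed2_improvement(baseline: tuple, managed: tuple) -> float:
    """Fractional reduction in ED^2 of `managed` relative to `baseline`.

    Each argument is an (energy_joules, delay_seconds) pair.
    """
    return 1.0 - ed2(*managed) / ed2(*baseline)


# Hypothetical measurements: the managed run saves 20% energy
# while running 1% longer than the baseline.
baseline_run = (100.0, 10.0)
managed_run = (80.0, 10.1)
improvement = ed2_improvement(baseline_run, managed_run)
```

Because delay is squared, the metric penalizes slowdowns more heavily than energy alone would, which is consistent with the stated goal of reducing power with minimal performance degradation.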

Citations: 58

Flexible software profiling of GPU architectures

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2015-06-13 DOI: 10.1145/2749469.2750375

M. Stephenson, S. Hari, Yunsup Lee, Eiman Ebrahimi, Daniel R. Johnson, D. Nellans, Mike O'Connor, S. Keckler

To aid application characterization and architecture design space exploration, researchers and engineers have developed a wide range of tools for CPUs, including simulators, profilers, and binary instrumentation tools. With the advent of GPU computing, GPU manufacturers have developed similar tools leveraging hardware profiling and debugging hooks. To date, these tools are largely limited by the fixed menu of options provided by the tool developer and do not offer the user the flexibility to observe or act on events not in the menu. This paper presents SASSI (NVIDIA assembly code “SASS” Instrumentor), a low-level assembly-language instrumentation tool for GPUs. Like CPU binary instrumentation tools, SASSI allows a user to specify instructions at which to inject user-provided instrumentation code. These facilities allow strategic placement of counters and code into GPU assembly code to collect user-directed, fine-grained statistics at hardware speeds. SASSI instrumentation is inherently parallel, leveraging the concurrency of the underlying hardware. In addition to the details of SASSI, this paper provides four case studies that show how SASSI can be used to characterize applications and explore the architecture design space along the dimensions of instruction control flow, memory systems, value similarity, and resilience.

Citations: 92

Dynamic Thread Block Launch: A lightweight execution mechanism to support irregular applications on GPUs

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2015-06-13 DOI: 10.1145/2749469.2750393

Jin Wang, Norman Rubin, A. Sidelnik, S. Yalamanchili

GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data-intensive irregular applications such as graph analytics, relational databases, and machine learning. The recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but still falls short of being able to effectively harness the GPU's performance potential. We propose a new mechanism called Dynamic Thread Block Launch (DTBL) to extend the bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular data-intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves an average 1.21x speedup over the original flat implementation and an average 1.40x speedup over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.

Citations: 59

A variable warp size architecture

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2015-06-13 DOI: 10.1145/2749469.2750410

Timothy G. Rogers, Daniel R. Johnson, Mike O'Connor, S. Keckler

This paper studies the effect of warp sizing and scheduling on performance and efficiency in GPUs. We propose Variable Warp Sizing (VWS), which improves the performance of divergent applications by using a small base warp size in the presence of control flow and memory divergence. When appropriate, our proposed technique groups sets of these smaller warps together by ganging their execution in the warp scheduler, improving performance and energy efficiency for regular applications. Warp ganging is necessary to prevent performance degradation on regular workloads due to memory convergence slip, which results from the inability of smaller warps to exploit the same intra-warp memory locality as larger warps. This paper explores the effect of warp sizing on control flow divergence, memory divergence, and locality. For an estimated 5% area cost, our ganged scheduling microarchitecture results in a simulated 35% performance improvement on divergent workloads by allowing smaller groups of threads to proceed independently, and eliminates the performance degradation due to memory convergence slip that is observed when convergent applications are executed with smaller warp sizes.

Citations: 39

LaZy Superscalar

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2015-06-13 DOI: 10.1145/2749469.2750409

Görkem Asilioglu, Zhaoxiang Jin, Murat Köksal, Omkar Javeri, Soner Önder

LaZy Superscalar is a processor architecture that delays the execution of fetched instructions until their results are needed by other instructions. This approach eliminates dead instructions and provides the necessary means to fuse dependent instructions across multiple control dependencies by explicitly tracking control and data dependencies through a matrix-based scheduler. We present this novel redesign of scheduling, recovery, and commit mechanisms and evaluate the performance of the proposed architecture. Our simulations using the SPEC 2006 benchmark suite indicate that LaZy Superscalar can achieve significant speed-ups while providing respectable power savings compared to a conventional superscalar processor.

Citations: 4

Heracles: Improving resource efficiency at scale

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2015-06-13 DOI: 10.1145/2749469.2749475

David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, C. Kozyrakis

User-facing, latency-sensitive services, such as web search, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
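
The idea of a feedback-based controller that protects a latency target while handing spare resources to best-effort work can be sketched as a single control step. This is only an illustration under stated assumptions: it manages one resource (CPU cores), the 10% slack threshold and the one-core step size are invented for the example, and the function name `adjust_best_effort_cores` is hypothetical. The abstract does not describe the actual control rules, which cover several isolation mechanisms.

```python
def adjust_best_effort_cores(
    be_cores: int,
    tail_latency_ms: float,
    slo_ms: float,
    total_cores: int,
    min_lc_cores: int = 1,
    grow_slack: float = 0.10,  # assumption: illustrative threshold
) -> int:
    """Return the core count for best-effort (BE) tasks after one control step."""
    # Latency slack: fraction of the latency target that is still unused.
    slack = (slo_ms - tail_latency_ms) / slo_ms
    if slack < 0:
        # Target violated: reclaim every core for the latency-critical job.
        return 0
    if slack < grow_slack:
        # Close to the target: hold the current allocation.
        return be_cores
    # Comfortable slack: give BE one more core, keeping a reserve
    # of min_lc_cores for the latency-critical job.
    return min(be_cores + 1, total_cores - min_lc_cores)
```

Calling this function periodically with fresh latency measurements gives the feedback behaviour: best-effort work expands while the latency-critical service has headroom and is cut back as soon as the target is missed.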

Citations: 496

MBus: An ultra-low power interconnect bus for next generation nanopower systems

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2015-06-13 DOI: 10.1145/2749469.2750376

P. Pannuto, Yoonmyung Lee, Ye-Sheng Kuo, Z. Foo, B. Kempke, Gyouho Kim, R. Dreslinski, D. Blaauw, P. Dutta

As we show in this paper, I/O has become the limiting factor in scaling down size and power toward the goal of invisible computing. Achieving this goal will require composing optimized and specialized, yet reusable, components with an interconnect that permits tiny, ultra-low power systems. In contrast to today's interconnects, which are limited by power-hungry pull-ups or high-overhead chip-select lines, our approach provides a superset of common bus features but at lower power, with fixed area and pin count, using fully synthesizable logic, and with surprisingly low protocol overhead. We present MBus, a new 4-pin, 22.6 pJ/bit/chip chip-to-chip interconnect made of two "shoot-through" rings. MBus facilitates ultra-low power system operation by implementing automatic power-gating of each chip in the system, easing the integration of active, inactive, and activating circuits on a single die. In addition, we introduce a new bus primitive: power-oblivious communication, which guarantees message reception regardless of the recipient's power state when a message is sent. This disentangles power management from communication, greatly simplifying the creation of viable, modular, and heterogeneous systems that operate on the order of nanowatts. To evaluate the viability, power, performance, overhead, and scalability of our design, we build both hardware and software implementations of MBus and show its seamless operation across two FPGAs and twelve custom chips from three different semiconductor processes. A three-chip, 2.2 mm³ MBus system draws 8 nW of total system standby power and uses only 22.6 pJ/bit/chip for communication. This is the lowest power for any system bus with MBus's feature set.
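
The two figures quoted in the abstract (22.6 pJ/bit/chip for communication and 8 nW of standby power for a three-chip system) can be combined in a short back-of-the-envelope calculation. The sketch assumes that every chip on the bus spends 22.6 pJ for each bit transferred and it ignores protocol overhead; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
ENERGY_PJ_PER_BIT_PER_CHIP = 22.6  # from the abstract
STANDBY_POWER_NW = 8.0             # three-chip system, from the abstract


def message_energy_pj(payload_bits: int, chips: int) -> float:
    """Energy in picojoules to move a payload, assuming every chip pays per bit."""
    return ENERGY_PJ_PER_BIT_PER_CHIP * payload_bits * chips


def break_even_bit_rate(chips: int) -> float:
    """Bit rate (bits/s) at which communication power equals standby power."""
    standby_pj_per_second = STANDBY_POWER_NW * 1000.0  # 1 nW = 1000 pJ/s
    return standby_pj_per_second / (ENERGY_PJ_PER_BIT_PER_CHIP * chips)


# Example: an 8-byte payload on a three-chip system.
energy_8_bytes = message_energy_pj(payload_bits=64, chips=3)
rate = break_even_bit_rate(chips=3)
```

Under these assumptions, a three-chip system would have to sustain on the order of a hundred bits per second before communication energy matched its standby draw, which illustrates why standby power matters at this scale.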

Citations: 16

Probable cause: The deanonymizing effects of approximate DRAM

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

Pub Date : 2015-06-13 DOI: 10.1145/2749469.2750419

Amir Rahmati, Matthew Hicks, Daniel E. Holcomb, Kevin Fu

Approximate computing research seeks to trade off the accuracy of computation for increases in performance or reductions in power consumption. The observation driving approximate computing is that many applications tolerate small amounts of error, which allows for an opportunistic relaxation of guard bands (e.g., clock rate and voltage). Besides affecting performance and power, reducing guard bands exposes analog properties of traditionally digital components. For DRAM, one analog property exposed by approximation is the variability of memory cell decay times. In this paper, we show how the differing cell decay times of approximate DRAM create an error pattern that serves as a system-identifying fingerprint. To validate this observation, we build an approximate memory platform and perform experiments showing that the fingerprint due to approximation is device-dependent and resilient to changes in environment and level of approximation. To identify a DRAM chip given an approximate output, we develop a distance metric that yields a two-orders-of-magnitude difference between the distance separating approximate results produced by the same DRAM chip and the distance separating results produced by different DRAM chips. We use these results to create a mathematical model of approximate DRAM that we leverage to explore the end-to-end deanonymizing effects of approximate memory using a commodity system running an image manipulation program. The results from our experiment show that, given fewer than 100 approximate outputs, the fingerprint for an approximate DRAM begins to converge to a single, machine-identifying fingerprint.
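
The abstract does not give the distance metric itself, so the following is only a minimal sketch of the general idea. It assumes an error pattern can be represented as the set of bit positions where an approximate output differs from the exact one, and it substitutes the standard Jaccard distance for the authors' metric. The function names (`error_positions`, `jaccard_distance`, `identify`), the chip labels, and the 0.5 threshold are all hypothetical and chosen for illustration.

```python
from typing import Dict, Optional, Set


def error_positions(exact: bytes, approx: bytes) -> Set[int]:
    """Return the bit indices at which the approximate output differs from the exact one."""
    if len(exact) != len(approx):
        raise ValueError("outputs must have the same length")
    positions: Set[int] = set()
    for byte_index, (a, b) in enumerate(zip(exact, approx)):
        diff = a ^ b  # set bits mark cells whose stored value was corrupted
        for bit in range(8):
            if diff & (1 << bit):
                positions.add(byte_index * 8 + bit)
    return positions


def jaccard_distance(p: Set[int], q: Set[int]) -> float:
    """Jaccard distance between two error patterns (an assumed stand-in metric)."""
    union = p | q
    if not union:
        return 0.0  # two empty patterns are treated as identical
    return 1.0 - len(p & q) / len(union)


def identify(
    errors: Set[int],
    fingerprints: Dict[str, Set[int]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Optional[str]:
    """Return the label of the closest known fingerprint, or None if none is close enough."""
    best = min(
        fingerprints,
        key=lambda chip: jaccard_distance(errors, fingerprints[chip]),
        default=None,
    )
    if best is None or jaccard_distance(errors, fingerprints[best]) > threshold:
        return None
    return best


if __name__ == "__main__":
    exact = bytes([0b00000000, 0b11111111])
    approx = bytes([0b00000101, 0b11111110])
    observed = error_positions(exact, approx)
    known = {"chipA": {0, 2, 8, 11}, "chipB": {5, 6, 7}}
    print(sorted(observed), identify(observed, known))
```

A set-based metric is used here because decay-prone cells are a sparse subset of all cells, so overlap between error sets is a natural similarity signal; the actual metric and thresholds should be taken from the paper.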


Citations: 31