
2011 38th Annual International Symposium on Computer Architecture (ISCA): Latest Publications

Rapid identification of architectural bottlenecks via precise event counting
Pub Date : 2011-06-04 DOI: 10.1145/2000064.2000107
J. Demme, S. Sethumadhavan
On-chip performance counters play a vital role in computer architecture research due to their ability to quickly provide insights into application behaviors that are time consuming to characterize with traditional methods. The usefulness of modern performance counters, however, is limited by the inefficient techniques used today to access them. Current access techniques rely on imprecise sampling or heavyweight kernel interaction, forcing users to choose between precision and speed and thus restricting the use of performance counter hardware. In this paper, we describe new methods that enable precise, lightweight interfacing to on-chip performance counters. These low-overhead techniques allow precise reading of virtualized counters in low tens of nanoseconds, which is one to two orders of magnitude faster than current access techniques. Further, these tools provide several fresh insights into the behavior of modern parallel programs such as MySQL and Firefox, which were previously obscured (or impossible to obtain) by existing methods for characterization. Based on case studies with our new access methods, we discuss seven implications for computer architects in the cloud era and three methods for enhancing hardware counters further. Taken together, these observations have the potential to open up new avenues for architecture research.
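To make the precision-versus-speed point concrete, the toy model below contrasts sampling-based attribution with reading a counter at every code-section boundary. The synthetic event trace, section labels, and sampling period are all invented for illustration; this is not the authors' counter interface, just a sketch of why precise per-section reads recover counts that coarse sampling blurs.

```python
# Toy comparison of sampled vs. precise per-section event attribution.
# Illustrative model only: the event stream, sections, and period are invented.

import random

random.seed(0)

# Synthetic trace: each entry is (code_section, events_in_that_step).
trace = [(random.choice("ABC"), random.randint(0, 10)) for _ in range(10_000)]

# Ground truth: exact events per section.
truth = {}
for section, events in trace:
    truth[section] = truth.get(section, 0) + events

# Sampling: attribute the running counter delta to whichever section happens
# to be active when the sample fires (every `period` steps).
period = 97
sampled = {}
counter = last = 0
for step, (section, events) in enumerate(trace):
    counter += events
    if step % period == 0:
        sampled[section] = sampled.get(section, 0) + (counter - last)
        last = counter

# Precise: read the counter at every section boundary (skid-free attribution).
precise = {}
prev_section, acc = trace[0][0], 0
for section, events in trace:
    if section != prev_section:
        precise[prev_section] = precise.get(prev_section, 0) + acc
        prev_section, acc = section, 0
    acc += events
precise[prev_section] = precise.get(prev_section, 0) + acc

for s in sorted(truth):
    print(s, "truth", truth[s], "sampled", sampled.get(s, 0), "precise", precise.get(s, 0))
```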
Citations: 74
Benefits and limitations of tapping into stored energy for datacenters
Pub Date : 2011-06-04 DOI: 10.1145/2000064.2000105
Sriram Govindan, A. Sivasubramaniam, B. Urgaonkar
A datacenter's power consumption has a significant impact on both its recurring electricity bill (Op-ex) and its one-time construction cost (Cap-ex). Existing work optimizing these costs has relied primarily on throttling devices or workload shaping, both with performance-degrading implications. In this paper, we present a novel energy buffer (eBuff) knob, available in the form of the UPS batteries already present in datacenters, for this cost optimization. Intuitively, eBuff stores energy in UPS batteries during “valleys” (periods of lower demand) and drains it during “peaks” (periods of higher demand). UPS batteries are normally used as a fail-over mechanism to transition to captive power sources upon utility failure. Furthermore, frequent discharges can cause UPS batteries to fail prematurely. We conduct a detailed analysis of battery operation to identify feasible operating regions given these battery-lifetime and datacenter-availability concerns. Using insights learned from this analysis, we develop peak reduction algorithms that combine the UPS battery knob with existing throttling-based techniques for minimizing datacenter power costs. Using an experimental platform, we offer insights about the Op-ex savings offered by eBuff for a wide range of workload peaks/valleys, UPS provisioning, and application SLA constraints. We find that eBuff can be used to realize a 15-45% peak power reduction, corresponding to 6-18% savings in Op-ex across this spectrum. eBuff can also play a role in reducing Cap-ex by allowing tighter overbooking of power infrastructure components, and we quantify the extent of such Cap-ex savings. To our knowledge, this is the first paper to exploit stored energy, which typically lies untapped in the datacenter, to address the peak power draw problem.
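The core mechanism described above (charge the UPS battery in demand valleys, discharge it at demand peaks so the utility draw stays under a cap) can be illustrated with a small simulation. The demand curve, battery capacity, and cap below are invented for this example and are not the paper's algorithms or numbers; it is only a sketch of the valley/peak intuition.

```python
# Toy peak-shaving model in the spirit of eBuff: charge the UPS battery in
# demand valleys and discharge it at peaks so the draw from the utility is
# capped. All numbers are made up for illustration.

demand_kw = [300, 320, 480, 520, 500, 350, 310, 300]  # hourly datacenter demand
cap_kw = 420          # target ceiling on utility draw
battery_kwh = 250     # usable UPS energy buffer (1-hour steps, so kW ~ kWh)
charge = battery_kwh  # start fully charged

utility_kw = []
for d in demand_kw:
    if d > cap_kw and charge > 0:
        drain = min(d - cap_kw, charge)      # discharge during a peak
        charge -= drain
        utility_kw.append(d - drain)
    else:
        refill = min(cap_kw - d, battery_kwh - charge) if d < cap_kw else 0
        charge += refill                     # recharge during a valley
        utility_kw.append(d + refill)

print("raw peak:", max(demand_kw), "kW  shaved peak:", max(utility_kw), "kW")
```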
Citations: 231
DBAR: An efficient routing algorithm to support multiple concurrent applications in networks-on-chip
Pub Date : 2011-06-04 DOI: 10.1145/2000064.2000113
Sheng Ma, Natalie D. Enright Jerger, Zhiying Wang
With the emergence of many-core architectures, it is quite likely that multiple applications will run concurrently on a system. Existing locally and globally adaptive routing algorithms largely overlook issues associated with workload consolidation. The shortsightedness of locally adaptive routing algorithms limits performance due to poor network congestion avoidance. Globally adaptive routing algorithms attack this issue by introducing a congestion propagation network to obtain network status information beyond neighboring nodes. However, they may suffer from intra- and inter-application interference during output port selection for consolidated workloads, coupling the behavior of otherwise independent applications and negatively affecting performance. To address these two issues, we propose Destination-Based Adaptive Routing (DBAR). We design a novel low-cost congestion propagation network that leverages both local and non-local network information for more accurate congestion estimates. Thus, DBAR offers effective adaptivity for congestion beyond neighboring nodes. More importantly, by integrating the destination into the selection function, DBAR mitigates intra- and inter-application interference and offers dynamic isolation among regions. Experimental results show that DBAR can offer better performance than the best baseline algorithm for all measured configurations; it is well suited for workload consolidation. The wiring overhead of DBAR is low and DBAR provides improvement in the energy-delay product for medium and high injection rates.
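As a rough illustration of the selection idea (combining local congestion information with a non-local, destination-keyed term so that unrelated applications do not skew the choice), here is a minimal sketch. The mesh coordinates, credit counts, and scoring rule are assumptions made up for this example; the actual DBAR selection function and congestion propagation network are defined in the paper.

```python
# Minimal sketch of destination-aware adaptive output-port selection in a 2D
# mesh. Congestion numbers and the scoring rule are invented for illustration.

def candidate_ports(cur, dst):
    """Productive directions toward dst (deadlock avoidance not modeled)."""
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx > cx: ports.append("E")
    if dx < cx: ports.append("W")
    if dy > cy: ports.append("N")
    if dy < cy: ports.append("S")
    return ports

def select_port(cur, dst, local_free_credits, dest_column_congestion):
    """Score each productive port with local free credits minus a non-local
    congestion term keyed by the destination column."""
    best, best_score = None, float("-inf")
    for p in candidate_ports(cur, dst):
        score = local_free_credits[p] - dest_column_congestion.get((p, dst[0]), 0)
        if score > best_score:
            best, best_score = p, score
    return best

# Example: route from (1, 1) to (3, 2); east is locally free but its path
# toward column 3 is congested, so the selector prefers going north first.
print(select_port((1, 1), (3, 2),
                  {"E": 4, "W": 2, "N": 3, "S": 1},
                  {("E", 3): 5, ("N", 3): 0}))
```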
Citations: 181
Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators
Pub Date : 2011-06-04 DOI: 10.1145/2000064.2000080
Yunsup Lee, Rimas Avizienis, Alex Bishara, R. Xia, Derek Lockhart, C. Batten, K. Asanović
We present a taxonomy and modular implementation approach for data-parallel accelerators, including the MIMD, vector-SIMD, subword-SIMD, SIMT, and vector-thread (VT) architectural design patterns. We have developed a new VT microarchitecture, Maven, based on the traditional vector-SIMD microarchitecture that is considerably simpler to implement and easier to program than previous VT designs. Using an extensive design-space exploration of full VLSI implementations of many accelerator design points, we evaluate the varying tradeoffs between programmability and implementation efficiency among the MIMD, vector-SIMD, and VT patterns on a workload of microbenchmarks and compiled application kernels. We find the vector cores provide greater efficiency than the MIMD cores, even on fairly irregular kernels. Our results suggest that the Maven VT microarchitecture is superior to the traditional vector-SIMD architecture, providing both greater efficiency and easier programmability.
Citations: 54
SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading
Pub Date : 2011-06-04 DOI: 10.1145/2000064.2000094
Wing-Kei S. Yu, Ruirui C. Huang, Sarah Q. Xu, Sung-En Wang, E. Kan, G. Suh
Large register files are common in highly multi-threaded architectures such as GPUs. This paper presents a hybrid memory design that tightly integrates embedded DRAM into SRAM cells, with a main application of reducing the area and power consumption of multi-threaded register files. In the hybrid memory, each SRAM cell is augmented with multiple DRAM cells so that multiple bits can be stored in each cell. Because the DRAM cells are compact, this configuration results in significant area and energy savings compared to an SRAM array of the same capacity. On the other hand, the hybrid memory requires explicit data movements in order to access DRAM contexts. To minimize the context-switching impact, we introduce write-back buffers, background context switching, and context-aware thread scheduling to the processor pipeline and the scheduler. Circuit and architecture simulations of GPU benchmark suites show significant savings in register file area (38%) and energy (68%) over the traditional SRAM implementation, with minimal (1.4%) performance loss.
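A small behavioral model can illustrate the explicit data movement the abstract mentions: each fast (SRAM-like) register entry backs several slow (DRAM-like) per-context copies, and a context switch swaps one copy in and another out. The class below is an illustrative assumption about the behavior only, not the circuit design; the write-back buffers and background switching that hide this cost are merely noted in a comment.

```python
# Toy model of a register file whose fast (SRAM-like) entries back several
# slow (DRAM-like) contexts, with explicit swap-in/swap-out on a context
# switch. Structure and sizes are illustrative assumptions.

class HybridRegFile:
    def __init__(self, num_regs=8, contexts=4):
        self.fast = [0] * num_regs                              # active SRAM copy
        self.slow = [[0] * num_regs for _ in range(contexts)]   # per-context DRAM copies
        self.active = 0

    def read(self, r):            # fast-path access hits the SRAM copy
        return self.fast[r]

    def write(self, r, v):
        self.fast[r] = v

    def switch_to(self, ctx):
        """Explicit data movement: write back the active context, load the next.
        Write-back buffers / background switching would hide this latency."""
        self.slow[self.active] = list(self.fast)
        self.fast = list(self.slow[ctx])
        self.active = ctx

rf = HybridRegFile()
rf.write(0, 42)       # thread context 0 writes r0
rf.switch_to(1)       # swap context 0 out, context 1 in
rf.write(0, 7)
rf.switch_to(0)
print(rf.read(0))     # 42: context 0's value survived the switch
```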
Citations: 83
Vantage: Scalable and efficient fine-grain cache partitioning
Pub Date : 2011-06-04 DOI: 10.1145/2000064.2000073
Daniel Sánchez, C. Kozyrakis
Cache partitioning has a wide range of uses in CMPs, from guaranteeing quality of service and controlled sharing to security-related techniques. However, existing cache partitioning schemes (such as way-partitioning) are limited to coarse-grain allocations, can only support a few partitions, and reduce cache associativity, hurting performance. Hence, these techniques can only be applied to CMPs with 2-4 cores, but fail to scale to tens of cores. We present Vantage, a novel cache partitioning technique that overcomes the limitations of existing schemes: caches can have tens of partitions with sizes specified at cache line granularity, while maintaining high associativity and strong isolation among partitions. Vantage leverages cache arrays with good hashing and associativity, which enable soft-pinning a large portion of cache lines. It enforces capacity allocations by controlling the replacement process. Unlike prior schemes, Vantage provides strict isolation guarantees by partitioning most (e.g. 90%) of the cache instead of all of it. Vantage is derived from analytical models, which allow us to provide strong guarantees and bounds on associativity and sizing independent of the number of partitions and their behaviors. It is simple to implement, requiring around 1.5% state overhead and simple changes to the cache controller. We evaluate Vantage using extensive simulations. On a 32-core system, using 350 multiprogrammed workloads and one partition per core, partitioning the last-level cache with conventional techniques degrades throughput for 71% of the workloads versus an unpartitioned cache (by 7% on average, 25% maximum degradation), even when using 64-way caches. In contrast, Vantage improves throughput for 98% of the workloads, by 8% on average (up to 20%), using a 4-way cache.
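The general idea of enforcing partition sizes through the replacement decision rather than by reserving ways can be sketched with a toy fully-associative cache: on a miss, evict from whichever partition is most over its target size. This is only a simplified illustration under invented parameters; it is not the Vantage controller and omits the managed/unmanaged region split and the analytical sizing machinery.

```python
# Toy illustration of enforcing partition sizes via the replacement decision.
# Not the Vantage algorithm; capacity and targets are invented.

from collections import OrderedDict

class PartitionedCache:
    def __init__(self, capacity, targets):       # targets: partition -> lines
        self.capacity = capacity
        self.targets = targets
        self.lru = OrderedDict()                  # addr -> partition, LRU order
        self.sizes = {p: 0 for p in targets}

    def access(self, addr, part):
        if addr in self.lru:
            self.lru.move_to_end(addr)            # hit: refresh recency
            return True
        if len(self.lru) >= self.capacity:        # miss at capacity: evict from
            victim_part = max(self.sizes,         # the most over-quota partition
                              key=lambda p: self.sizes[p] - self.targets[p])
            for a, p in self.lru.items():
                if p == victim_part:              # oldest line of that partition
                    del self.lru[a]
                    self.sizes[p] -= 1
                    break
        self.lru[addr] = part
        self.sizes[part] += 1
        return False

cache = PartitionedCache(capacity=8, targets={"A": 6, "B": 2})
for i in range(6):
    cache.access(("A", i), "A")
for i in range(6):
    cache.access(("B", i), "B")                   # B overshoots, then self-evicts
print(cache.sizes)                                # occupancy stays near the targets
```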
Citations: 247
The impact of memory subsystem resource sharing on datacenter applications
Pub Date : 2011-06-04 DOI: 10.1145/2000064.2000099
Lingjia Tang, Jason Mars, Neil Vachharajani, R. Hundt, M. Soffa
In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC benchmark suite, we find that across these datacenter applications, there is both a sizable benefit and a potential degradation from improperly sharing resources. In this paper, we first present a study of the importance of thread-to-core mappings for applications in the datacenter as threads can be mapped to share or to not share caches and bus bandwidth. Second, we investigate the impact of co-locating threads from multiple applications with diverse memory behavior and discover that the best mapping for a given application changes depending on its co-runner. Third, we investigate the application characteristics that impact performance in the various thread-to-core mapping scenarios. Finally, we present both a heuristics-based and an adaptive approach to arrive at good thread-to-core decisions in the datacenter. We observe performance swings of up to 25% for web search and 40% for other key applications, simply based on how application threads are mapped to cores. By employing our adaptive thread-to-core mapper, the performance of the datacenter applications presented in this work improved by up to 22% over status quo thread-to-core mapping and performs within 3% of optimal.
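One way to picture the mapping problem is as a search over thread-to-core assignments with a cost for co-locating interfering threads on cores that share a cache. The sketch below uses an invented per-thread cache-intensity score and a tiny 4-core topology; it is not the paper's heuristic or adaptive mapper, only an illustration of why the best mapping depends on the co-runner.

```python
# Toy search over thread-to-core mappings with a made-up interference cost:
# co-locating two cache-heavy threads on cores that share an LLC is penalized.

from itertools import permutations

threads = {"search": 0.9, "bigtable": 0.8, "stitch": 0.3, "protobuf": 0.2}  # cache intensity (invented)
shared_llc_pairs = [(0, 1), (2, 3)]   # cores 0-1 share an LLC, cores 2-3 share another

def cost(mapping):                    # mapping: tuple of thread names, index = core id
    total = 0.0
    for a, b in shared_llc_pairs:
        total += threads[mapping[a]] * threads[mapping[b]]   # contention when co-located
    return total

best = min(permutations(threads), key=cost)
print("best mapping (core0..core3):", best, "cost", round(cost(best), 2))
```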
Citations: 238
Scalable power control for many-core architectures running multi-threaded applications
Pub Date : 2011-06-04 DOI: 10.1145/2000064.2000117
Kai Ma, Xue Li, Ming Chen, Xiaorui Wang
Optimizing the performance of a multi-core microprocessor within a power budget has recently received a lot of attention. However, most existing solutions are centralized and cannot scale well with the rapidly increasing level of core integration. While a few recent studies propose power control algorithms for many-core architectures, those solutions assume that the workload of every core is independent and therefore cannot effectively allocate power based on thread criticality to accelerate multi-threaded parallel applications, which are expected to be the primary workloads of many-core architectures. This paper presents a scalable power control solution for many-core microprocessors that is specifically designed to handle realistic workloads, i.e., a mixed group of single-threaded and multi-threaded applications. Our solution features a three-layer design. First, we adopt control theory to precisely control the power of the entire chip to its chip-level budget by adjusting the aggregated frequency of all the cores on the chip. Second, we dynamically group cores running the same applications and then partition the chip-level aggregated frequency quota among different groups for optimized overall microprocessor performance. Finally, we partition the group-level frequency quota among the cores in each group based on the measured thread criticality for shorter application completion time. As a result, our solution can optimize the microprocessor performance while precisely limiting the chip-level power consumption below the desired budget. Empirical results on a 12-core hardware testbed show that our control solution can provide precise power control, as well as 17% and 11% better application performance than two state-of-the-art solutions, on average, for mixed PARSEC and SPEC benchmarks. Furthermore, our extensive simulation results for 32, 64, and 128 cores, as well as overhead analysis for up to 4,096 cores, demonstrate that our solution is highly scalable to many-core architectures.
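The three-layer structure (a chip-level budget, a per-application-group split, and a per-core split driven by thread criticality) can be sketched as a simple proportional allocation. All numbers and the proportional rules below are illustrative assumptions; the paper's solution uses feedback control rather than this one-shot split.

```python
# Toy sketch of the three-layer allocation: cap the chip-wide frequency sum,
# split it across application groups, then split each group's share across
# its cores by thread criticality. Rules and numbers are invented.

chip_freq_budget = 16.0           # GHz summed over all cores (stands in for power)

groups = {                        # application -> {core_id: thread criticality}
    "app0": {0: 0.9, 1: 0.5, 2: 0.4},
    "app1": {3: 0.7, 4: 0.7},
}

# Layer 2: split the chip budget across groups (here: by thread count).
total_cores = sum(len(cores) for cores in groups.values())
group_quota = {g: chip_freq_budget * len(c) / total_cores for g, c in groups.items()}

# Layer 3: within a group, give critical (slower) threads a larger share.
core_freq = {}
for g, cores in groups.items():
    weight = sum(cores.values())
    for core, crit in cores.items():
        core_freq[core] = group_quota[g] * crit / weight

print({c: round(f, 2) for c, f in sorted(core_freq.items())})
print("chip total:", round(sum(core_freq.values()), 2), "of", chip_freq_budget)
```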
Citations: 155
A case for heterogeneous on-chip interconnects for CMPs
Pub Date : 2011-06-04 DOI: 10.1145/2000064.2000111
Asit K. Mishra, N. Vijaykrishnan, C. Das
Network-on-chip (NoC) has become a critical shared resource in the emerging Chip Multiprocessor (CMP) era. Most prior NoC designs have used the same type of router across the entire network. While this homogeneous network design eases the burden on a network designer, partitioning the resources equally among all routers across the network does not lead to optimal resource usage and hence affects the performance-power envelope. In this work, we propose to apportion the resources in an NoC to leverage the non-uniformity in network resource demand. Our proposal includes partitioning the network resources, specifically buffers and links, in an optimal manner. This approach redistributes resources such that routers that require more resources are allocated more buffers and wider links than routers demanding fewer resources. The result is a novel heterogeneous network, called HeteroNoC, which is composed of two types of routers: small, power-efficient routers and big, high-performance routers. We evaluate a number of heterogeneous network configurations composed of big and small routers, and show that giving more resources to routers along the diagonals of a mesh network provides the maximum benefit in terms of performance and power. We also show the potential benefits of the HeteroNoC design by co-evaluating it with memory controllers and configuring it with an asymmetric CMP consisting of heterogeneous cores.
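The apportioning idea (bigger routers where demand concentrates, smaller ones elsewhere) reduces to a simple layout rule in a mesh. The sketch below places a hypothetical "big" configuration on the diagonals of an N x N mesh; the buffer and link-width numbers are invented for illustration.

```python
# Minimal sketch of the resource-apportioning idea: in an N x N mesh, place
# "big" routers (more buffers, wider links) on the diagonals and "small"
# routers elsewhere. Concrete numbers are invented.

def hetero_mesh(n, big=(8, 256), small=(2, 128)):
    """Return {(x, y): (buffers_per_port, link_width_bits)}."""
    layout = {}
    for x in range(n):
        for y in range(n):
            on_diagonal = (x == y) or (x + y == n - 1)
            layout[(x, y)] = big if on_diagonal else small
    return layout

mesh = hetero_mesh(4)
for y in range(3, -1, -1):                       # print top row first
    print(" ".join("B" if mesh[(x, y)] == (8, 256) else "s" for x in range(4)))
```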
Citations: 129
Energy-efficient mechanisms for managing thread context in throughput processors
Pub Date : 2011-06-04 DOI: 10.1145/2000064.2000093
Mark Gebhart, Daniel R. Johnson, D. Tarjan, S. Keckler, W. Dally, Erik Lindholm, K. Skadron
Modern graphics processing units (GPUs) use a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we examine register file caching to replace accesses to the large main register file with accesses to a smaller structure containing the immediate register working set of active threads. Second, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Combined with register file caching, a two-level thread scheduler provides a further reduction in energy by limiting the allocation of temporary register cache resources to only the currently active subset of threads. We show that on average, across a variety of real world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59% respectively. We further show that the active thread count can be reduced by a factor of 4 with minimal impact on performance, resulting in a 36% reduction of register file energy.
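A toy version of the two-level scheduler can show how the active set and the register file cache interact: only active threads hold cache entries, and a thread that issues a long-latency access is demoted to the pending set while a pending thread takes its slot. Set sizes and the miss event below are illustrative assumptions, not the GPU pipeline evaluated in the paper.

```python
# Toy two-level thread scheduler: a small active set hides short latencies and
# holds register-file-cache entries; threads that issue a long-latency memory
# access are demoted to the pending set. Sizes and events are invented.

from collections import deque

active = deque([0, 1])              # threads eligible to issue this cycle
pending = deque([2, 3, 4, 5])       # threads waiting on long-latency misses
rf_cache = {t: {} for t in active}  # register cache allocated only to active threads

def long_latency_miss(tid):
    """Demote tid: release its register-cache entries and promote a pending thread."""
    active.remove(tid)
    rf_cache.pop(tid, None)          # dirty entries would be written back to the main RF (not modeled)
    pending.append(tid)
    nxt = pending.popleft()
    active.append(nxt)
    rf_cache[nxt] = {}

long_latency_miss(0)                 # thread 0 misses in memory
print("active:", list(active), "pending:", list(pending))
# active: [1, 2] pending: [3, 4, 5, 0]
```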
Citations: 267