
Latest publications: Proceedings. International Symposium on Computer Architecture

Ten ways to waste a parallel computer
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555755
K. Yelick
As clock speed increases taper off and hardware designers struggle to scale parallelism within a chip, software developers and researchers must face the challenge of writing portable software with no clear architectural target. On the hardware side, energy considerations will dominate many of the design decisions, and will ultimately limit what systems and applications can be built. This is especially true at the high end, where the next major milestone of exascale computing will be unattainable without major improvements in efficiency. Although hardware designers have long worried about the efficiency of their designs, especially for battery-operated devices, software developers in general have not. To illustrate this point, I will describe some of the top ways to waste time and therefore energy waiting for communication, synchronization, or interactions with users or other systems. Data movement, rather than computation, is the big consumer of energy, yet software often moves data up and down the memory hierarchy or across a network multiple times. At the same time, hardware designers need to take into account the constraints of the computational problems that will run on their systems, as a design that is poorly matched to the computational requirements will end up being inefficient. Drawing on my own experience in scientific computing, I will give examples of how to make the combination of hardware, algorithms and software more efficient, but also describe some of the challenges that are inherent in the application problems we want to solve. The community needs to take an integrated approach to the problem, and consider how much business or science can be done per Joule, rather than optimizing a particular component of the system in isolation. This will require rethinking the algorithms, programming models, and hardware in concert, and therefore an unprecedented level of collaboration and cooperation between hardware and software designers.
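The talk's central claim, that data movement rather than computation dominates energy cost, has a simple software-level corollary: fuse passes over data so each element crosses the memory hierarchy once. A minimal sketch of that idea (my illustration, not an example from the talk):

```python
# Illustrative only: a toy proxy for the point that redundant passes over data
# move it through the memory hierarchy multiple times. Element traversals
# stand in for memory traffic.

def two_pass(data):
    scaled = [x * 2.0 for x in data]   # pass 1: reads data, writes a temporary
    return sum(scaled)                 # pass 2: reads the temporary again

def fused(data):
    return sum(x * 2.0 for x in data)  # one pass: each element touched once

data = list(range(1_000_000))
assert two_pass(data) == fused(data)
# two_pass streams roughly 2N elements plus an N-element temporary;
# fused streams N elements and allocates nothing extra.
```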
Citations: 14
InvisiFence: performance-transparent memory ordering in conventional multiprocessors
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555785
Colin Blundell, Milo M. K. Martin, T. Wenisch
A multiprocessor's memory consistency model imposes ordering constraints among loads, stores, atomic operations, and memory fences. Even for consistency models that relax ordering among loads and stores, ordering constraints still induce significant performance penalties due to atomic operations and memory ordering fences. Several prior proposals reduce the performance penalty of strongly ordered models using post-retirement speculation, but these designs either (1) maintain speculative state at a per-store granularity, causing storage requirements to grow proportionally to speculation depth, or (2) employ distributed global commit arbitration using unconventional chunk-based invalidation mechanisms. In this paper we propose InvisiFence, an approach for implementing memory ordering based on post-retirement speculation that avoids these concerns. InvisiFence leverages minimalistic mechanisms for post-retirement speculation proposed in other contexts to (1) track speculative state efficiently at block-granularity with dedicated storage requirements independent of speculation depth, (2) provide fast commit by avoiding explicit commit arbitration, and (3) operate under a conventional invalidation-based cache coherence protocol. InvisiFence supports both modes of operation found in prior work: speculating only when necessary to minimize the risk of rollback-inducing violations or speculating continuously to decouple consistency enforcement from the processor core. Overall, InvisiFence requires approximately one kilobyte of additional state to transform a conventional multiprocessor into one that provides performance-transparent memory ordering, fences, and atomic operations.
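A toy model of the paper's key storage idea, speculative state tracked at cache-block granularity so its size is independent of speculation depth, might look like the following (class and method names are mine, not the paper's):

```python
# Toy model (my naming, not the paper's RTL): speculative state is kept per
# cache block, so storage does not grow with the number of in-flight stores.

BLOCK = 64  # bytes per cache block

class SpeculativeEpoch:
    def __init__(self):
        self.spec_blocks = {}  # block address -> pre-speculation value
        self.memory = {}       # committed block values (stands in for the cache)

    def spec_store(self, addr, value):
        blk = addr // BLOCK * BLOCK
        # The first speculative touch of a block checkpoints its old value once;
        # later stores to the same block add no state (block granularity).
        self.spec_blocks.setdefault(blk, self.memory.get(blk))
        self.memory[blk] = value

    def invalidation(self, blk):
        # A remote invalidation hitting a speculatively-touched block signals a
        # possible ordering violation: roll the whole epoch back.
        if blk in self.spec_blocks:
            self.rollback()
            return True
        return False

    def commit(self):
        self.spec_blocks.clear()  # fast commit: just discard the checkpoints

    def rollback(self):
        for blk, old in self.spec_blocks.items():
            if old is None:
                self.memory.pop(blk, None)
            else:
                self.memory[blk] = old
        self.spec_blocks.clear()

epoch = SpeculativeEpoch()
epoch.memory[0] = 1
epoch.spec_store(8, 2)   # touches block 0 speculatively
epoch.invalidation(0)    # conflicting invalidation triggers rollback
assert epoch.memory[0] == 1
```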
Citations: 110
Disaggregated memory for expansion and sharing in blade servers
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555789
Kevin T. Lim, Jichuan Chang, T. Mudge, Parthasarathy Ranganathan, S. Reinhardt, T. Wenisch
Analysis of technology and application trends reveals a growing imbalance in the peak compute-to-memory-capacity ratio for future servers. At the same time, the fraction contributed by memory systems to total datacenter costs and power consumption during typical usage is increasing. In response to these trends, this paper re-examines traditional compute-memory co-location on a single system and details the design of a new general-purpose architectural building block-a memory blade-that allows memory to be "disaggregated" across a system ensemble. This remote memory blade can be used for memory capacity expansion to improve performance and for sharing memory across servers to reduce provisioning and power costs. We use this memory blade building block to propose two new system architecture solutions-(1) page-swapped remote memory at the virtualization layer, and (2) block-access remote memory with support in the coherence hardware-that enable transparent memory expansion and sharing on commodity-based systems. Using simulations of a mix of enterprise benchmarks supplemented with traces from live datacenters, we demonstrate that memory disaggregation can provide substantial performance benefits (on average 10X) in memory constrained environments, while the sharing enabled by our solutions can improve performance-per-dollar by up to 57% when optimizing memory provisioning across multiple servers.
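The first proposed solution, page-swapped remote memory, can be pictured as local DRAM acting as an LRU cache of pages backed by the memory blade. In the sketch below, the latency numbers and names are made-up placeholders, not figures from the paper:

```python
# Toy model of the page-swapped design point (latencies are invented numbers
# chosen only to show the accounting; names are mine, not the paper's).
from collections import OrderedDict

PAGE = 4096
LOCAL_PAGES = 4            # tiny local memory so the example actually swaps
LOCAL_NS, REMOTE_NS = 100, 2000

class PageSwappedMemory:
    def __init__(self):
        self.local = OrderedDict()   # page number -> present, in LRU order
        self.total_ns = 0
        self.faults = 0

    def access(self, addr):
        page = addr // PAGE
        if page in self.local:
            self.local.move_to_end(page)        # LRU hit in local DRAM
            self.total_ns += LOCAL_NS
        else:
            self.faults += 1
            if len(self.local) >= LOCAL_PAGES:
                self.local.popitem(last=False)  # evict coldest page to the blade
            self.local[page] = True             # pull the page from the blade
            self.total_ns += REMOTE_NS + LOCAL_NS

mem = PageSwappedMemory()
for a in range(0, 8 * PAGE, 512):   # working set of 8 pages, only 4 fit locally
    mem.access(a)
print(mem.faults, mem.total_ns)     # 8 faults; remote latency dominates
```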
Citations: 430
Application-aware deadlock-free oblivious routing
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555782
M. Kinsy, Myong Hyon Cho, Tina Wen, G. Suh, Marten van Dijk, S. Devadas
Conventional oblivious routing algorithms are either not application-aware or assume that each flow has its own private channel to ensure deadlock avoidance. We present a framework for application-aware routing that assures deadlock-freedom under one or more channels by forcing routes to conform to an acyclic channel dependence graph. Arbitrary minimal routes can be made deadlock-free through appropriate static channel allocation when two or more channels are available. Given bandwidth estimates for flows, we present a mixed integer-linear programming (MILP) approach and a heuristic approach for producing deadlock-free routes that minimize maximum channel load. The heuristic algorithm is calibrated using the MILP algorithm and evaluated on a number of benchmarks through detailed network simulation. Our framework can be used to produce application-aware routes that target the minimization of latency, number of flows through a link, bandwidth, or any combination thereof.
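The heuristic's flavor shows up in a familiar special case: on a 2D mesh with two channel classes, XY routes confined to one class and YX routes to the other each induce an acyclic channel dependence graph, so any per-flow choice between them is deadlock-free. The greedy assignment below is my illustration in that spirit, not the paper's algorithm:

```python
# Greedy sketch (mine, not the paper's MILP or heuristic): each flow picks XY
# or YX, whichever keeps the peak channel load lowest. The two route types use
# disjoint channel classes, so the combined dependence graph stays acyclic.
from collections import defaultdict

def route(src, dst, order):
    """Links of a minimal XY ('xy') or YX ('yx') route on a 2D mesh."""
    (x, y), (dx, dy) = src, dst
    hops = []
    def step_x():
        nonlocal x
        while x != dx:
            nx = x + (1 if dx > x else -1)
            hops.append(((x, y), (nx, y)))
            x = nx
    def step_y():
        nonlocal y
        while y != dy:
            ny = y + (1 if dy > y else -1)
            hops.append(((x, y), (x, ny)))
            y = ny
    if order == 'xy':
        step_x(); step_y()
    else:
        step_y(); step_x()
    return [(order, h) for h in hops]   # tag each hop with its channel class

def assign_routes(flows):
    """flows: list of (src, dst, bandwidth); greedily pick XY or YX per flow."""
    load = defaultdict(float)
    choices = []
    for src, dst, bw in sorted(flows, key=lambda f: -f[2]):   # heaviest first
        options = {o: route(src, dst, o) for o in ('xy', 'yx')}
        def peak(o):
            return max((load[c] + bw for c in options[o]), default=0.0)
        best = min(options, key=peak)
        for c in options[best]:
            load[c] += bw
        choices.append((src, dst, best))
    return choices, max(load.values(), default=0.0)

choices, peak_load = assign_routes([((0, 0), (2, 2), 3.0),
                                    ((0, 2), (2, 0), 2.0),
                                    ((0, 0), (2, 0), 1.0)])
print(choices, peak_load)   # the light flow is steered onto the YX class
```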
Citations: 85
Rigel: an architecture and scalable programming interface for a 1000-core accelerator
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555774
J. H. Kelm, Daniel R. Johnson, Matthew R. Johnson, N. Crago, W. Tuohy, Aqeel Mahesri, S. Lumetta, M. Frank, Sanjay J. Patel
This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model where parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications. We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45nm, which is comparable to high-end GPUs scaled to 45nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load-balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.
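A rough sketch of the task-centric, bulk-synchronized style the low-level interface exposes, with Python threads standing in for Rigel cores (an analogy of mine, not Rigel's actual API):

```python
# Toy sketch of a task-centric, bulk-synchronous SPMD pattern: workers pull
# tasks from a shared queue, then rendezvous at a barrier between phases.
import threading, queue

NUM_WORKERS = 4
tasks = queue.Queue()
barrier = threading.Barrier(NUM_WORKERS)
results, lock = [], threading.Lock()

def worker():
    while True:
        try:
            item = tasks.get_nowait()   # dynamically scheduled task dequeue
        except queue.Empty:
            break
        with lock:
            results.append(item * item)
    barrier.wait()   # bulk synchronization: all workers finish before moving on

for i in range(100):
    tasks.put(i)
threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print(len(results))   # 100
```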
Citations: 160
Stream chaining: exploiting multiple levels of correlation in data prefetching
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555767
Pedro Díaz, Marcelo H. Cintra
Data prefetching has long been an important technique to amortize the effects of the memory wall, and is likely to remain so in the current era of multi-core systems. Most prefetchers operate by identifying patterns and correlations in the miss address stream. Separating streams according to the memory access instruction that generates the misses is an effective way of filtering out spurious addresses from predictable streams. On the other hand, by localizing streams based on the memory access instructions, such prefetchers both lose the complete time sequence information of misses and can only issue prefetches for a single memory access instruction at a time. This paper proposes a novel class of prefetchers based on the idea of linking various localized streams into predictable chains of missing memory access instructions such that the prefetcher can issue prefetches along multiple streams. In this way the prefetcher is not limited to prefetching deeply for a single missing memory access instruction but can instead adaptively prefetch for other memory access instructions closer in time. Experimental results show that the proposed prefetcher consistently achieves better performance than a state-of-the-art prefetcher -- 10% on average, being only outperformed in very few cases and then by only 2%, and outperforming that prefetcher by as much as 55% -- while consuming the same amount of memory bandwidth.
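A much-simplified model of the proposal: keep per-PC stride streams, and also record which PC tends to miss next, so one miss can trigger prefetches along a chain of localized streams (my simplification, not the paper's hardware design):

```python
class ChainedPrefetcher:
    """Toy model: per-PC stride streams linked by observed miss order."""
    def __init__(self):
        self.last_addr = {}  # pc -> last miss address of that instruction
        self.stride = {}     # pc -> detected stride of its stream
        self.next_pc = {}    # pc -> pc that tends to miss next (the chain link)
        self.prev_pc = None

    def miss(self, pc, addr):
        prefetches = []
        if pc in self.last_addr:              # learn this PC's own stride
            self.stride[pc] = addr - self.last_addr[pc]
        self.last_addr[pc] = addr
        if self.prev_pc is not None:          # learn the inter-stream chain
            self.next_pc[self.prev_pc] = pc
        self.prev_pc = pc
        cur, seen = pc, set()                 # issue prefetches along the chain
        while cur is not None and cur not in seen:
            seen.add(cur)
            if cur in self.stride:
                prefetches.append(self.last_addr[cur] + self.stride[cur])
            cur = self.next_pc.get(cur)
        return prefetches

pf = ChainedPrefetcher()
for pc, addr in [(0xA, 0), (0xB, 100), (0xA, 8), (0xB, 164), (0xA, 16)]:
    print(hex(pc), pf.miss(pc, addr))
# The last two misses each trigger prefetches for *both* chained streams.
```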
Citations: 45
Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555786
Andrew D. Hilton, A. Roth
CPR/CFP (Checkpoint Processing and Recovery/Continual Flow Pipeline) support an adaptive instruction window that scales to tolerate last-level cache misses. CPR/CFP scale the register file by aggressively reclaiming the destination registers of many in-flight instructions. However, an analogous mechanism does not exist for stores and loads. As the window expands, CPR/CFP processors must track all in-flight stores and loads to support forwarding and detect memory ordering violations. The previously-described SVW (Store Vulnerability Window) and SQIP (Store Queue Index Prediction) schemes provide scalable, non-associative load and store queues, respectively. However, they don't work smoothly in a CPR/CFP context. SVW/SQIP rely on the ability to dynamically stall some loads until a specific older store writes to the cache. Enforcing this serialization in CPR/CFP is expensive if the load and store are in the same checkpoint. We introduce two complementary procedures that implement this serialization efficiently. Decoupled Store Completion (DSC) allows stores to write to the cache before the enclosing checkpoint completes execution. Silent Deterministic Replay (SDR) supports mis-speculation recovery in the presence of DSC by replaying loads older than completed stores using values from the load queue. The combination of DSC and SDR enables an SVW/SQIP based CPR/CFP memory system that outperforms previous designs while occupying less area.
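A loose sketch of how the two mechanisms might compose, with DSC letting a store write the cache before its checkpoint commits and SDR replaying older loads from values recorded in the load queue (heavily simplified, all naming mine):

```python
class Core:
    """Heavily simplified; 'mem' stands in for the cache."""
    def __init__(self):
        self.mem = {}
        self.load_queue = []   # (seq, addr, value) for in-flight loads

    def load(self, seq, addr):
        val = self.mem.get(addr, 0)
        self.load_queue.append((seq, addr, val))   # record the value observed
        return val

    def store_complete(self, addr, val):
        self.mem[addr] = val   # DSC: write the cache before checkpoint commit

    def replay_load(self, seq, addr, store_seq):
        # SDR: a load older than a completed store must replay "silently",
        # returning its recorded value rather than re-reading the cache.
        for s, a, v in self.load_queue:
            if s == seq and a == addr and s < store_seq:
                return v
        return self.mem.get(addr, 0)

c = Core()
c.mem[0x100] = 7
assert c.load(seq=1, addr=0x100) == 7        # older load observes 7
c.store_complete(0x100, 9)                   # younger store (seq 2) writes early
assert c.replay_load(seq=1, addr=0x100, store_seq=2) == 7   # replay stays silent
```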
Citations: 9
Hardware support for WCET analysis of hard real-time multicore systems
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555764
Marco Paolieri, E. Quiñones, F. Cazorla, G. Bernat, M. Valero
The increasing demand for new functionality in current and future hard real-time embedded systems, such as those in the automotive, avionics, and space industries, is driving an increase in the performance required of embedded processors. Multicore processors represent a good design solution for such systems due to their high performance, low cost, and low power consumption. However, hard real-time embedded systems require time analyzability, and current multicore processors are less analyzable than single-core processors due to the interference between different tasks when accessing shared hardware resources. In this paper we propose a multicore architecture with shared resources that allows the execution of applications with hard real-time and non-hard-real-time constraints at the same time, providing time analyzability for the hard real-time tasks so that they can meet their deadlines. Moreover, our proposed architecture provides high performance for the non-hard-real-time tasks.
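One kind of compositional bound such time-analyzable hardware enables: if arbitration guarantees that each shared-resource access can wait behind at most one access from every other core, the interference term is a simple product. This is a textbook-style bound, not the paper's exact analysis:

```python
# Generic interference-bound WCET calculation of the kind such hardware makes
# possible (illustrative parameters, not the paper's model or numbers).

def wcet_bound(wcet_isolation_cycles, shared_accesses, num_cores,
               access_latency_cycles):
    # Round-robin arbitration: each shared access waits for at most one
    # in-flight access from each of the other cores.
    max_extra_per_access = (num_cores - 1) * access_latency_cycles
    return wcet_isolation_cycles + shared_accesses * max_extra_per_access

# Example: 1M cycles alone, 10k shared-bus accesses, 4 cores, 40-cycle accesses.
print(wcet_bound(1_000_000, 10_000, 4, 40))   # 2,200,000-cycle safe bound
```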
Citations: 290
Internet-scale service infrastructure efficiency
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555756
James R. Hamilton
High-scale cloud services provide economies of scale of five to ten over small-scale deployments, and are becoming a large part of both enterprise information processing and consumer services. Even very large enterprise IT deployments have quite different cost drivers and optimization points from internet-scale services. The former are people-dominated from a cost perspective, whereas internet-scale service costs are driven by server hardware and infrastructure, with people costs fading into the noise at less than 10%. In this talk we inventory where the infrastructure costs are in internet-scale services. We track power distribution from 115KV at the property line through all conversions into the data center, tracking the losses through to final delivery at semiconductor voltage levels. We track cooling and all the energy conversions from power dissipation through release to the environment outside of the building. Understanding where the costs and inefficiencies lie, we'll look more closely at cooling and overall mechanical system design, server hardware design, and software techniques including graceful degradation mode, power yield management, and resource consumption shaping.
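The power-tracking exercise described here boils down to multiplying stage efficiencies along the delivery chain; the sketch below uses made-up placeholder efficiencies, not numbers from the talk:

```python
# Illustration of the talk's power accounting (stage efficiencies are invented
# placeholders): conversion losses compound multiplicatively down the chain.

stages = {                      # hypothetical chain from the 115KV feed to silicon
    "substation transformer": 0.997,
    "UPS": 0.94,
    "PDU": 0.98,
    "server power supply": 0.90,
    "on-board voltage regulators": 0.87,
}

delivered = 1.0
for name, efficiency in stages.items():
    delivered *= efficiency
print(f"fraction of utility power reaching silicon: {delivered:.1%}")
# ~71.9% with these placeholders: over a quarter of the power is lost before
# any computation happens, which is why each stage is worth tracking.
```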
Citations: 62
A memory system design framework: creating smart memories
Pub Date: 2009-06-15 DOI: 10.1145/1555754.1555805
A. Firoozshahian, A. Solomatnikov, Ofer Shacham, Zain Asgar, S. Richardson, C. Kozyrakis, M. Horowitz
As CPU cores become building blocks, we see a great expansion in the types of on-chip memory systems proposed for CMPs. Unfortunately, designing the cache and protocol controllers to support these memory systems is complex, and their concurrency and latency characteristics significantly affect the performance of any CMP. To address this problem, this paper presents a microarchitecture framework for cache and protocol controllers, which can aid in generating the RTL for new memory systems. The framework consists of three pipelined engines (request tracking, state manipulation, and data movement) which are programmed to implement a higher-level memory model. This approach simplifies the design and verification of CMP systems by decomposing the memory model into sequences of state and data manipulations. Moreover, implementing the framework itself produces a polymorphic memory system. To validate the approach, we implemented a scalable, flexible CMP in silicon. The memory system was then programmed to support three disparate memory models: cache-coherent shared memory, streams, and transactional memory. Measured overheads of this approach seem promising. Our system generates controllers with performance overheads of less than 20% compared to an ideal controller with zero internal latency. Even the overhead of directly implementing a fully programmable controller was modest. While it did double the controller's area, the amortized effective area in the system grew by roughly 7%.
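The decomposition can be caricatured in a few lines: a protocol becomes a table mapping (request, block state) pairs to short sequences of state-manipulation and data-movement steps run by generic engines, and reprogramming the table changes the memory model. Handlers and states below are mine, not the paper's:

```python
# Toy sketch of the paper's decomposition: generic engines run short handler
# sequences; swapping the handler table is what changes the memory model.

def read_shared(ctrl, addr):
    ctrl.set_state(addr, "S")                  # state-manipulation step
    ctrl.move(addr, src="mem", dst="cache")    # data-movement step

def write_exclusive(ctrl, addr):
    ctrl.set_state(addr, "M")
    ctrl.move(addr, src="mem", dst="cache")

class Controller:
    # Reprogramming this table is how the same hardware could support
    # different models (coherent shared memory, streaming, transactions).
    handlers = {("read", "I"): read_shared, ("write", "I"): write_exclusive}

    def __init__(self):
        self.state = {}   # block address -> protocol state, default "I"
        self.log = []     # record of data movements performed

    def set_state(self, addr, s):
        self.state[addr] = s

    def move(self, addr, src, dst):
        self.log.append((addr, src, dst))

    def request(self, kind, addr):
        self.handlers[(kind, self.state.get(addr, "I"))](self, addr)

c = Controller()
c.request("read", 0x40)
print(c.state[0x40], c.log)   # S [(64, 'mem', 'cache')]
```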
Citations: 23