
2011 38th Annual International Symposium on Computer Architecture (ISCA): Latest Publications

Demand-driven software race detection using hardware performance counters
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000084
J. Greathouse, Zhiqiang Ma, M. Frank, R. Peri, T. Austin
Dynamic data race detectors are an important mechanism for creating robust parallel programs. Software race detectors instrument the program under test, observe each memory access, and watch for inter-thread data sharing that could lead to concurrency errors. While this method of bug hunting can find races that are normally difficult to observe, it also suffers from high runtime overheads. It is not uncommon for commercial race detectors to experience 300× slowdowns, limiting their usage. This paper presents a hardware-assisted demand-driven race detector. We are able to observe cache events that are indicative of data sharing between threads by taking advantage of hardware available on modern commercial microprocessors. We use these to build a race detector that is only enabled when it is likely that inter-thread data sharing is occurring. When little sharing takes place, this demand-driven analysis is much faster than contemporary continuous-analysis tools without a large loss of detection accuracy. We modified the race detector in Intel® Inspector XE to utilize our hardware-based sharing indicator and were able to achieve performance increases of 3× and 10× in two parallel benchmark suites and 51× for one particular program.
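To make the demand-driven idea concrete, here is a minimal Python sketch of the enable/disable loop: a stubbed hardware counter of cache-to-cache transfers gates the expensive analysis. The counter read, threshold, and function names are illustrative assumptions, not details taken from the paper.

```python
import random

SHARING_THRESHOLD = 10  # hypothetical events-per-window trigger, not from the paper

def read_hitm_counter():
    # Stub for reading a hardware performance counter of cache-to-cache
    # transfers (e.g., HITM events), which indicate inter-thread sharing.
    return random.randint(0, 20)

def analyze_window(accesses):
    # Placeholder for the heavyweight software race detector
    # (e.g., a happens-before checker over the recorded accesses).
    print(f"race analysis enabled for {len(accesses)} accesses")

def run(windows):
    # Enable the expensive analysis only for windows in which the counter
    # suggests inter-thread data sharing actually occurred.
    for accesses in windows:
        if read_hitm_counter() >= SHARING_THRESHOLD:
            analyze_window(accesses)
        # otherwise run uninstrumented: the common, fast case

run([["load x", "store x"], ["load y"]])
```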
Citations: 43
Automatic abstraction and fault tolerance in cortical microarchitectures
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000066
Atif Hashmi, H. Berry, O. Temam, Mikko H. Lipasti
Recent advances in the neuroscientific understanding of the brain are bringing about a tantalizing opportunity for building synthetic machines that perform computation in ways that differ radically from traditional Von Neumann machines. These brain-like architectures, which are premised on our understanding of how the human neocortex computes, are highly fault-tolerant, averaging results over large numbers of potentially faulty components, yet manage to solve very difficult problems more reliably than traditional algorithms. A key principle of operation for these architectures is that of automatic abstraction: independent features are extracted from highly disordered inputs and are used to create abstract invariant representations of the external entities. This feature extraction is applied hierarchically, leading to increasing levels of abstraction at higher levels in the hierarchy. This paper describes and evaluates a biologically plausible computational model for this process, and highlights the inherent fault tolerance of the biologically-inspired algorithm. We introduce a stuck-at fault model for such cortical networks, and describe how this model maps to hardware faults that can occur on commodity GPGPU cores used to realize the model in software. We show experimentally that the model software implementation can intrinsically preserve its functionality in the presence of faulty hardware, without requiring any reprogramming or recompilation. This model is a first step towards developing a comprehensive and biologically plausible understanding of the computational algorithms and microarchitecture of computing systems that mimic the human cortex, and to applying them to the robust implementation of tasks on future computing systems built of faulty components.
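A toy illustration of why averaging over many redundant units tolerates stuck-at faults, the fault model this abstract describes; the linear units, weights, and fault values below are assumptions for demonstration, not the paper's cortical model.

```python
def column_output(x, weights, stuck_faults):
    # Average many redundant unit responses; units listed in stuck_faults
    # are pinned to a constant output, mimicking stuck-at hardware defects.
    responses = [
        stuck_faults[i] if i in stuck_faults else w * x
        for i, w in enumerate(weights)
    ]
    return sum(responses) / len(responses)

weights = [1.0] * 100
print(column_output(0.5, weights, {}))                # fault-free: 0.5
print(column_output(0.5, weights, {3: 0.0, 7: 0.0}))  # 2% stuck-at-0: 0.49
```

With 2 of 100 units stuck, the averaged output moves only 2%, which is the graceful degradation the abstract attributes to the biologically-inspired algorithm.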
Citations: 54
Energy-efficient cache design using variable-strength error-correcting codes
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000118
Alaa R. Alameldeen, I. Wagner, Zeshan A. Chishti, Wei Wu, C. Wilkerson, Shih-Lien Lu
Voltage scaling is one of the most effective mechanisms to improve microprocessors' energy efficiency. However, processors cannot operate reliably below a minimum voltage, Vccmin, since hardware structures may fail. Cell failures in large memory arrays (e.g., caches) typically determine Vccmin for the whole processor. We observe that most cache lines exhibit zero or one failures at low voltages. However, a few lines, especially in large caches, exhibit multi-bit failures and increase Vccmin. Previous solutions either significantly reduce cache capacity to enable uniform error correction across all lines, or significantly increase latency and bandwidth overheads when amortizing the cost of error-correcting codes (ECC) over large lines. In this paper, we propose a novel cache architecture that uses variable-strength error-correcting codes (VS-ECC). In the common case, lines with zero or one failures use a simple and fast ECC. A small number of lines with multi-bit failures use a strong multi-bit ECC that requires some additional area and latency. We present a novel dynamic cache characterization mechanism to determine which lines will exhibit multi-bit failures. In particular, we use multi-bit correction to protect a fraction of the cache after switching to low voltage, while dynamically testing the remaining lines for multi-bit failures. Compared to prior multi-bit-correcting proposals, VS-ECC significantly reduces power and energy, avoids significant reductions in cache capacity, incurs little area overhead, and avoids large increases in latency and bandwidth.
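A minimal sketch of the classification step that variable-strength ECC relies on: the common case (at most one failing bit per line) keeps a fast code, and only the rare multi-bit lines pay for the strong code. The scheme names are placeholders, and the paper's dynamic characterization mechanism is reduced here to a given per-line failure count.

```python
def assign_ecc(failure_counts):
    # Map per-line failure counts from low-voltage characterization to a
    # code strength: <=1 failing bit keeps the simple, fast code; the few
    # multi-bit lines get the slower, stronger multi-bit code.
    return {
        line: "fast_ecc" if failures <= 1 else "strong_multibit_ecc"
        for line, failures in failure_counts.items()
    }

print(assign_ecc({0: 0, 1: 1, 2: 3}))
# {0: 'fast_ecc', 1: 'fast_ecc', 2: 'strong_multibit_ecc'}
```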
Citations: 165
FabScalar: Composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000067
N. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, H. H. Najaf-abadi, E. Rotenberg
A growing body of work has compiled a strong case for the single-ISA heterogeneous multi-core paradigm. A single-ISA heterogeneous multi-core provides multiple, differently-designed superscalar core types that can streamline the execution of diverse programs and program phases. No prior research has addressed the “Achilles' heel” of this paradigm: design and verification effort is multiplied by the number of different core types. This work frames superscalar processors in a canonical form, so that it becomes feasible to quickly design many cores that differ in the three major superscalar dimensions: superscalar width, pipeline depth, and sizes of structures for extracting instruction-level parallelism (ILP). From this idea, we develop a toolset, called FabScalar, for automatically composing the synthesizable register-transfer-level (RTL) designs of arbitrary cores within a canonical superscalar template. The template defines canonical pipeline stages and interfaces among them. A Canonical Pipeline Stage Library (CPSL) provides many implementations of each canonical pipeline stage, that differ in their superscalar width and depth of sub-pipelining. An RTL generation tool uses the template and CPSL to automatically generate an overall core of desired configuration. Validation experiments are performed along three fronts to evaluate the quality of RTL designs generated by FabScalar: functional and performance (instructions-per-cycle (IPC)) validation, timing validation (cycle time), and confirmation of suitability for standard ASIC flows. With FabScalar, a chip with many different superscalar core types is conceivable.
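A sketch of the composition idea under simplifying assumptions: a stage library keyed by (width, depth) with invented module names stands in for the CPSL, and the generated "core" is just the list of selected stage implementations; the real toolset also instantiates the canonical interfaces between stages.

```python
# Hypothetical stage library keyed by (superscalar width, sub-pipeline depth);
# the module names are invented for illustration.
STAGE_LIBRARY = {
    "fetch":  {(2, 1): "Fetch_w2_d1",  (4, 2): "Fetch_w4_d2"},
    "rename": {(2, 1): "Rename_w2_d1", (4, 2): "Rename_w4_d2"},
    "issue":  {(2, 1): "Issue_w2_d1",  (4, 2): "Issue_w4_d2"},
}

def compose_core(width, depth):
    # Select one implementation of each canonical stage for the requested
    # configuration; every differently-configured core reuses the library.
    return [impls[(width, depth)] for impls in STAGE_LIBRARY.values()]

print(compose_core(4, 2))  # ['Fetch_w4_d2', 'Rename_w4_d2', 'Issue_w4_d2']
```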
Citations: 121
CRIB: Consolidated rename, issue, and bypass
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000068
Erika Gunadi, Mikko H. Lipasti
Conventional high-performance processors utilize register renaming and complex broadcast-based scheduling logic to steer instructions into a small number of heavily-pipelined execution lanes. This requires multiple complex structures and repeated dependency resolution, imposing a significant dynamic power overhead. This paper advocates in-place execution of instructions, a power-saving, pipeline-free approach that consolidates rename, issue, and bypass logic into one structure - the CRIB - while simultaneously eliminating the need for a multiported register file, instead storing architected state in a simple rank of latches. CRIB achieves the high IPC of an out-of-order machine while keeping the execution core clean, simple, and low power. The datapath within a CRIB structure is purely combinational, eliminating most of the clocked elements in the core while keeping a fully synchronous yet high-frequency design. Experimental results match the IPC and cycle time of a baseline out-of-order design while reducing dynamic energy consumption by more than 60% in affected structures.
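A deliberately sequential toy model of in-place execution, assuming a simple (op, destination, sources) instruction format: architected state flows through the entries in program order and destinations are overwritten in place, with no rename map or multiported register file. The real CRIB evaluates this combinationally in hardware; this loop only mirrors the dataflow.

```python
import operator

def crib_execute(entries, arch_state):
    # Architected state enters the top of the CRIB and flows through the
    # entries in program order; each entry overwrites its destination
    # register in place, so no rename map is consulted.
    state = dict(arch_state)
    for op, dst, srcs in entries:
        state[dst] = op(*(state[s] for s in srcs))
    return state

print(crib_execute(
    [(operator.add, "r1", ("r1", "r2")),   # r1 = r1 + r2
     (operator.mul, "r3", ("r1", "r1"))],  # r3 = r1 * r1 (sees the updated r1)
    {"r1": 2, "r2": 3, "r3": 0},
))  # {'r1': 5, 'r2': 3, 'r3': 25}
```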
Citations: 13
Moguls: A model to explore the memory hierarchy for bandwidth improvements
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000109
Guangyu Sun, C. Hughes, Changkyu Kim, Jishen Zhao, Cong Xu, Yuan Xie, Yen-kuang Chen
In recent years, the increasing number of processor cores and limited increases in main memory bandwidth have led to the problem of the bandwidth wall, where memory bandwidth is becoming a performance bottleneck. This is especially true for emerging latency-insensitive, bandwidth-sensitive applications. Designing the memory hierarchy for a platform with an emphasis on maximizing bandwidth within a fixed power budget becomes one of the key challenges. To help architects quickly explore the design space of memory hierarchies, we propose an analytical performance model called Moguls. The Moguls model estimates the performance of an application on a system, using the bandwidth demand of the application for a range of cache capacities and the bandwidth provided by the system with those capacities. We show how to extend this model with appropriate approximations to optimize a cache hierarchy under a power constraint. The results show how many levels of cache should be designed, and what the capacity, bandwidth, and technology of each level should be. In addition, we study memory hierarchy design with hybrid memory technologies, which shows the benefits of using multiple technologies for future computing systems.
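One illustrative reading of how a demand-versus-supply model of this kind can be combined into a performance estimate (the actual Moguls equations are in the paper); the level names and bandwidth numbers below are invented.

```python
def fraction_of_peak(demand_gbs, supply_gbs):
    # At each hierarchy level, compare the bandwidth the application
    # demands (given that level's capacity) with what the system supplies;
    # the worst supply/demand ratio bounds the achievable fraction of peak.
    worst = min(supply_gbs[level] / demand_gbs[level] for level in demand_gbs)
    return min(1.0, worst)

demand = {"L2": 40.0, "L3": 25.0, "mem": 12.0}  # GB/s demanded per level (invented)
supply = {"L2": 80.0, "L3": 30.0, "mem": 6.0}   # GB/s provided per level (invented)
print(fraction_of_peak(demand, supply))  # 0.5 -- memory bandwidth is the bottleneck
```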
Citations: 35
The role of optics in future high radix switch design
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000116
N. Binkert, A. Davis, N. Jouppi, M. McLaren, Naveen Muralimanohar, R. Schreiber, Jung Ho Ahn
For large-scale networks, high-radix switches reduce hop and switch count, which decreases latency and power. The ITRS projections for signal-pin count and per-pin bandwidth are nearly flat over the next decade, so increased radix in electronic switches will come at the cost of less per-port bandwidth. Silicon nanophotonic technology provides a long-term solution to this problem. We first compare the use of photonic I/O against an all-electrical, Cray YARC inspired baseline. We compare the power and performance of switches of radix 64, 100, and 144 in the 45, 32, and 22 nm technology steps. In addition, with the greater off-chip bandwidth enabled by photonics, the high power of electrical components inside the switch becomes a problem beyond radix 64. We propose an optical switch architecture that exploits high-speed optical interconnects to build a flat crossbar with multiple-writer, single-reader links. Unlike YARC, which uses small buffers at various stages, the proposed design buffers only at input and output ports. This simplifies the design and enables large buffers, capable of handling ethernet-size packets. To mitigate head-of-line blocking and maximize switch throughput, we use an arbitration scheme that allows each port to make eight requests and use two grants. The bandwidth of the optical crossbar is also doubled to provide a 2x internal speedup. Since optical interconnects have high static power, we show that it is critical to balance the use of optical and electrical components to get the best energy efficiency. Overall, the adoption of photonic I/O allows 100,000 port networks to be constructed with less than one third the power of equivalent all-electronic networks. A further 50% reduction in power can be achieved by using photonics within the switch components. Our best optical design performs similarly to YARC for small packets while consuming less than half the power, and handles 80% more load for large message traffic.
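A simplified single-round model of the eight-request/two-grant arbitration described above; random tie-breaking and the data structures are assumptions for illustration, not the paper's arbiter design.

```python
import random

def arbitrate(requests, grants_per_port=2):
    # One arbitration round: `requests` maps an input port to the list of
    # output ports it nominates (capped at eight). Each contested output
    # grants one requester; an input keeps at most two grants, matching
    # the 2x internal speedup of the optical crossbar.
    by_output = {}
    for inp, outs in requests.items():
        for out in outs[:8]:
            by_output.setdefault(out, []).append(inp)
    grants = {inp: [] for inp in requests}
    for out, contenders in by_output.items():
        winner = random.choice(contenders)  # random tie-break (an assumption)
        if len(grants[winner]) < grants_per_port:
            grants[winner].append(out)
    return grants

# Output varies with the random tie-break between in0 and in1 on out0.
print(arbitrate({"in0": ["out0", "out1", "out2"], "in1": ["out0"]}))
```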
Citations: 87
Sampling + DMR: Practical and low-overhead permanent fault detection
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000089
Shuou Nomura, Matthew D. Sinclair, C. Ho, Venkatraman Govindaraju, M. Kruijf, K. Sankaralingam
With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy (DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (for example, 1% of the time) of each periodic execution window (for example, 5 million cycles). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100% because it can detect all faults eventually. Sampling-DMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2%.
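The sampling argument has a simple probabilistic flavor: if each window's DMR sample exposes an active permanent fault with some probability p, detection latency is geometric, so coverage approaches 100% as windows accumulate. The sketch below assumes independent windows and a constant p, which is a simplification of the paper's model.

```python
def detection_probability(p_per_window, windows):
    # Probability a permanent fault has been caught within `windows`
    # periodic windows, given per-window detection probability p:
    # the complement of missing it in every window.
    return 1.0 - (1.0 - p_per_window) ** windows

for k in (1, 10, 100, 1000):
    print(k, round(detection_probability(0.01, k), 4))
# 1 0.01 / 10 0.0956 / 100 0.634 / 1000 1.0 (actually ~0.99996)
```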
Citations: 53
Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000076
B. Cuesta, Alberto Ros, M. E. Gómez, A. Robles, J. Duato
To meet the demand for more powerful high-performance shared-memory servers, multiprocessor systems must incorporate efficient and scalable cache coherence protocols, such as those based on directory caches. However, the limited directory cache size of the increasingly larger systems may cause frequent evictions of directory entries and, consequently, invalidations of cached blocks, which severely degrades system performance. A significant percentage of the referred memory blocks are only accessed by one processor (even in parallel applications) and, therefore, do not require coherence maintenance. Taking advantage of techniques that dynamically identify those private blocks, we propose to deactivate the coherence protocol for them and to treat them as uniprocessor systems do. The protocol deactivation allows directory caches to omit the tracking of an appreciable quantity of blocks, which reduces their load and increases their effective size. Since the operating system collaborates on the detection of private blocks, our proposal only requires minor modifications. Simulation results show that, thanks to our proposal, directory caches can avoid the tracking of about 57% of the accessed blocks and their capacity can be better exploited. These savings either shorten the runtime of parallel applications by 15% for the same directory cache size, or maintain system performance with directory caches 8 times smaller.
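A sketch of OS-assisted first-touch classification, one plausible realization of the private-block detection the abstract describes; the page-table representation and the point at which coherence is re-activated are assumptions for illustration.

```python
def on_access(page_table, page, cpu):
    # First-touch classification kept alongside page-table state: a page
    # starts out private to its first accessor and needs no directory
    # tracking; a touch by any other CPU marks it shared, at which point
    # coherence must be (re)activated for its blocks.
    entry = page_table.setdefault(page, {"owner": cpu, "shared": False})
    if entry["owner"] != cpu:
        entry["shared"] = True
    return entry["shared"]

pt = {}
print(on_access(pt, 0x1000, cpu=0))  # False -- private, coherence deactivated
print(on_access(pt, 0x1000, cpu=0))  # False -- still private to cpu 0
print(on_access(pt, 0x1000, cpu=1))  # True  -- now shared, coherence enabled
```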
Citations: 145
Kilo-NOC: A heterogeneous network-on-chip architecture for scalability and service guarantees
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000112
Boris Grot, Joel Hestness, S. Keckler, O. Mutlu
Today's chip-level multiprocessors (CMPs) feature up to a hundred discrete cores, and with increasing levels of integration, CMPs with hundreds of cores, cache tiles, and specialized accelerators are anticipated in the near future. In this paper, we propose and evaluate technologies to enable networks-on-chip (NOCs) to support a thousand connected components (Kilo-NOC) with high area and energy efficiency, good performance, and strong quality-of-service (QOS) guarantees. Our analysis shows that QOS support burdens the network with high area and energy costs. In response, we propose a new lightweight topology-aware QOS architecture that provides service guarantees for applications such as consolidated servers on CMPs and real-time SOCs. Unlike prior NOC quality-of-service proposals which require QOS support at every network node, our scheme restricts the extent of hardware support to portions of the die, reducing router complexity in the rest of the chip. We further improve network area- and energy-efficiency through a novel flow control mechanism that enables a single-network, low-cost elastic buffer implementation. Together, these techniques yield a heterogeneous Kilo-NOC architecture that consumes 45% less area and 29% less power than a state-of-the-art QOS-enabled NOC without these features.
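A back-of-the-envelope illustration of why confining QOS hardware to part of the die saves area: router cost then grows with the size of the QOS region rather than with the whole network. All cost numbers below are invented, not the paper's results.

```python
def total_router_cost(num_routers, qos_routers, base_cost=1.0, qos_extra=0.5):
    # Every router pays the base cost; only routers inside the designated
    # QOS region carry the extra QOS machinery (per-flow state, arbiters).
    return num_routers * base_cost + qos_routers * qos_extra

print(total_router_cost(1024, 1024))  # QOS at every node: 1536.0
print(total_router_cost(1024, 64))    # topology-aware QOS region: 1056.0
```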
Citations: 168