
Latest publications: 2007 IEEE 13th International Symposium on High Performance Computer Architecture

A Burst Scheduling Access Reordering Mechanism
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346206
Jun Shao, B. Davis
Utilizing the nonuniform latencies of SDRAM devices, access reordering mechanisms alter the sequence of main memory access streams to reduce the observed access latency. Using a revised M5 simulator with an accurate SDRAM module, the burst scheduling access reordering mechanism is proposed and compared to conventional in-order memory scheduling as well as existing academic and industrial access reordering mechanisms. With burst scheduling, memory accesses to the same rows of the same banks are clustered into bursts to maximize bus utilization of the SDRAM device. Subject to a static threshold, memory reads are allowed to preempt ongoing writes for reduced read latency, while qualified writes are piggybacked at the end of bursts to exploit row locality in writes and prevent write queue saturation. Performance improvements contributed by read preemption and write piggybacking are identified. Simulation results show that burst scheduling reduces the average execution time of selected SPEC CPU2000 benchmarks by 21% over conventional bank-in-order memory scheduling. Burst scheduling also outperforms Intel's patented out-of-order memory scheduling and the row-hit access reordering mechanism by 11% and 6%, respectively.
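As an illustration of the clustering, read-preemption, and write-piggybacking policies the abstract describes, here is a heavily simplified toy scheduler. The class name, queue layout, and threshold value are all invented for illustration; the actual mechanism is implemented inside a cycle-accurate M5 SDRAM model, not in software.

```python
from collections import deque


class BurstScheduler:
    """Toy model: cluster pending accesses to the same (bank, row) into bursts.

    Reads may preempt queued writes once enough reads are waiting (a static
    threshold, as in the paper), and writes to the currently open row are
    piggybacked onto the end of a read burst.
    """

    def __init__(self, preempt_threshold=4):
        self.read_q = deque()    # entries are (bank, row)
        self.write_q = deque()
        self.preempt_threshold = preempt_threshold

    def add(self, is_read, bank, row):
        (self.read_q if is_read else self.write_q).append((bank, row))

    def next_burst(self):
        """Return the next burst: all queued accesses to one (bank, row)."""
        # Reads take over when the write queue is empty or enough reads wait.
        if self.read_q and (not self.write_q
                            or len(self.read_q) >= self.preempt_threshold):
            q = self.read_q
        elif self.write_q:
            q = self.write_q
        else:
            return []
        bank, row = q[0]
        burst = [r for r in q if r == (bank, row)]
        for r in burst:
            q.remove(r)
        # Piggyback queued writes to the same open row at the end of the burst.
        piggy = [w for w in self.write_q if w == (bank, row)]
        for w in piggy:
            self.write_q.remove(w)
        return burst + piggy
```

With `preempt_threshold=2`, two reads and one write to bank 0, row 5 come out as a single three-access burst, amortizing the row activation.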
Citations: 120
Interconnect-Centric Computing
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346179
W. Dally
Summary form only given. As we enter the many-core era, the interconnection networks of a computer system, rather than the processor or memory modules, will dominate its performance. Several recent developments in interconnection network architecture including global adaptive routing, high-radix routers, and technology-matched topologies offer large improvements in the performance and efficiency of this critical component. The implementation of a portion of several interconnection networks on multi-core chips also raises new opportunities and challenges for network design. This talk explores the role of interconnection networks in modern computer systems, recent developments in network architecture and design, and the challenges of on-chip interconnection networks. Examples will be drawn from several systems including the Cray BlackWidow
Citations: 6
An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346180
H. Dybdahl, P. Stenström
The significant speed gap between processor and memory and the limited chip memory bandwidth make last-level cache performance crucial for future chip multiprocessors. To use the capacity of shared last-level caches efficiently and to allow for a short access time, proposed non-uniform cache architectures (NUCAs) are organized into per-core partitions. If a core runs out of cache space, blocks are typically relocated to nearby partitions, thus managing the cache as a shared cache. This uncontrolled sharing of all resources may unfortunately result in pollution that degrades performance. We propose a novel non-uniform cache architecture in which the amount of cache space that can be shared among the cores is controlled dynamically. The adaptive scheme continuously estimates the effect of increasing or decreasing the shared partition size on overall performance. We show that our scheme outperforms a private and a shared cache organization as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighboring core partitions.
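The adaptive boundary adjustment can be caricatured as a periodic decision that moves the private/shared boundary one way toward whichever side is estimated to gain more hits. The function name and the two benefit estimates are hypothetical stand-ins for the counters the paper's hardware scheme maintains continuously.

```python
def adjust_partition(private_ways, min_ways, max_ways,
                     private_benefit, shared_benefit):
    """Toy epoch-based controller for a per-core partition boundary.

    private_benefit: estimated extra hits from giving this core one more way.
    shared_benefit:  estimated hits other cores would gain from that way.
    Returns the new number of private ways for this core.
    """
    if private_benefit > shared_benefit and private_ways < max_ways:
        return private_ways + 1   # grow the private partition
    if shared_benefit > private_benefit and private_ways > min_ways:
        return private_ways - 1   # yield a way to the shared pool
    return private_ways           # estimates tie, or boundary is pinned
```

A core whose marginal way earns 100 hits while sharers would earn 50 grows by one way; the opposite imbalance shrinks it, so the partition tracks the workload's phase behavior.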
Citations: 140
A Scalable, Non-blocking Approach to Transactional Memory
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346189
Hassan Chafi, J. Casper, Brian D. Carlstrom, Austen McDonald, C. Minh, Woongki Baek, C. Kozyrakis, K. Olukotun
Transactional memory (TM) provides mechanisms that promise to simplify parallel programming by eliminating the need for locks and their associated problems (deadlock, livelock, priority inversion, convoying). For TM to be adopted in the long term, not only does it need to deliver on these promises, but it needs to scale to a high number of processors. To date, proposals for scalable TM have relegated livelock issues to user-level contention managers. This paper presents the first scalable TM implementation for directory-based distributed shared memory systems that is livelock-free without the need for user-level intervention. The design is a scalable implementation of optimistic concurrency control that supports parallel commits with a two-phase commit protocol, uses write-back caches, and filters coherence messages. The scalable design is based on transactional coherence and consistency (TCC), which supports continuous transactions and fault isolation. A performance evaluation of the design using both scientific and enterprise benchmarks demonstrates that the directory-based TCC design scales efficiently for NUMA systems up to 64 processors.
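A toy software model of the optimistic concurrency control the abstract describes — buffered writes, commit-time validation of the read set, abort on conflict — might look like this. All names are illustrative; the real design is a hardware coherence protocol with a parallel two-phase commit, not a software runtime.

```python
class TxnSystem:
    """Toy optimistic-concurrency transactions in the spirit of TCC."""

    def __init__(self):
        self.memory = {}        # committed state: addr -> value
        self.commit_log = []    # addresses written by committed transactions

    def begin(self):
        # A transaction records where the commit log stood when it started.
        return {"start": len(self.commit_log), "reads": set(), "writes": {}}

    def read(self, txn, addr):
        txn["reads"].add(addr)
        # Read-your-own-writes, else committed memory (default 0).
        return txn["writes"].get(addr, self.memory.get(addr, 0))

    def write(self, txn, addr, value):
        txn["writes"][addr] = value    # buffered until commit

    def commit(self, txn):
        # Validate: abort if any address we read was committed by another
        # transaction after we began; otherwise apply writes atomically.
        for addr in self.commit_log[txn["start"]:]:
            if addr in txn["reads"]:
                return False           # conflict -> abort, caller retries
        for addr, value in txn["writes"].items():
            self.memory[addr] = value
            self.commit_log.append(addr)
        return True
```

Two concurrent increments of the same counter illustrate the conflict path: the first commit succeeds, the second fails validation and must retry.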
Citations: 149
Implications of Device Timing Variability on Full Chip Timing
Pub Date : 2007-02-10 DOI: 10.1145/1353629.1353644
M. Annavaram, Edward T. Grochowski, P. Reed
As process technologies continue to scale, the magnitude of within-die device parameter variations is expected to increase and may lead to significant timing variability. This paper presents a quantitative evaluation of how low-level device timing variations impact the timing at the functional block level. We evaluate two types of timing variations: random and systematic variations. The study introduces random and systematic timing variations to several functional blocks in the Intel Core Duo microprocessor design database and measures the resulting timing margins. The primary conclusion of this research is that, as a result of combining two probability distributions (the distribution of the random variation and the distribution of path timing margins), functional block timing margins degrade non-linearly with increasing variability.
Citations: 10
Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346199
Nathan Clark, Amir Hormati, S. Yehia, S. Mahlke, K. Flautner
Microprocessor designers commonly utilize SIMD accelerators and their associated instruction set extensions to provide substantial performance gains at a relatively low cost for media applications. One of the most difficult problems with using SIMD accelerators is forward migration to newer generations. With larger hardware budgets and more demands for performance, SIMD accelerators evolve with both larger data widths and increased functionality with each new generation. However, this causes difficult problems in terms of binary compatibility, software migration costs, and expensive redesign of the instruction set architecture. In this work, we propose Liquid SIMD to decouple the instruction set architecture from the SIMD accelerator. SIMD instructions are expressed using a processor's baseline scalar instruction set, and lightweight dynamic translation maps the representation onto a broad family of SIMD accelerators. Liquid SIMD effectively bypasses the problems inherent to instruction set modification and binary compatibility across accelerator generations. We provide a detailed description of changes to a compilation framework and processor pipeline needed to support this abstraction. Additionally, we show that the hardware overhead of dynamic optimization is modest, hardware changes do not affect the cycle time of the processor, and the performance impact of abstracting the SIMD accelerator is negligible. We conclude that using dynamic techniques to map instructions onto SIMD accelerators is an effective way to improve computation efficiency, without the overhead associated with modifying the instruction set.
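The width-agnostic mapping idea can be illustrated with a toy translator that splits a scalar add loop into vector operations of whatever width the underlying accelerator provides, plus a scalar remainder loop. The operation encoding and function name are invented for illustration; the point is that the same scalar representation maps onto accelerators of different widths.

```python
def map_to_simd(n_elems, vector_width):
    """Toy dynamic mapping: rewrite an n-element scalar add loop as
    vector ops of the target accelerator's width plus a scalar tail.

    Returns a list of ("vadd", start_index, width) and ("add", index) ops.
    """
    main = n_elems - n_elems % vector_width       # elements covered by vectors
    ops = [("vadd", i, vector_width) for i in range(0, main, vector_width)]
    ops += [("add", i) for i in range(main, n_elems)]
    return ops
```

The same 10-element loop maps to two 4-wide ops plus two scalar ops on one accelerator generation, and to one 8-wide op plus two scalar ops on the next, without changing the "binary".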
Citations: 49
MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346205
Guru Venkataramani, Brandyn Roemer, Yan Solihin, Milos Prvulović
Memory bugs are a broad class of bugs that is becoming increasingly common with increasing software complexity, and many of these bugs are also security vulnerabilities. Unfortunately, existing software and even hardware approaches for finding and identifying memory bugs have considerable performance overheads, target only a narrow class of bugs, are costly to implement, or use computational resources inefficiently. This paper describes MemTracker, a new hardware support mechanism that can be configured to perform different kinds of memory access monitoring tasks. MemTracker associates each word of data in memory with a few bits of state, and uses a programmable state transition table to react to different events that can affect this state. The number of state bits per word, the events to which MemTracker reacts, and the transition table are all fully programmable. MemTracker's rich set of states, events, and transitions can be used to implement different monitoring and debugging checkers with minimal performance overheads, even when frequent state updates are needed. To evaluate MemTracker, we map three different checkers onto it, as well as a checker that combines all three. For the most demanding (combined) checker, we observe performance overheads of only 2.7% on average and 4.8% worst-case on SPEC 2000 applications. Such low overheads allow continuous (always-on) use of MemTracker-enabled checkers even in production runs.
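The programmable (event, state) → (new state, error) table is straightforward to sketch. The table below encodes a toy allocate/initialize/read checker with invented state and event names; the real mechanism keeps a few state bits per memory word in hardware and consults the table on every access.

```python
# Toy per-word states (in hardware, a few bits per word).
UNALLOC, UNINIT, INIT = 0, 1, 2

# Programmable transition table: (event, state) -> (new_state, raise_error).
# This particular table flags reads of unallocated or uninitialized words.
TABLE = {
    ("alloc", UNALLOC): (UNINIT, False),
    ("write", UNINIT):  (INIT,   False),
    ("write", INIT):    (INIT,   False),
    ("read",  INIT):    (INIT,   False),
    ("read",  UNINIT):  (UNINIT, True),   # read before first write
    ("read",  UNALLOC): (UNALLOC, True),  # read of unallocated word
    ("free",  UNINIT):  (UNALLOC, False),
    ("free",  INIT):    (UNALLOC, False),
}


class MemChecker:
    """Drive the table with a stream of (event, address) pairs."""

    def __init__(self):
        self.state = {}  # word address -> state bits

    def event(self, kind, addr):
        s = self.state.get(addr, UNALLOC)
        # Unlisted (event, state) pairs are treated as errors here; a real
        # table would enumerate every combination.
        new_s, err = TABLE.get((kind, s), (s, True))
        self.state[addr] = new_s
        return err
```

Reprogramming `TABLE` yields a different checker (e.g. heap corruption or return-address protection) on the same hardware, which is the core of the paper's flexibility argument.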
Citations: 129
Application-Level Correctness and its Impact on Fault Tolerance
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346196
Xuanhua Li, D. Yeung
Traditionally, fault tolerance researchers have required architectural state to be numerically perfect for program execution to be correct. However, in many programs, even if execution is not 100% numerically correct, the program can still appear to execute correctly from the user's perspective. Hence, whether a fault is unacceptable or benign may depend on the level of abstraction at which correctness is evaluated, with more faults being benign at higher levels of abstraction, i.e. at the user or application level, compared to lower levels of abstraction, i.e. at the architecture level. The extent to which programs are more fault resilient at higher levels of abstraction is application dependent. Programs that produce inexact and/or approximate outputs can be very resilient at the application level. We call such programs soft computations, and we find they are common in multimedia workloads, as well as artificial intelligence (AI) workloads. Programs that compute exact numerical outputs offer less error resilience at the application level. However, we find all programs studied in this paper exhibit some enhanced fault resilience at the application level, including those that are traditionally considered exact computations - e.g., SPECInt CPU2000. This paper investigates definitions of program correctness that view correctness from the application's standpoint rather than the architecture's standpoint. Under application-level correctness, a program's execution is deemed correct as long as the result it produces is acceptable to the user. To quantify user satisfaction, we rely on application-level fidelity metrics that capture user-perceived program solution quality. We conduct a detailed fault susceptibility study that measures how much more fault resilient programs are when defining correctness at the application level compared to the architecture level. Our results show for 6 multimedia and AI benchmarks that 45.8% of architecturally incorrect faults are correct at the application level. For 3 SPECInt CPU2000 benchmarks, 17.6% of architecturally incorrect faults are correct at the application level. We also present a lightweight fault recovery mechanism that exploits the relaxed requirements on numerical integrity provided by application-level correctness to reduce checkpoint cost. Our lightweight fault recovery mechanism successfully recovers 66.3% of program crashes in our multimedia and AI workloads, while incurring minimal runtime overhead.
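The notion of application-level correctness can be made concrete with a toy fidelity check: a faulty run's output is accepted if a user-level metric stays under a tolerance, even when the output is not bit-exact. The metric here (mean relative error) and the threshold are assumed examples; the paper uses application-specific fidelity metrics.

```python
def app_level_correct(golden, faulty, tol=0.01):
    """Accept a fault-affected output if its mean relative error versus the
    fault-free ("golden") output is within a user-chosen tolerance."""
    err = sum(abs(g - f) / (abs(g) or 1.0)   # guard division for zero entries
              for g, f in zip(golden, faulty)) / len(golden)
    return err <= tol
```

A bit flip that perturbs one pixel of a decoded frame by 0.5% passes this check (benign at the application level), while the same flip in a loop bound that corrupts half the output fails it.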
Citations: 173
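The distinction the abstract draws between architecture-level and application-level correctness can be illustrated with a toy fault-injection experiment: a single flipped bit makes the output numerically wrong, yet a fidelity metric still accepts it. This is a minimal sketch under assumed names (`blur` as a stand-in soft computation, a hypothetical MAE threshold of 1.0); the paper's benchmarks and fidelity metrics are far more elaborate.

```python
def blur(row):
    # Toy "soft computation": 3-point moving average over a pixel row.
    out = []
    for i in range(len(row)):
        window = row[max(0, i - 1):i + 2]
        out.append(sum(window) / len(window))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

golden_in = [100] * 16
faulty_in = list(golden_in)
faulty_in[7] ^= 1            # inject a single-bit fault: 100 -> 101

golden_out = blur(golden_in)
faulty_out = blur(faulty_in)

# Architecture-level correctness: every output value must match exactly.
arch_correct = golden_out == faulty_out                       # False

# Application-level correctness: output quality acceptable to the user,
# approximated here by a (hypothetical) fidelity threshold on MAE.
app_correct = mean_abs_error(golden_out, faulty_out) < 1.0    # True
```

The fault is architecturally incorrect but application-level correct — exactly the class of benign faults the paper's lightweight recovery mechanism declines to roll back.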
A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures CMP架构中一种低开销容错一致性协议
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346194
Ricardo Fernández Pascual, José M. García, M. Acacio, J. Duato
It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, chip-multiprocessors (CMP) that integrate several processor cores in a single chip are nowadays the best alternative for making more efficient use of the increasing number of transistors that can be placed in a single die. Hence, it is necessary to design new techniques to deal with these faults in order to build sufficiently reliable CMPs. In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, under the assumption that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using the GEMS full-system simulator, we compare our proposal against a similar protocol without fault tolerance (TOKENCMP). We show that in the absence of failures our proposal does not increase execution time over TOKENCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world while increasing execution time by no more than 15%.
{"title":"A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures","authors":"Ricardo Fernández Pascual, José M. García, M. Acacio, J. Duato","doi":"10.1109/HPCA.2007.346194","DOIUrl":"https://doi.org/10.1109/HPCA.2007.346194","url":null,"abstract":"It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, chip-multiprocessors (CMP) that integrate several processor cores in a single chip are nowadays the best alternative to more efficient use of the increasing number of transistors that can be placed in a single die. Hence, it is necessary to design new techniques to deal with these faults to be able to build sufficiently reliable chip multiprocessors (CMPs). In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using GEMS full system simulator, we compare our proposal against a similar protocol without fault tolerance (TOKENCMP). We show that in absence of failures our proposal does not introduce overhead in terms of increased execution time over TOKENCMP. 
Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time more than 15%","PeriodicalId":177324,"journal":{"name":"2007 IEEE 13th International Symposium on High Performance Computer Architecture","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125620632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
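The token-counting invariant that the extended protocol must preserve can be sketched in a few lines: each block has a fixed number of tokens, a writer needs all of them, a reader needs at least one, and a dropped message loses the tokens it carried, so a recovery step has to restore the count to avoid deadlock. The names (`Cache`, `recreate_tokens`) and the trivial recovery below are assumptions for illustration; the paper's actual mechanism must additionally guarantee that no data is lost, which this toy does not model.

```python
TOTAL_TOKENS = 4   # a writer needs all tokens; a reader needs at least one

class Cache:
    def __init__(self, name):
        self.name = name
        self.tokens = 0

def transfer(src, dst, n, dropped=False):
    # Send n tokens over an unreliable network; a dropped
    # message loses the tokens it carried.
    src.tokens -= n
    if not dropped:
        dst.tokens += n

def recreate_tokens(owner, caches):
    # Recovery sketch: when missing tokens are detected (e.g. by a
    # timeout), recreate them at the owner so a writer can still
    # collect TOTAL_TOKENS and make progress.
    missing = TOTAL_TOKENS - sum(c.tokens for c in caches)
    owner.tokens += missing
    return missing

c0, c1 = Cache("L1-0"), Cache("L1-1")
c0.tokens = TOTAL_TOKENS              # c0 starts as the owner

transfer(c0, c1, 2, dropped=True)     # fault: 2 tokens vanish in flight
assert sum(c.tokens for c in (c0, c1)) == 2   # invariant broken

recreate_tokens(c0, (c0, c1))
can_write = c0.tokens == TOTAL_TOKENS         # True: no deadlock
```

Without the recovery step, no cache could ever again assemble all four tokens, and any write to the block would stall forever — the deadlock the protocol is designed to rule out.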
Illustrative Design Space Studies with Microarchitectural Regression Models 用微建筑回归模型研究说明性设计空间
Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346211
Benjamin C. Lee, D. Brooks
We apply a scalable approach for practical, comprehensive design space evaluation and optimization. This approach combines design space sampling and statistical inference to identify trends from a sparse simulation of the space. The computational efficiency of sampling and inference enables new capabilities in design space exploration. We illustrate these capabilities using performance and power models for three studies of a 260,000-point design space: (1) Pareto frontier analysis, (2) pipeline depth analysis, and (3) multiprocessor heterogeneity analysis. For each study, we provide an assessment of predictive error and of the sensitivity of observed trends to such error. We construct Pareto frontiers and find that predictions for Pareto optima are no less accurate than those for the broader design space. We reproduce and enhance prior pipeline depth studies, demonstrating that constrained sensitivity studies may not generalize when many other design parameters are held at constant values. Lastly, we identify efficient heterogeneous core designs by clustering per-benchmark optimal architectures. Collectively, these studies motivate the application of statistical inference techniques for more effective use of modern simulator infrastructure.
{"title":"Illustrative Design Space Studies with Microarchitectural Regression Models","authors":"Benjamin C. Lee, D. Brooks","doi":"10.1109/HPCA.2007.346211","DOIUrl":"https://doi.org/10.1109/HPCA.2007.346211","url":null,"abstract":"We apply a scalable approach for practical, comprehensive design space evaluation and optimization. This approach combines design space sampling and statistical inference to identify trends from a sparse simulation of the space. The computational efficiency of sampling and inference enables new capabilities in design space exploration. We illustrate these capabilities using performance and power models for three studies of a 260,000 point design space: (1) Pareto frontier analysis, (2) pipeline depth analysis, and (3) multiprocessor heterogeneity analysis. For each study, we provide an assessment of predictive error and sensitivity of observed trends to such error. We construct Pareto frontiers and find predictions for Pareto optima are no less accurate than those for the broader design space. We reproduce and enhance prior pipeline depth studies, demonstrating constrained sensitivity studies may not generalize when many other design parameters are held at constant values. Lastly, we identify efficient heterogeneous core designs by clustering per benchmark optimal architectures. 
Collectively, these studies motivate the application of techniques in statistical inference for more effective use of modern simulator infrastructure","PeriodicalId":177324,"journal":{"name":"2007 IEEE 13th International Symposium on High Performance Computer Architecture","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133306397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 138
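The sample-then-infer workflow described above can be sketched end to end: simulate a sparse sample of a toy design space, fit a regression model by least squares, predict the full space from the model, and extract a Pareto frontier over predicted delay and measured power. The two-parameter space, the analytic `simulate` stand-in, and the linear feature choice are all hypothetical; the paper's 260,000-point space and its regression models are far richer.

```python
import itertools

# Hypothetical two-parameter design space (depth x cache size); the
# paper's space has far more dimensions and ~260,000 points.
DEPTHS = [1, 2, 3, 4]
CACHES = [1, 2, 3, 4]

def simulate(depth, cache):
    # Stand-in for a detailed cycle-accurate simulator.
    delay = 10.0 / depth + 4.0 / cache
    power = 2.0 * depth + 1.5 * cache
    return delay, power

def solve(A, b):
    # Gaussian elimination with partial pivoting; the normal
    # equations here are only 3x3.
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_delay_model(samples):
    # Least-squares fit of delay ~ c0 + c1/depth + c2/cache
    # via the normal equations (X^T X) c = X^T y.
    rows = [(1.0, 1.0 / d, 1.0 / c) for d, c in samples]
    ys = [simulate(d, c)[0] for d, c in samples]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(A, b)

def pareto_frontier(points):
    # Minimizing both objectives: keep points dominated by no other point.
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q[:2] != p[:2]
                       for q in points)]

# Sparse sample: simulate only 4 of the 16 designs.
samples = [(1, 1), (2, 4), (4, 2), (3, 3)]
c0, c1, c2 = fit_delay_model(samples)

# Inference: predict delay for every design, pair with measured power.
preds = [(c0 + c1 / d + c2 / c, simulate(d, c)[1], (d, c))
         for d, c in itertools.product(DEPTHS, CACHES)]
frontier = pareto_frontier(preds)
```

Because the toy simulator is exactly linear in the chosen features, the fit recovers the generating coefficients and the predicted Pareto frontier matches the true one; with a real simulator the paper's predictive-error assessment step becomes essential.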