
Latest publications: 2011 38th Annual International Symposium on Computer Architecture (ISCA)

Releasing efficient beta cores to market early
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000090
Sangeetha Sudhakrishnan, Rigo Dicochea, Jose Renau
Verification of modern processors is an expensive, time-consuming, and challenging task. Although it is estimated that over half of total design time is spent on verification, we often find processors with bugs released into the market. This paper proposes an architecture that tolerates not just the typically infrequent bugs found in current processors, but a significantly larger set of bugs. The objective is to allow for a much quicker time to market. We propose an architecture built around Beta Cores, which are cores that are only partially verified. Our proposal intelligently activates and deactivates a simple single-issue in-order Checker Core to verify a buggy superscalar out-of-order Beta Core. Our Beta Core Solution (BCS), which includes the Beta Core, the Checker Core, and the logic to detect potentially buggy situations, consumes just 5% more power than the stand-alone Beta Core. We also show that performance is only slightly diminished, with an average slowdown of 1.6%. By leveraging program signatures, our BCS only needs a simple in-order Checker Core, at half the frequency, to verify a complex 4-issue out-of-order Beta Core. The BCS architecture allows for a decrease in verification effort and thus a quicker time to market.
Citations: 5
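The activate/verify/rollback flow described in the abstract lends itself to a compact sketch. Below is a minimal Python model; the signature table (`BUGGY_SIGNATURES`), the injected `mul` bug, and register-file deep copies are illustrative stand-ins for the paper's hardware mechanisms, not the actual BCS design.

```python
# Toy model of the Beta Core Solution (BCS): a partially verified "beta"
# executor runs everything, and a trusted in-order checker re-executes
# only regions flagged by a (hypothetical) bug-signature table.
import copy

BUGGY_SIGNATURES = {"mul"}          # assumed: opcodes the beta core may mishandle

def beta_exec(regs, op, d, a, b):
    # Beta core: fast, but carries a deliberately injected bug in 'mul'.
    if op == "add": regs[d] = regs[a] + regs[b]
    elif op == "mul": regs[d] = regs[a] * regs[b] + (1 if regs[a] == 7 else 0)  # bug

def checker_exec(regs, op, d, a, b):
    # Checker core: simple, slow, fully verified semantics.
    if op == "add": regs[d] = regs[a] + regs[b]
    elif op == "mul": regs[d] = regs[a] * regs[b]

def run(program, regs):
    for region in program:                      # a region = list of instructions
        snapshot = copy.deepcopy(regs)          # checkpoint before a risky region
        risky = any(op in BUGGY_SIGNATURES for op, *_ in region)
        for inst in region:
            beta_exec(regs, *inst)
        if risky:                               # checker activated only here
            check = snapshot
            for inst in region:
                checker_exec(check, *inst)
            if check != regs:                   # mismatch: roll back to verified state
                regs.clear(); regs.update(check)
    return regs

print(run([[("add", "r1", "r2", "r3")], [("mul", "r4", "r0", "r1")]],
          {"r0": 7, "r1": 0, "r2": 3, "r3": 4, "r4": 0}))
```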
OUTRIDER: Efficient memory latency tolerance with decoupled strands
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000079
N. Crago, Sanjay J. Patel
We present Outrider, an architecture for throughput-oriented processors that provides memory latency tolerance to improve performance on highly threaded workloads. Outrider enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams that separate memory-accessing and memory-consuming instructions. The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order microarchitecture. Moreover, instead of adding more threads as is done in modern GPUs, Outrider can tolerate memory latency with fewer threads and reduced contention for resources shared amongst threads. We demonstrate that Outrider can outperform single threaded cores by 23-131% and a 4-way simultaneous multithreaded core by up to 87% on data parallel applications in a 1024-core system. Moreover, Outrider achieves these performance gains without incurring the overhead of additional hardware thread contexts, which results in improved area efficiency compared to a multithreaded core.
Citations: 31
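A rough way to see the decoupling benefit is to model an access strand that runs ahead issuing loads while an execute strand consumes the returned values. The sketch below is an illustrative cycle-count model under assumed latencies and queue depth, not Outrider's actual microarchitecture.

```python
# Illustrative cycle model of decoupled strands: a memory strand issues
# loads ahead of an execute strand, hiding latency without extra threads.
# MEM_LAT, COMPUTE, and QUEUE_DEPTH are assumed parameters.
MEM_LAT, COMPUTE, N_LOADS, QUEUE_DEPTH = 100, 10, 16, 8

def coupled():
    # In-order core: each load stalls the consumer for the full latency.
    return sum(MEM_LAT + COMPUTE for _ in range(N_LOADS))

def decoupled():
    cycles, data_q = 0, []          # data_q holds each load's ready cycle
    issued = consumed = 0
    while consumed < N_LOADS:
        # Memory strand: run ahead while the data queue has room.
        while issued < N_LOADS and len(data_q) < QUEUE_DEPTH:
            data_q.append(cycles + MEM_LAT)
            issued += 1
        # Execute strand: wait (if needed) for the oldest load, then compute.
        cycles = max(cycles, data_q.pop(0)) + COMPUTE
        consumed += 1
    return cycles

print(f"coupled: {coupled()} cycles, decoupled: {decoupled()} cycles")
```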
i-NVMM: A secure non-volatile main memory system with incremental encryption
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000086
Siddhartha Chhabra, Yan Solihin
Emerging technologies for building non-volatile main memory (NVMM) systems suffer from a security vulnerability where information lingers on long after the system is powered down, enabling an attacker with physical access to the system to extract sensitive information off the memory. The goal of this study is to find a solution for such a security vulnerability. We introduce i-NVMM, a data privacy protection scheme for NVMM, where the main memory is encrypted incrementally, i.e. different data in the main memory is encrypted at different times depending on whether the data is predicted to still be useful to the processor. The motivation behind incremental encryption is the observation that the working set of an application is much smaller than its resident set. By identifying the working set and encrypting remaining part of the resident set, i-NVMM can keep the majority of the main memory encrypted at all times without penalizing performance by much. Our experiments demonstrate promising results. i-NVMM keeps 78% of the main memory encrypted across SPEC2006 benchmarks, yet only incurs 3.7% execution time overhead, and has a negligible impact on the write endurance of NVMM, all achieved with a relatively simple hardware support in the memory module.
Citations: 158
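The working-set observation behind i-NVMM can be sketched directly: pages idle longer than a threshold are encrypted in the background, and a page is decrypted on its next access. In the sketch below, the XOR "cipher" and the `IDLE_THRESHOLD` predictor are illustrative stand-ins for the memory-module AES engine and the paper's working-set identification; this is the scheme's shape, not its implementation.

```python
# Sketch of incremental NVMM encryption: pages outside the predicted
# working set are encrypted off the critical path; decryption happens
# on demand at the next access.
IDLE_THRESHOLD = 3      # assumed sweeps of inactivity before encrypting
KEY = 0x5A              # toy key for the stand-in XOR cipher

class Page:
    def __init__(self, data):
        self.data, self.encrypted, self.last_access = bytearray(data), False, 0

class INVMM:
    def __init__(self, pages):
        self.pages, self.now = pages, 0

    def access(self, pid):
        page = self.pages[pid]
        if page.encrypted:                      # decrypt on demand
            page.data = bytearray(b ^ KEY for b in page.data)
            page.encrypted = False
        page.last_access = self.now
        return page.data

    def background_sweep(self):                 # runs off the critical path
        self.now += 1
        for page in self.pages:
            if not page.encrypted and self.now - page.last_access > IDLE_THRESHOLD:
                page.data = bytearray(b ^ KEY for b in page.data)
                page.encrypted = True

mem = INVMM([Page(b"secret0"), Page(b"secret1")])
for t in range(6):
    mem.access(0)                               # page 0 stays in the working set
    mem.background_sweep()
print([p.encrypted for p in mem.pages])         # -> [False, True]
```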
Dark silicon and the end of multicore scaling
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000108
H. Esmaeilzadeh, Emily R. Blem, Renee St. Amant, K. Sankaralingam, D. Burger
Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9× average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.
Citations: 364
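The core argument reduces to a worked example: under a fixed chip power budget, only some cores can be powered at once, so Amdahl-style speedup saturates even for fairly parallel code. The toy model below uses assumed numbers (100 W budget, 1 W per core, 95%-parallel workload); the paper's actual model, built on Pareto frontiers and ITRS scaling, is far more detailed.

```python
# A worked toy of the dark-silicon argument: power, not core count,
# bounds multicore speedup. All parameters are assumed for illustration.
def speedup(n_cores, f_parallel, power_budget, core_power, core_perf=1.0):
    active = min(n_cores, int(power_budget / core_power))  # dark-silicon limit
    serial = (1 - f_parallel) / core_perf
    parallel = f_parallel / (core_perf * active)
    return 1.0 / (serial + parallel)                       # Amdahl's law

for cores in (16, 64, 256, 1024):
    # 100 W budget, 1 W per core, 95%-parallel workload
    print(cores, round(speedup(cores, 0.95, 100, 1.0), 1))
# Beyond 100 powered cores, adding more (dark) cores buys nothing:
# 16 -> 9.1x, 64 -> 15.4x, 256 and 1024 -> 16.8x
```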
Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000100
D. Yoon, Minseong Jeong, M. Erez
We propose adaptive granularity to combine the best of fine-grained and coarse-grained memory accesses. We augment virtual memory to allow each page to specify its preferred granularity of access based on spatial locality and error-tolerance tradeoffs. We use sector caches and sub-ranked memory systems to implement adaptive granularity. We also show how to incorporate adaptive granularity into memory access scheduling. We evaluate our architecture with and without ECC using memory intensive benchmarks from the SPEC, Olden, PARSEC, SPLASH2, and HPCS benchmark suites and micro-benchmarks. The evaluation shows that performance is improved by 61% without ECC and 44% with ECC in memory-intensive applications, while the reduction in memory power consumption (29% without ECC and 14% with ECC) and traffic (78% without ECC and 66% with ECC) is significant.
Citations: 92
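The per-page granularity decision can be sketched as a profiling pass: measure how many distinct words of each fetched block a page actually uses, then mark the page fine-grained or coarse-grained. Block and word sizes and the 50% utilization cut below are assumptions, and this profiling heuristic is an illustration rather than the paper's mechanism.

```python
# Sketch of adaptive granularity: classify each page by observed spatial
# locality so fetches can be single-word (fine) or full-block (coarse).
from collections import defaultdict

BLOCK_WORDS, THRESHOLD = 8, 0.5     # 64B blocks of 8B words; 50% utilization cut

def choose_granularity(accesses, page_bits=12, word_bytes=8):
    used = defaultdict(set)                      # (page, block) -> words touched
    for addr in accesses:
        page, word = addr >> page_bits, addr // word_bytes
        used[(page, word // BLOCK_WORDS)].add(word)
    util = defaultdict(list)                     # per-page block utilization
    for (page, _), words in used.items():
        util[page].append(len(words) / BLOCK_WORDS)
    return {page: "coarse" if sum(u) / len(u) >= THRESHOLD else "fine"
            for page, u in util.items()}

# Page 0: streaming (every word used) -> coarse; page 1: strided -> fine.
trace = list(range(0, 4096, 8)) + list(range(4096, 8192, 256))
print(choose_granularity(trace))                 # -> {0: 'coarse', 1: 'fine'}
```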
Crafting a usable microkernel, processor, and I/O system with strict and provable information flow security
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000087
Mohit Tiwari, J. Oberg, Xun Li, Jonathan Valamehr, T. Levin, B. Hardekopf, R. Kastner, F. Chong, T. Sherwood
High assurance systems used in avionics, medical implants, and cryptographic devices often rely on a small trusted base of hardware and software to manage the rest of the system. Crafting the core of such a system in a way that achieves flexibility, security, and performance requires a careful balancing act. Simple static primitives with hard partitions of space and time are easier to analyze formally, but strict approaches to the problem at the hardware level have been extremely restrictive, failing to allow even the simplest of dynamic behaviors to be expressed. Our approach to this problem is to construct a minimal but configurable architectural skeleton. This skeleton couples a critical slice of the low level hardware implementation with a microkernel in a way that allows information flow properties of the entire construction to be statically verified all the way down to its gate-level implementation. This strict structure is then made usable by a runtime system that delivers more traditional services (e.g. communication interfaces and long-living contexts) in a way that is decoupled from the information flow properties of the skeleton. To test the viability of this approach we design, test, and statically verify the information-flow security of a hardware/software system complete with support for unbounded operation, inter-process communication, pipelined operation, and I/O with traditional devices. The resulting system is provably sound even when adversaries are allowed to execute arbitrary code on the machine, yet is flexible enough to allow caching, pipelining, and other common case optimizations.
Citations: 113
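Gate-level verifiability of information flow, as in this line of work, rests on shadow logic: each wire carries a value bit and a taint bit, and taint propagates only when a tainted input can actually change the gate's output. The sketch below illustrates that principle for AND and NOT gates; it is a GLIFT-style illustration, not the paper's verified hardware.

```python
# Sketch of gate-level information-flow tracking with shadow logic.
def and_gate(a, ta, b, tb):
    out = a & b
    # A tainted input matters only if the other input doesn't already
    # force the output: an untainted 0 on one leg of an AND masks the other.
    t_out = (ta & (b | tb)) | (tb & (a | ta))
    return out, t_out

def not_gate(a, ta):
    return a ^ 1, ta                 # inversion never hides a tainted input

# Untainted 0 masks the tainted input: output is provably untainted...
print(and_gate(1, 1, 0, 0))          # -> (0, 0)
# ...but an untainted 1 passes the tainted value through.
print(and_gate(1, 1, 1, 0))          # -> (1, 1)
```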
TLSync: Support for multiple fast barriers using on-chip transmission lines
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000078
Jung-Sub Oh, Milos Prvulović, A. Zajić
As the number of cores on a single-chip grows, scalable barrier synchronization becomes increasingly difficult to implement. In software implementations, such as the tournament barrier, a larger number of cores results in a longer latency for each round and a larger number of rounds. Hardware barrier implementations require significant dedicated wiring, e.g., using a reduction (arrival) tree and a notification (release) tree, and multiple instances of this wiring are needed to support multiple barriers (e.g., when concurrently executing multiple parallel applications). This paper presents TLSync, a novel hardware barrier implementation that uses the high-frequency part of the spectrum in a transmission-line broadcast network, thus leaving the transmission line network free for non-modulated (base-band) data transmission. In contrast to other implementations of hardware barriers, TLSync allows multiple thread groups to each have its own barrier. This is accomplished by allocating different bands in the radio-frequency spectrum to different groups. Our circuit-level and electromagnetic models show that the worst-case latency for a TLSync barrier is 4ns to 10ns, depending on the size of the frequency band allocated to each group, and our cycle-accurate architectural simulations show that low-latency TLSync barriers provide significant performance and scalability benefits to barrier-intensive applications.
Citations: 39
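The multiple-barrier idea can be sketched logically: each thread group owns a disjoint slice of the spectrum, so groups synchronize without interfering. In the sketch below the shared analog spectrum is reduced to a bitmask with one "sub-carrier" bit per member, which is only a digital stand-in for the real design's modulated frequency bands.

```python
# Sketch of TLSync's band-per-group barriers on a shared broadcast medium.
class TLBarrier:
    _next_band = 0                            # next free slice of "spectrum"

    def __init__(self, group_size):
        self.size = group_size
        self.band = TLBarrier._next_band      # disjoint band per group
        TLBarrier._next_band += group_size
        self.spectrum = 0                     # this band's observed carriers

    def arrive(self, member_id):
        # Each member "transmits" on its own sub-carrier within the band.
        self.spectrum |= 1 << (self.band + member_id)

    def released(self):
        mask = ((1 << self.size) - 1) << self.band
        return (self.spectrum & mask) == mask # all sub-carriers seen -> release

g0, g1 = TLBarrier(4), TLBarrier(2)           # two concurrent barrier groups
for m in range(4): g0.arrive(m)
g1.arrive(0)
print(g0.released(), g1.released())           # -> True False
```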
Rebound: Scalable checkpointing for coherent shared memory
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000083
Rishi Agarwal, P. Garg, J. Torrellas
As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rollback sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
Citations: 34
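The key bookkeeping is dependence grouping: directory transactions reveal which processors communicated since their last checkpoints, and checkpoint/rollback then operate on that dynamic group rather than the whole machine. Below, union-find stands in for the paper's distributed protocol; it illustrates the grouping, not the actual hardware algorithm.

```python
# Sketch of Rebound-style dependence grouping: processors that have
# communicated since their last checkpoints must checkpoint or roll
# back together; everyone else keeps running.
class DependenceGroups:
    def __init__(self, n_procs):
        self.parent = list(range(n_procs))

    def find(self, p):
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]  # path halving
            p = self.parent[p]
        return p

    def record_communication(self, producer, consumer):
        # Called when the directory sees 'consumer' read data made dirty
        # by 'producer' since producer's last checkpoint.
        self.parent[self.find(producer)] = self.find(consumer)

    def rollback_set(self, faulty):
        root = self.find(faulty)
        return {p for p in range(len(self.parent)) if self.find(p) == root}

g = DependenceGroups(8)
g.record_communication(0, 1)
g.record_communication(1, 2)      # procs 0, 1, 2 now form one group
print(g.rollback_set(0))          # -> {0, 1, 2}; procs 3..7 are unaffected
```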
Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems
Pub Date: 2011-06-04 DOI: 10.1145/2000064.2000115
Aniruddha N. Udipi, Naveen Muralimanohar, R. Balasubramonian, A. Davis, N. Jouppi
It is well known that memory latency, energy, capacity, bandwidth, and scalability will be critical bottlenecks in future large-scale systems. This paper addresses these problems, focusing on the interface between the compute cores and memory, comprising the physical interconnect and the memory access protocol. For the physical interconnect, we study the prudent use of emerging silicon-photonic technology to reduce energy consumption and improve capacity scaling. We conclude that photonics are effective primarily to improve socket-edge bandwidth by breaking the pin barrier, and for use on heavily utilized links. For the access protocol, we propose a novel packet based interface that relinquishes most of the tight control that the memory controller holds in current systems and allows the memory modules to be more autonomous, improving flexibility and interoperability. The key enabler here is the introduction of a 3D-stacked interface die that allows both these optimizations without modifying commodity memory dies. The interface die handles all conversion between optics and electronics, as well as all low-level memory device control functionality. Communication beyond the interface die is fully electrical, with TSVs between dies and low-swing wires on-die. We show that such an approach results in substantially lowered energy consumption, reduced latency, better scalability to large capacities, and better support for heterogeneity and interoperability.
Citations: 70
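The protocol shift is from command-level control to self-contained request packets that the stacked interface die schedules on its own. The sketch below models that division of labor; the packet fields and FIFO scheduling policy are assumptions for illustration, not the paper's specification.

```python
# Sketch of a packet-based memory interface: the controller sends whole
# requests, and the 3D-stacked interface die owns all device-level timing.
from collections import deque, namedtuple

Request = namedtuple("Request", "tag op addr data")   # op: 'rd' or 'wr'
Reply = namedtuple("Reply", "tag data")

class InterfaceDie:
    """Sits on the memory stack; owns banks, scheduling, and O/E conversion."""
    def __init__(self):
        self.banks = [dict() for _ in range(8)]
        self.queue = deque()

    def receive(self, pkt):                      # the photonic link ends here
        self.queue.append(pkt)

    def step(self):
        # The die schedules its own devices; the controller never issues
        # RAS/CAS-style commands or tracks per-device timing constraints.
        if not self.queue:
            return None
        pkt = self.queue.popleft()
        bank = self.banks[(pkt.addr >> 6) % 8]   # simple block-interleaved map
        if pkt.op == "wr":
            bank[pkt.addr] = pkt.data
            return Reply(pkt.tag, None)
        return Reply(pkt.tag, bank.get(pkt.addr))

die = InterfaceDie()
die.receive(Request(1, "wr", 0x40, 0xBEEF))
die.receive(Request(2, "rd", 0x40, None))
while (r := die.step()) is not None:
    print(r)                    # Reply(tag=1, data=None), Reply(tag=2, data=48879)
```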