
2011 IEEE 17th International Symposium on High Performance Computer Architecture: Latest Publications

NUcache: An efficient multicore cache organization based on Next-Use distance
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749733
R. Manikantan, K. Rajan, Ramaswamy Govindarajan
The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC — Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit analysis based algorithm that utilizes Next-Use information to select the set of PCs that can maximize the hits experienced in DeliWays. Performance evaluation reveals that NUcache improves the performance over a baseline design by 9.6%, 30% and 33% respectively for dual, quad and eight core workloads comprised of SPEC benchmarks. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.
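The way-partitioning idea is easy to picture in a few lines. Below is a minimal sketch, assuming a single set with an LRU policy in each partition and a precomputed set of selected delinquent PCs; the 12/4 way split and all names are ours, not the paper's.

```python
from collections import OrderedDict

MAIN_WAYS, DELI_WAYS = 12, 4  # hypothetical split of a 16-way set

class NUCacheSet:
    def __init__(self, selected_pcs):
        self.main = OrderedDict()          # tag -> None, LRU order (oldest first)
        self.deli = OrderedDict()
        self.selected_pcs = selected_pcs   # PCs chosen by the cost-benefit pass

    def access(self, tag, pc):
        for ways in (self.main, self.deli):
            if tag in ways:
                ways.move_to_end(tag)      # hit: refresh LRU position
                return True
        self._fill(tag, pc)                # miss: bring the line in
        return False

    def _fill(self, tag, pc):
        # Only lines fetched by a selected delinquent PC may enter the DeliWays;
        # all other fills compete for the MainWays.
        ways, limit = ((self.deli, DELI_WAYS) if pc in self.selected_pcs
                       else (self.main, MAIN_WAYS))
        if len(ways) >= limit:
            ways.popitem(last=False)       # evict the LRU line of that partition
        ways[tag] = None
```

Hits in either partition count as ordinary hits; the benefit of the scheme comes from the DeliWays retaining lines whose Next-Use distance exceeds what the MainWays' LRU depth alone could cover.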
{"title":"NUcache: An efficient multicore cache organization based on Next-Use distance","authors":"R. Manikantan, K. Rajan, Ramaswamy Govindarajan","doi":"10.1109/HPCA.2011.5749733","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749733","url":null,"abstract":"The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC — Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit analysis based algorithm that utilizes Next-Use information to select the set of PCs that can maximize the hits experienced in DeliWays. Performance evaluation reveals that NUcache improves the performance over a baseline design by 9.6%, 30% and 33% respectively for dual, quad and eight core workloads comprised of SPEC benchmarks. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122406213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 51
MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749732
Shekhar Srikantaiah, Emre Kultursay, Zhang Tao, M. Kandemir, M. J. Irwin, Yuan Xie
Given the diverse range of application characteristics that chip multiprocessors (CMPs) need to cater to, a “one-cache-topology-fits-all” design philosophy will clearly be inadequate. In this paper, we propose MorphCache, a Reconfigurable Adaptive Multi-level Cache hierarchy. MorphCache dynamically tunes a multi-level cache topology in a CMP to allow significantly different cache topologies to exist on the same architecture. Starting from per-core L2 and L3 cache slices as the basic design point, MorphCache alters the cache topology dynamically by merging or splitting cache slices and modifying the accessibility of different cache slice groups to different cores in a CMP. We evaluated MorphCache on a 16 core CMP on a full system simulator and found that it significantly improves both average throughput and harmonic mean of speedups of diverse multithreaded and multiprogrammed workloads. Specifically, our results show that MorphCache improves throughput of the multiprogrammed mixes by 29.9% over a topology with all-shared L2 and L3 caches and 27.9% over a topology with per core private L2 cache and shared L3 cache. In addition, we also compared MorphCache to partitioning a single shared cache at each level using promotion/insertion pseudo-partitioning (PIPP) [28] and managing per-core private cache at each level using dynamic spill receive caches (DSR) [18]. We found that MorphCache improves average throughput by 6.6% over PIPP and by 5.7% over DSR when applied to both L2 and L3 caches.
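As a rough illustration of the reconfiguration bookkeeping, the sketch below treats the topology as a partition of per-core slices into groups that merge and split; the structure and names are our own simplification, not the paper's mechanism.

```python
class SliceTopology:
    def __init__(self, n_cores):
        # Basic design point: one private L2/L3 slice group per core.
        self.groups = [{core} for core in range(n_cores)]

    def group_of(self, core):
        return next(g for g in self.groups if core in g)

    def merge(self, a, b):
        ga, gb = self.group_of(a), self.group_of(b)
        if ga is not gb:              # union two slice groups into one shared pool
            self.groups.remove(gb)
            ga |= gb

    def split(self, core):
        g = self.group_of(core)
        if len(g) > 1:                # peel the core back to a private slice
            g.remove(core)
            self.groups.append({core})

    def may_access(self, core, slice_owner):
        # A core may access a slice iff the owning core is in its group.
        return slice_owner in self.group_of(core)
```

For instance, `t = SliceTopology(16); t.merge(0, 1)` yields one two-core shared pool while the other fourteen slices stay private; a real policy would drive `merge`/`split` from runtime miss and sharing statistics.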
{"title":"MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy","authors":"Shekhar Srikantaiah, Emre Kultursay, Zhang Tao, M. Kandemir, M. J. Irwin, Yuan Xie","doi":"10.1109/HPCA.2011.5749732","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749732","url":null,"abstract":"Given the diverse range of application characteristics that chip multiprocessors (CMPs) need to cater to, a “one-cache-topology-fits-all” design philosophy will clearly be inadequate. In this paper, we propose MorphCache, a Reconfigurable Adaptive Multi-level Cache hierarchy. Mor-phCache dynamically tunes a multi-level cache topology in a CMP to allow significantly different cache topologies to exist on the same architecture. Starting from per-core L2 and L3 cache slices as the basic design point, MorphCache alters the cache topology dynamically by merging or splitting cache slices and modifying the accessibility of different cache slice groups to different cores in a CMP. We evaluated MorphCache on a 16 core CMP on a full system simulator and found that it significantly improves both average throughput and harmonic mean of speedups of diverse multithreaded and multiprogrammed workloads. Specifically, our results show that MorphCache improves throughput of the multiprogrammed mixes by 29.9% over a topology with all-shared L2 and L3 caches and 27.9% over a topology with per core private L2 cache and shared L3 cache. In addition, we also compared MorphCache to partitioning a single shared cache at each level using promotion/insertion pseudo-partitioning (PIPP) [28] and managing per-core private cache at each level using dynamic spill receive caches (DSR) [18]. We found that MorphCache improves average throughput by 6.6% over PIPP and by 5.7% over DSR when applied to both L2 and L3 caches.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"280 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127391273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 45
Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749719
M. Mehrara, Po-Chun Hsu, M. Samadi, S. Mahlke
As the web becomes the platform of choice for execution of more complex applications, a growing portion of computation is handed off by developers to the client side to reduce network traffic and improve application responsiveness. Therefore, the client-side component, often written in JavaScript, is becoming larger and more compute-intensive, increasing the demand for high performance JavaScript execution. This has led to many recent efforts to improve the performance of JavaScript engines in the web browsers. Furthermore, considering the wide-spread deployment of multi-cores in today's computing systems, exploiting parallelism in these applications is a promising approach to meet their performance requirement. However, JavaScript has traditionally been treated as a sequential language with no support for multithreading, limiting its potential to make use of the extra computing power in multicore systems. In this work, to exploit hardware concurrency while retaining traditional sequential programming model, we develop ParaScript, an automatic runtime parallelization system for JavaScript applications on the client's browser. First, we propose an optimistic runtime scheme for identifying parallelizable regions, generating the parallel code on-the-fly, and speculatively executing it. Second, we introduce an ultra-lightweight software speculation mechanism to manage parallel execution. This speculation engine consists of a selective checkpointing scheme and a novel runtime dependence detection mechanism based on reference counting and range-based array conflict detection. Our system is able to achieve an average of 2.18× speedup over the Firefox browser using 8 threads on commodity multi-core systems, while performing all required analyses and conflict detection dynamically at runtime.
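The range-based array conflict detection can be sketched compactly: each speculative region logs only a [lo, hi] index interval per array for reads and writes, and commit checks interval overlap rather than comparing individual addresses. This is a minimal illustration under that assumption; the names are ours.

```python
class RangeLog:
    """Per-thread speculative log: one [lo, hi] index range per array."""
    def __init__(self):
        self.reads, self.writes = {}, {}

    def _extend(self, log, arr, idx):
        lo, hi = log.get(arr, (idx, idx))
        log[arr] = (min(lo, idx), max(hi, idx))

    def on_read(self, arr, idx):
        self._extend(self.reads, arr, idx)

    def on_write(self, arr, idx):
        self._extend(self.writes, arr, idx)

def overlap(a, b):
    return a is not None and b is not None and a[0] <= b[1] and b[0] <= a[1]

def conflicts(t1, t2):
    # Commit fails on write-write or write-read range overlap on any array.
    for arr in set(t1.writes) | set(t2.writes):
        if (overlap(t1.writes.get(arr), t2.writes.get(arr)) or
            overlap(t1.writes.get(arr), t2.reads.get(arr)) or
            overlap(t2.writes.get(arr), t1.reads.get(arr))):
            return True
    return False
```

Keeping one interval per array makes the logs tiny, which is what makes the speculation ultra-lightweight; the price is occasional false conflicts when accesses are sparse within a wide range.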
{"title":"Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism","authors":"M. Mehrara, Po-Chun Hsu, M. Samadi, S. Mahlke","doi":"10.1109/HPCA.2011.5749719","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749719","url":null,"abstract":"As the web becomes the platform of choice for execution of more complex applications, a growing portion of computation is handed off by developers to the client side to reduce network traffic and improve application responsiveness. Therefore, the client-side component, often written in JavaScript, is becoming larger and more compute-intensive, increasing the demand for high performance JavaScript execution. This has led to many recent efforts to improve the performance of JavaScript engines in the web browsers. Furthermore, considering the wide-spread deployment of multi-cores in today's computing systems, exploiting parallelism in these applications is a promising approach to meet their performance requirement. However, JavaScript has traditionally been treated as a sequential language with no support for multithreading, limiting its potential to make use of the extra computing power in multicore systems. In this work, to exploit hardware concurrency while retaining traditional sequential programming model, we develop ParaScript, an automatic runtime parallelization system for JavaScript applications on the client's browser. First, we propose an optimistic runtime scheme for identifying parallelizable regions, generating the parallel code on-the-fly, and speculatively executing it. Second, we introduce an ultra-lightweight software speculation mechanism to manage parallel execution. This speculation engine consists of a selective checkpointing scheme and a novel runtime dependence detection mechanism based on reference counting and range-based array conflict detection. Our system is able to achieve an average of 2.18× speedup over the Firefox browser using 8 threads on commodity multi-core systems, while performing all required analyses and conflict detection dynamically at runtime.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127994621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 47
I-CASH: Intelligently Coupled Array of SSD and HDD
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749736
Qing Yang, Jin Ren
This paper presents a new disk I/O architecture composed of an array of a flash memory SSD (solid state disk) and a hard disk drive (HDD) that are intelligently coupled by a special algorithm. We call this architecture I-CASH: Intelligently Coupled Array of SSD and HDD. The SSD stores seldom-changed and mostly read reference data blocks whereas the HDD stores a log of deltas between currently accessed I/O blocks and their corresponding reference blocks in the SSD so that random writes are not performed in SSD during online I/O operations. High speed delta compression and similarity detection algorithms are developed to control the pair of SSD and HDD. The idea is to exploit the fast read performance of SSDs and the high speed computation of modern multi-core CPUs to replace and substitute, to a great extent, the mechanical operations of HDDs. At the same time, we avoid runtime SSD writes that are slow and wearing. An experimental prototype I-CASH has been implemented and is used to evaluate I-CASH performance as compared to existing SSD/HDD I/O architectures. Numerical results on standard benchmarks show that I-CASH reduces the average I/O response time by an order of magnitude compared to existing disk I/O architectures such as RAID and SSD/HDD storage hierarchy, and provides up to 2.8× speedup over state-of-the-art pure SSD storage. Furthermore, I-CASH reduces random writes to SSD implying reduced wearing and prolonged lifetime of the SSD.
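A toy model of the read/write path conveys the division of labor, assuming fixed-size blocks and a naive byte-level delta in place of the paper's high-speed delta compression; all names are illustrative.

```python
ssd_ref = {}   # block id -> immutable reference block (lives on the SSD)
hdd_log = {}   # block id -> delta against the reference (lives on the HDD log)

def compute_delta(reference, current):
    # Naive stand-in for the paper's delta compression: changed bytes only.
    assert len(reference) == len(current), "fixed-size blocks assumed"
    return [(i, c) for i, (r, c) in enumerate(zip(reference, current)) if r != c]

def apply_delta(reference, delta):
    block = bytearray(reference)
    for offset, value in delta:
        block[offset] = value
    return bytes(block)

def write_block(bid, data):
    if bid not in ssd_ref:
        ssd_ref[bid] = data            # first write seeds the reference block
        hdd_log[bid] = []
    else:
        # Subsequent writes never touch the SSD; only the delta is updated.
        hdd_log[bid] = compute_delta(ssd_ref[bid], data)

def read_block(bid):
    return apply_delta(ssd_ref[bid], hdd_log[bid])
```

Reads cost one SSD reference fetch plus a delta application in the CPU; writes update only the HDD-resident delta log, which is what keeps random writes out of the SSD.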
{"title":"I-CASH: Intelligently Coupled Array of SSD and HDD","authors":"Qing Yang, Jin Ren","doi":"10.1109/HPCA.2011.5749736","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749736","url":null,"abstract":"This paper presents a new disk I/O architecture composed of an array of a flash memory SSD (solid state disk) and a hard disk drive (HDD) that are intelligently coupled by a special algorithm. We call this architecture I-CASH: Intelligently Coupled Array of SSD and HDD. The SSD stores seldom-changed and mostly read reference data blocks whereas the HDD stores a log of deltas between currently accessed I/O blocks and their corresponding reference blocks in the SSD so that random writes are not performed in SSD during online I/O operations. High speed delta compression and similarity detection algorithms are developed to control the pair of SSD and HDD. The idea is to exploit the fast read performance of SSDs and the high speed computation of modern multi-core CPUs to replace and substitute, to a great extent, the mechanical operations of HDDs. At the same time, we avoid runtime SSD writes that are slow and wearing. An experimental prototype I-CASH has been implemented and is used to evaluate I-CASH performance as compared to existing SSD/HDD I/O architectures. Numerical results on standard benchmarks show that I-CASH reduces the average I/O response time by an order of magnitude compared to existing disk I/O architectures such as RAID and SSD/HDD storage hierarchy, and provides up to 2.8 speedup over state-of-the-art pure SSD storage. Furthermore, I-CASH reduces random writes to SSD implying reduced wearing and prolonged life time of the SSD.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115860124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 167
Checked Load: Architectural support for JavaScript type-checking on mobile processors
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749748
O. Anderson, Emily Fortuna, L. Ceze, S. Eggers
Dynamic languages such as JavaScript are the de-facto standard for web applications. However, generating efficient code for dynamically-typed languages is a challenge, because it requires frequent dynamic type checks. Our analysis has shown that some programs spend upwards of 20% of dynamic instructions doing type checks, and 12.9% on average. In this paper we propose Checked Load, a low-complexity architectural extension that replaces software-based, dynamic type checking. Checked Load is comprised of four new ISA instructions that provide flexible and automatic type checks for memory operations, and whose implementation requires minimal hardware changes. We also propose hardware support for dynamic type prediction to reduce the cost of failed type checks. We show how to use Checked Load in the Nitro JavaScript just-in-time compiler (used in the Safari 5 browser). Speedups on a typical mobile processor range up to 44.6% (with a mean of 11.2%) in popular JavaScript benchmarks. While we have focused our work on JavaScript, Checked Load is sufficiently general to support other dynamically-typed languages, such as Python or Ruby.
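In software terms, a checked load fuses the load with a tag comparison and diverts to a slow path on mismatch. The sketch below models that with tagged integer words; the tag encoding and all names are invented for illustration.

```python
TAG_BITS, TAG_INT, TAG_PTR = 3, 0b001, 0b010   # hypothetical tag encoding

def checked_load(memory, addr, expected_tag, slow_path):
    word = memory[addr]
    tag = word & ((1 << TAG_BITS) - 1)
    if tag != expected_tag:            # hardware would raise a type-check fault
        return slow_path(word)         # the JIT's out-of-line generic handler
    return word >> TAG_BITS            # fast path: single load, tag stripped

mem = {0x10: (42 << TAG_BITS) | TAG_INT}
print(checked_load(mem, 0x10, TAG_INT, slow_path=lambda w: "deopt"))  # -> 42
```

The win over a pure-software scheme is that the compare-and-branch the JIT would otherwise inline at every property or element access is folded into the memory instruction itself.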
{"title":"Checked Load: Architectural support for JavaScript type-checking on mobile processors","authors":"O. Anderson, Emily Fortuna, L. Ceze, S. Eggers","doi":"10.1109/HPCA.2011.5749748","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749748","url":null,"abstract":"Dynamic languages such as Javascript are the de-facto standard for web applications. However, generating efficient code for dynamically-typed languages is a challenge, because it requires frequent dynamic type checks. Our analysis has shown that some programs spend upwards of 20% of dynamic instructions doing type checks, and 12.9% on average. In this paper we propose Checked Load, a low-complexity architectural extension that replaces software-based, dynamic type checking. Checked Load is comprised of four new ISA instructions that provide flexible and automatic type checks for memory operations, and whose implementation requires minimal hardware changes. We also propose hardware support for dynamic type prediction to reduce the cost of failed type checks. We show how to use Checked Load in the Nitro JavaScript just-in-time compiler (used in the Safari 5 browser). Speedups on a typical mobile processor range up to 44.6% (with a mean of 11.2%) in popular JavaScript benchmarks. While we have focused our work on JavaScript, Checked Load is sufficiently general to support other dynamically-typed languages, such as Python or Ruby.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130478531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Addressing system-level trimming issues in on-chip nanophotonic networks
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749722
C. Nitta, M. Farrens, V. Akella
The basic building block of on-chip nanophotonic interconnects is the microring resonator [14], and these resonators change their resonant wavelengths due to variations in temperature — a problem that can be addressed using a technique called "trimming", which involves correcting the drift via heating and/or current injection. Thus far system researchers have modeled trimming as a per ring fixed cost. In this work we show that at the system level using a fixed cost model is inappropriate — our simulations demonstrate that the cost of heating has a non-linear relationship with the number of rings, and also that current injection can lead to thermal runaway. We show that a very narrow Temperature Control Window (TCW) must be maintained in order for the network to work as desired. However, by exploiting the group drift property of co-located rings, it is possible to create a sliding window scheme which can increase the TCW. We also show that partially athermal rings can alleviate but not eliminate the problem.
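The sliding-window idea rests on the group-drift observation: since heaters can only add heat, the window can be slid up to the hottest ring of a co-located group and the rest heated to join it, which works as long as the group's spread stays within the window. A minimal sketch under those assumptions (the constants are made up):

```python
def retune(ring_temps, window_width=1.0):
    """Slide the temperature control window to fit a co-located ring group.

    ring_temps: current temperatures (deg C) of the rings in one group.
    Returns per-ring heating to apply (in degrees) and whether the
    group's spread still fits inside the window at all.
    """
    ceiling = max(ring_temps)                  # heaters cannot cool, so the
    heat = [ceiling - t for t in ring_temps]   # window slides up to the hottest ring
    feasible = (ceiling - min(ring_temps)) <= window_width
    return heat, feasible

print(retune([44.0, 44.5, 45.0]))  # -> ([1.0, 0.5, 0.0], True)
```

Because the target follows the group rather than a fixed setpoint, only the residual per-ring error is trimmed, which is what widens the effective TCW.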
{"title":"Addressing system-level trimming issues in on-chip nanophotonic networks","authors":"C. Nitta, M. Farrens, V. Akella","doi":"10.1109/HPCA.2011.5749722","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749722","url":null,"abstract":"The basic building block of on-chip nanophotonic interconnects is the microring resonator [14], and these resonators change their resonant wavelengths due to variations in temperature — a problem that can be addressed using a technique called ”trimming”, which involves correcting the drift via heating and/or current injection. Thus far system researchers have modeled trimming as a per ring fixed cost. In this work we show that at the system level using a fixed cost model is inappropriate — our simulations demonstrate that the cost of heating has a non-linear relationship with the number of rings, and also that current injection can lead to thermal runaway. We show that a very narrow Temperature Control Window (TCW) must be maintained in order for the network to work as desired. However, by exploiting the group drift property of co-located rings, it is possible to create a sliding window scheme which can increase the TCW. We also show that partially athermal rings can alleviate but not eliminate the problem.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"328 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115844572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 86
Safe and efficient supervised memory systems
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749744
J. Bobba, Marc Lupon, M. Hill, D. Wood
Supervised Memory systems use out-of-band metabits to control and monitor accesses to normal data memory for such purposes as transactional memory and memory typestate trackers. Previous proposals demonstrate the value of supervised memory systems, but have typically (1) assumed sequential consistency (while most deployed systems use weaker models), and (2) used ad hoc, informal memory specifications (that can be ambiguous and/or incorrect). This paper seeks to make many previous proposals more practical. This paper builds a foundation for future supervised memory systems which (1) operate with the TSO and x86 memory models, and (2) are formally specified using two supervised memory models. The simpler TSOall model requires all metadata and data accesses to obey TSO, but precludes using store buffers for supervised accesses. The more complex TSOdata model relaxes some ordering constraints (allowing store buffer use) but makes programmer reasoning more difficult. To get the benefits of both models, we propose Safe Supervision, which asks programmers to avoid using metabits from one location to order accesses to another. Programmers that obey safe supervision can reason with the simpler semantics of TSOall while obtaining the higher performance of TSOdata. Our approach is similar to how data-race-free programs can run on relaxed systems and yet appear sequentially consistent. Finally, we show that TSOdata can (a) provide significant performance benefit (up to 22%) over TSOall and (b) can be incorporated correctly and with low overhead into the RTL of an industrial multi-core chip design (OpenSPARC T2).
{"title":"Safe and efficient supervised memory systems","authors":"J. Bobba, Marc Lupon, M. Hill, D. Wood","doi":"10.1109/HPCA.2011.5749744","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749744","url":null,"abstract":"Supervised Memory systems use out-of-band metabits to control and monitor accesses to normal data memory for such purposes as transactional memory and memory typestate trackers. Previous proposals demonstrate the value of supervised memory systems, but have typically (1) assumed sequential consistency (while most deployed systems use weaker models), and (2) used ad hoc, informal memory specifications (that can be ambiguous and/or incorrect). This paper seeks to make many previous proposals more practical. This paper builds a foundation for future supervised memory systems which (1) operate with the TSO and ×86 memory models, and (2) are formally specified using two supervised memory models. The simpler TSOall model requires all metadata and data accesses to obey TSO, but precludes using store buffers for supervised accesses. The more complex TSOdata model relaxes some ordering constraints (allowing store buffer use) but makes programmer reasoning more difficult. To get the benefits of both models, we propose Safe Supervision, which asks programmers to avoid using metabits from one location to order accesses to another. Programmers that obey safe supervision can reason with the simpler semantics of TSOall while obtaining the higher performance of TSOdata. Our approach is similar to how data-race-free programs can run on relaxed systems and yet appear sequentially consistent. Finally, we show that TSOdata can (a) provide significant performance benefit (up to 22%) over TSOall and (b) can be incorporated correctly and with low overhead into the RTL of an industrial multi-core chip design (OpenSPARC T2).","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"256 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114377825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Exploiting criticality to reduce bottlenecks in distributed uniprocessors
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749749
Behnam Robatmili, Madhu Saravana Sibi Govindan, D. Burger, S. Keckler
Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.
{"title":"Exploiting criticality to reduce bottlenecks in distributed uniprocessors","authors":"Behnam Robatmili, Madhu Saravana Sibi Govindan, D. Burger, S. Keckler","doi":"10.1109/HPCA.2011.5749749","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749749","url":null,"abstract":"Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127281371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
A new server I/O architecture for high speed networks
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749734
Guangdeng Liao, Xia Zhu, L. Bhuyan
Traditional architectural designs are normally focused on CPUs and have been often decoupled from I/O considerations. They are inefficient for high-speed network processing with a bandwidth of 10Gbps and beyond. Long latency I/O interconnects on mainstream servers also substantially complicate the NIC designs. In this paper, we start with fine-grained driver and OS instrumentation to fully understand the network processing overhead over 10GbE on mainstream servers. We obtain several new findings: 1) besides data copy identified by previous works, the driver and buffer release are two unexpected major overheads (up to 54%); 2) the major source of the overheads is memory stalls and data relating to socket buffer (SKB) and page data structures are mainly responsible for the stalls; 3) prevailing platform optimizations like Direct Cache Access (DCA) are insufficient for addressing the network processing bottlenecks. Motivated by the studies, we propose a new server I/O architecture where DMA descriptor management is shifted from NICs to an on-chip network engine (NEngine), and descriptors are extended with information about data incurring memory stalls. NEngine relies on data lookups and preloads data to eliminate the stalls during network processing. Moreover, NEngine implements efficient packet movement inside caches to address the remaining issues in data copy. The new architecture allows DMA engine to have very fast access to descriptors and keeps packets in CPU caches instead of NIC buffers, significantly simplifying NICs. Experimental results demonstrate that the new server I/O architecture improves the network processing efficiency by 47% and web server throughput by 14%, while substantially reducing the NIC hardware complexity.
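The descriptor extension is the crux: each DMA descriptor carries, besides the buffer address, the addresses of the stall-prone metadata (the SKB and page structures), and the engine warms those lines before completing the packet. The toy model below captures just that flow; every name in it is illustrative, not from the paper.

```python
from collections import namedtuple

# Descriptor extended with the addresses known to stall the CPU.
ExtDescriptor = namedtuple("ExtDescriptor", "buf_addr length stall_addrs")

cpu_cache = set()          # stand-in for lines resident in the CPU cache

def nengine_complete(desc, completion_queue):
    for addr in desc.stall_addrs:      # preload SKB/page metadata lines
        cpu_cache.add(addr)
    completion_queue.append(desc)      # only now is the packet handed over

def driver_touch(addr):
    return "hit" if addr in cpu_cache else "miss (memory stall)"

queue = []
nengine_complete(ExtDescriptor(0x1000, 1514, stall_addrs=[0x2000, 0x2040]), queue)
print(driver_touch(0x2000))   # -> hit: the stall was absorbed by the engine
```

Moving descriptor management on-chip also means packets land in CPU caches rather than NIC buffers, which is where the claimed NIC simplification comes from.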
{"title":"A new server I/O architecture for high speed networks","authors":"Guangdeng Liao, Xia Zhu, L. Bhuyan","doi":"10.1109/HPCA.2011.5749734","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749734","url":null,"abstract":"Traditional architectural designs are normally focused on CPUs and have been often decoupled from I/O considerations. They are inefficient for high-speed network processing with a bandwidth of 10Gbps and beyond. Long latency I/O interconnects on mainstream servers also substantially complicate the NIC designs. In this paper, we start with fine-grained driver and OS instrumentation to fully understand the network processing overhead over 10GbE on mainstream servers. We obtain several new findings: 1) besides data copy identified by previous works, the driver and buffer release are two unexpected major overheads (up to 54%); 2) the major source of the overheads is memory stalls and data relating to socket buffer (SKB) and page data structures are mainly responsible for the stalls; 3) prevailing platform optimizations like Direct Cache Access (DCA) are insufficient for addressing the network processing bottlenecks. Motivated by the studies, we propose a new server I/O architecture where DMA descriptor management is shifted from NICs to an on-chip network engine (NEngine), and descriptors are extended with information about data incurring memory stalls. NEngine relies on data lookups and preloads data to eliminate the stalls during network processing. Moreover, NEngine implements efficient packet movement inside caches to address the remaining issues in data copy. The new architecture allows DMA engine to have very fast access to descriptors and keeps packets in CPU caches instead of NIC buffers, significantly simplifying NICs. Experimental results demonstrate that the new server I/O architecture improves the network processing efficiency by 47% and web server throughput by 14%, while substantially reducing the NIC hardware complexity.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124734353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 49
Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing
Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749735
Feng Chen, Rubao Lee, Xiaodong Zhang
Flash memory based solid state drives (SSDs) have shown a great potential to change storage infrastructure fundamentally through their high performance and low power. Most recent studies have mainly focused on addressing the technical limitations caused by special requirements for writes in flash memory. However, a unique merit of an SSD is its rich internal parallelism, which allows us to offset for the most part of the performance loss related to technical limitations by significantly increasing data processing throughput. In this work we present a comprehensive study of essential roles of internal parallelism of SSDs in high-speed data processing. Besides substantially improving I/O bandwidth (e.g. 7.2×), we show that by exploiting internal parallelism, SSD performance is no longer highly sensitive to access patterns, but rather to other factors, such as data access interferences and physical data layout. Specifically, through extensive experiments and thorough analysis, we obtain the following new findings in the context of concurrent data processing in SSDs. (1) Write performance is largely independent of access patterns (regardless of being sequential or random), and can even outperform reads, which is opposite to the long-existing common understanding about slow writes on SSDs. (2) One performance concern comes from interference between concurrent reads and writes, which causes substantial performance degradation. (3) Parallel I/O performance is sensitive to physical data-layout mapping, which is largely not observed without parallelism. (4) Existing application designs optimized for magnetic disks can be suboptimal for running on SSDs with parallelism. Our study is further supported by a group of case studies in database systems as typical data-intensive applications. With these critical findings, we give a set of recommendations to application designers and system architects for exploiting internal parallelism and maximizing the performance potential of SSDs.
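The headline effect is easy to reproduce with a small experiment: issue the same set of random 4 KiB reads with increasing thread counts and compare throughput. A rough sketch follows (the file path and sizes are placeholders; for device-level numbers the file should be larger than RAM or the page cache dropped between runs):

```python
import os, random, time
from concurrent.futures import ThreadPoolExecutor

PATH, BLOCK, COUNT = "testfile.bin", 4096, 4096   # adjust to your setup

def run(workers):
    fd = os.open(PATH, os.O_RDONLY)
    size = os.fstat(fd).st_size
    offsets = [random.randrange(0, size - BLOCK, BLOCK) for _ in range(COUNT)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda off: os.pread(fd, BLOCK, off), offsets))
    os.close(fd)
    return COUNT * BLOCK / (time.perf_counter() - start) / 1e6   # MB/s

for w in (1, 2, 4, 8, 16):
    print(f"{w:2d} threads: {run(w):8.1f} MB/s")
```

On an SSD the curve should keep climbing well past a few threads as requests spread across channels and planes; on a magnetic disk it flattens almost immediately, which is the asymmetry the paper builds on.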
{"title":"Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing","authors":"Feng Chen, Rubao Lee, Xiaodong Zhang","doi":"10.1109/HPCA.2011.5749735","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749735","url":null,"abstract":"Flash memory based solid state drives (SSDs) have shown a great potential to change storage infrastructure fundamentally through their high performance and low power. Most recent studies have mainly focused on addressing the technical limitations caused by special requirements for writes in flash memory. However, a unique merit of an SSD is its rich internal parallelism, which allows us to offset for the most part of the performance loss related to technical limitations by significantly increasing data processing throughput. In this work we present a comprehensive study of essential roles of internal parallelism of SSDs in high-speed data processing. Besides substantially improving I/O bandwidth (e.g. 7.2×), we show that by exploiting internal parallelism, SSD performance is no longer highly sensitive to access patterns, but rather to other factors, such as data access interferences and physical data layout. Specifically, through extensive experiments and thorough analysis, we obtain the following new findings in the context of concurrent data processing in SSDs. (1) Write performance is largely independent of access patterns (regardless of being sequential or random), and can even outperform reads, which is opposite to the long-existing common understanding about slow writes on SSDs. (2) One performance concern comes from interference between concurrent reads and writes, which causes substantial performance degradation. (3) Parallel I/O performance is sensitive to physical data-layout mapping, which is largely not observed without parallelism. (4) Existing application designs optimized for magnetic disks can be suboptimal for running on SSDs with parallelism. Our study is further supported by a group of case studies in database systems as typical data-intensive applications. With these critical findings, we give a set of recommendations to application designers and system architects for exploiting internal parallelism and maximizing the performance potential of SSDs.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132073447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 285