2011 IEEE 17th International Symposium on High Performance Computer Architecture: Latest Publications

Mercury: A fast and energy-efficient multi-level cell based Phase Change Memory system
Pub Date: 2011-02-12 DOI: 10.1109/HPCA.2011.5749742
Madhura Joshi, Wangyuan Zhang, Tao Li
Phase Change Memory (PCM) is one of the most promising technologies among emerging non-volatile memories. PCM stores data in crystalline and amorphous phases of the GST material using large differences in their electrical resistivity. Although it is possible to design a high capacity memory system by storing multiple bits at intermediate levels between the highest and lowest resistance states of PCM, it is difficult to obtain the tight distribution required for accurate reading of the data. Moreover, the required programming latency and energy for a Multiple Level PCM (MLC-PCM) cell is not trivial and can act as a major hurdle in adopting multilevel PCM in a high-density memory architecture design. Furthermore, the effect of process variation (PV) on a PCM cell exacerbates the variability in the necessary programming current and hence the target resistance spread, leading to the demand for high-latency, multi-iteration-based programming-and-verify write schemes for MLC-PCM. PV-aware control of programming current, programming using staircase-down current pulses, and programming using increasing reset current pulses are some of the traditional techniques used to achieve optimum programming energy, write latency, and accuracy, but they usually optimize only one aspect of the design. In this paper, we address the high write latency and process variation issues of MLC-PCM by introducing Mercury: a fast and energy-efficient multi-level cell based phase change memory architecture. Mercury adapts the programming scheme of a multi-level PCM cell by taking into consideration the initial state of the cell, the target resistance to be programmed, and the effect of process variation on the programming current profile of the cell. The proposed techniques act at both the circuit and microarchitecture levels. Simulation results show that Mercury achieves a 10% saving in programming latency and a 25% saving in programming energy for the PCM memory system compared to traditional methods.
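The programming flow sketched in the abstract is an iterative program-and-verify loop whose starting pulse depends on the cell's initial state, the target resistance band, and a per-cell process-variation factor. The following Python sketch is a minimal, illustrative model of such a loop; the resistance response, current values, and resistance bands are assumptions for illustration, not Mercury's circuit-level scheme.

```python
# Illustrative model of a PV-aware, state-aware program-and-verify loop for a
# 2-bit MLC-PCM cell. The resistance response, currents, and bands below are
# toy values, not Mercury's circuit-level parameters.

LEVEL_BANDS = {            # target resistance bands (ohms) per stored level
    0: (1e3, 5e3),
    1: (5e3, 2e4),
    2: (2e4, 1e5),
    3: (1e5, 1e6),
}

def apply_pulse(resistance, current_ua, pv_factor):
    """Toy cell response: a partial-RESET pulse raises resistance; pv_factor
    models how process variation shifts this particular cell's response."""
    return resistance * (1.0 + 0.02 * current_ua * pv_factor)

def program_cell(resistance, target_level, pv_factor,
                 base_current_ua=80.0, max_iters=8):
    lo, hi = LEVEL_BANDS[target_level]
    # Scale the first pulse by the per-cell PV factor, and verify before
    # pulsing so a cell whose initial state already matches needs no pulses.
    current = base_current_ua * pv_factor
    for it in range(1, max_iters + 1):
        if lo <= resistance < hi:          # verify step
            return resistance, it - 1      # converged; pulses used so far
        resistance = apply_pulse(resistance, current, pv_factor)
        current *= 0.9                     # staircase-down refinement
    return resistance, max_iters           # did not converge within budget

r, pulses = program_cell(resistance=2e3, target_level=2, pv_factor=1.1)
print(f"final resistance {r:.0f} ohm after {pulses} pulses")
```

Verifying first and scaling the opening pulse to the cell's PV profile is what lets a well-placed cell finish in fewer, better-aimed pulses, which matches the intuition behind the latency and energy savings reported in the abstract.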
Citations: 86
A quantitative performance analysis model for GPU architectures
Pub Date: 2011-02-12 DOI: 10.1109/HPCA.2011.5749745
Yao Zhang, John Douglas Owens
We develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Our model identifies GPU program bottlenecks and quantitatively analyzes performance, and thus allows programmers and architects to predict the benefits of potential program optimizations and architectural improvements. In particular, we use a microbenchmark-based approach to develop a throughput model for three major components of GPU execution time: the instruction pipeline, shared memory access, and global memory access. Because our model is based on the GPU's native instruction set, we can predict performance with a 5–15% error. To demonstrate the usefulness of the model, we analyze three representative real-world and already highly-optimized programs: dense matrix multiply, tridiagonal systems solver, and sparse matrix vector multiply. The model provides us detailed quantitative analysis on performance, allowing us to understand the configuration of the fastest dense matrix multiply implementation and to optimize the tridiagonal solver and sparse matrix vector multiply by 60% and 18% respectively. Furthermore, our model applied to analysis on these codes allows us to suggest architectural improvements on hardware resource allocation, avoiding bank conflicts, block scheduling, and memory transaction granularity.
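The model combines per-component throughputs for the instruction pipeline, shared memory, and global memory. As a rough illustration of how such a throughput model can be applied, the sketch below predicts kernel time as the maximum of the three component times, assuming the bottleneck component hides the others; the combining rule and the placeholder peak rates are assumptions, not the paper's calibrated model.

```python
# Illustrative bottleneck-style throughput model for a GPU kernel. The peak
# rates are placeholders; a real model would calibrate them with
# microbenchmarks on the target GPU.

def predict_kernel_time(n_instructions, n_smem_accesses, n_gmem_bytes,
                        inst_throughput,   # instructions per second
                        smem_throughput,   # shared-memory accesses per second
                        gmem_bandwidth):   # global-memory bytes per second
    times = {
        "instruction pipeline": n_instructions / inst_throughput,
        "shared memory":        n_smem_accesses / smem_throughput,
        "global memory":        n_gmem_bytes / gmem_bandwidth,
    }
    # Assume the slowest component dominates and hides the other two.
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck

# Example: a memory-bound kernel with placeholder GeForce-200-class rates.
t, which = predict_kernel_time(
    n_instructions=2.0e9, n_smem_accesses=5.0e8, n_gmem_bytes=4.0e9,
    inst_throughput=3.0e11, smem_throughput=1.5e11, gmem_bandwidth=1.2e11)
print(f"predicted time {t * 1e3:.1f} ms, bottleneck: {which}")
```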
Citations: 287
Storage free confidence estimation for the TAGE branch predictor
Pub Date: 2011-02-12 DOI: 10.1109/HPCA.2011.5749750
André Seznec
For the past 15 years, it has been shown that confidence estimation of branch prediction can be used for various purposes, such as fetch gating or throttling for power saving, or for controlling resource allocation policies in an SMT processor. In many proposals, using extra hardware and particularly storage tables for branch confidence estimators has been considered a worthwhile silicon investment. The TAGE predictor presented in 2006 is so far considered the state-of-the-art conditional branch predictor. In this paper, we show that very accurate confidence estimations can be done for the branch predictions performed by the TAGE predictor by simply observing the outputs of the predictor tables. Many confidence estimators proposed in the literature only discriminate between high confidence predictions and low confidence predictions. It has recently been pointed out that a more selective confidence discrimination could be useful. We show that observing the outputs of the predictor tables is sufficient to grade the confidence in the branch predictions with a very good granularity. Moreover, a slight modification of the predictor automaton allows the predictions to be discriminated into three classes: low confidence (with a misprediction rate in the 30% range), medium confidence (with a misprediction rate in the 8–12% range), and high confidence (with a misprediction rate lower than 1%).
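The key claim is that confidence can be read directly from the providing table entry's prediction counter, with no dedicated confidence storage. The sketch below grades a prediction into three classes from the strength of a small signed counter; the counter width and thresholds are illustrative assumptions, and only the per-class misprediction ranges in the comments come from the abstract.

```python
# Illustrative storage-free confidence grading from a TAGE-style entry's
# prediction counter. Counter width and thresholds are assumptions; only the
# per-class misprediction ranges in the comments come from the abstract.

def confidence_class(counter, newly_allocated=False):
    """Grade the prediction of the providing entry, whose 3-bit signed
    counter lies in [-4, 3] (predict taken when counter >= 0)."""
    strength = abs(counter + 0.5)   # distance from the taken/not-taken boundary
    if newly_allocated or strength < 1:
        return "low"       # abstract: misprediction rate in the 30% range
    if strength < 3:
        return "medium"    # abstract: misprediction rate in the 8-12% range
    return "high"          # abstract: misprediction rate below 1%

for c in (-4, -1, 0, 2, 3):
    print(c, confidence_class(c))
```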
Citations: 23
HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing
Pub Date: 2011-02-12 DOI: 10.1109/HPCA.2011.5749747
Michael Pellauer, Michael Adler, M. Kinsy, A. Parashar, J. Emer
In this paper we present the HAsim FPGA-accelerated simulator. HAsim is able to model a shared-memory multicore system including detailed core pipelines, cache hierarchy, and on-chip network, using a single FPGA. We describe the scaling techniques that make this possible, including novel uses of time-multiplexing in the core pipeline and on-chip network. We compare our time-multiplexed approach to a direct implementation, and present a case study that motivates why high-detail simulations should continue to play a role in the architectural exploration process.
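Time-division multiplexing here means a single physical pipeline model is reused for many simulated cores, each with its own architectural state swapped in on its turn. The Python sketch below is a software caricature of that idea; the structure and names are illustrative and do not reflect HAsim's actual FPGA implementation.

```python
# Software caricature of time-division multiplexing: one physical model of a
# pipeline stage is reused round-robin by N virtual cores, each keeping its
# own context. Illustrative only; not HAsim's FPGA structure.

from dataclasses import dataclass, field

@dataclass
class CoreContext:
    core_id: int
    pc: int = 0
    retired: int = 0
    trace: list = field(default_factory=list)   # per-core instruction stream

def pipeline_stage(ctx: CoreContext) -> None:
    """The single shared 'stage': advance one core by one model cycle."""
    if ctx.pc < len(ctx.trace):
        ctx.retired += 1      # toy model: retire one instruction per cycle
        ctx.pc += 1

def simulate(contexts, model_cycles):
    # One pass over the contexts advances every simulated core by one model
    # cycle, so N host cycles buy one multicore model cycle.
    for _ in range(model_cycles):
        for ctx in contexts:              # time-multiplexed reuse of the stage
            pipeline_stage(ctx)

cores = [CoreContext(i, trace=list(range(100 * (i + 1)))) for i in range(4)]
simulate(cores, model_cycles=50)
print([c.retired for c in cores])         # [50, 50, 50, 50]
```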
Citations: 105
Architectural framework for supporting operating system survivability
Pub Date: 2011-02-12 DOI: 10.1109/HPCA.2011.5749751
Xiaowei Jiang, Yan Solihin
The ever increasing size and complexity of Operating System (OS) kernel code bring an inevitable increase in the number of security vulnerabilities that can be exploited by attackers. A successful security attack on the kernel has a profound impact that may affect all processes running on it. In this paper we propose an architectural framework that provides survivability to the OS kernel, i.e. able to keep normal system operation despite security faults. It consists of three components that work together: (1) security attack detection, (2) security fault isolation, and (3) a recovery mechanism that resumes normal system operation. Through simple but carefully-designed architecture support, we provide OS kernel survivability with low performance overheads (< 5% for kernel intensive benchmarks). When tested with real world security attacks, our survivability mechanism automatically prevents the security faults from corrupting the kernel state or affecting other processes, recovers the kernel state and resumes execution.
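The framework is described as three cooperating pieces: detect the attack, isolate the fault, and recover so the system keeps running. The sketch below shows that control flow as a checkpoint-and-rollback loop around a kernel service; the detector, isolation hook, and checkpointing are hypothetical placeholders, not the paper's architectural mechanisms.

```python
# Illustrative detect -> isolate -> recover flow for a survivable kernel
# service. The detector, isolation step, and checkpointing below are
# hypothetical stand-ins for the paper's architectural support.

import copy

class SurvivableService:
    def __init__(self, state):
        self.state = state
        self.checkpoint = copy.deepcopy(state)    # last known-good state

    def detect_attack(self, request):
        # Placeholder detector: flag an obviously malformed request.
        return request.get("len", 0) > 4096

    def isolate(self, request):
        # Confine the fault so other processes never see its side effects.
        pass

    def recover(self):
        # Roll back to the last checkpoint and resume normal operation.
        self.state = copy.deepcopy(self.checkpoint)

    def handle(self, request):
        if self.detect_attack(request):
            self.isolate(request)
            self.recover()
            return "dropped"
        self.checkpoint = copy.deepcopy(self.state)
        self.state["handled"] = self.state.get("handled", 0) + 1
        return "ok"

svc = SurvivableService(state={})
print(svc.handle({"len": 64}))        # ok
print(svc.handle({"len": 100000}))    # dropped; state rolled back, service continues
```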
Citations: 6
Keynote address I: Programming the cloud
Pub Date: 2011-02-12 DOI: 10.1109/HPCA.2011.5749711
J. Larus
Client + cloud computing is a disruptive, new computing platform, combining diverse client devices — PCs, smartphones, sensors, and single-function and embedded devices — with the unlimited, on-demand computation and data storage offered by cloud computing services such as Amazon's AWS or Microsoft's Windows Azure. As with every advance in computing, programming is a fundamental challenge as client + cloud computing combines many difficult aspects of software development. Systems built for this world are inherently parallel and distributed, run on unreliable hardware, and must be continually available — a challenging programming model for even the most skilled programmers. How then do ordinary programmers develop software for the Cloud?
Citations: 0
Achieving uniform performance and maximizing throughput in the presence of heterogeneity
Pub Date: 2011-02-12 DOI: 10.1109/HPCA.2011.5749712
K. Rangan, Michael D. Powell, Gu-Yeon Wei, D. Brooks
Continued scaling of process technologies is critical to sustaining improvements in processor frequencies and performance. However, shrinking process technologies exacerbates process variations, i.e., the deviation of process parameters from their target specifications. In the context of multi-core CMPs, which are implemented to feature homogeneous cores, within-die process variations result in substantially different core frequencies. Exposing such process-variation-induced heterogeneity interferes with the norm of marketing chips at a single frequency. Further, application performance is undesirably dictated by the frequency of the core it is running on. To work around these challenges, a single uniform frequency, dictated by the slowest core, is currently chosen as the chip frequency, sacrificing the increased performance capabilities of cores that could operate at higher frequencies. In this paper, we propose choosing the mean frequency across all cores, in lieu of the minimum frequency, as the single frequency at which to sell the chip. We examine several scheduling algorithms implemented below the OS in hardware/firmware that guarantee minimum application performance near that of the average frequency, by masking process-variation-induced heterogeneity from the end user. We show that our Throughput-Driven Fairness (TDF) scheduling policy improves throughput by an average of 12% compared to a naive fairness scheme (round-robin) for frequency-sensitive applications. At the same time, TDF allows 98% of chips to maintain minimum performance at or above 90% of that expected at the mean frequency, presenting a single uniform performance level for the chip.
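The stated goal is to give every application at least mean-frequency performance while steering frequency-sensitive work toward fast cores. The sketch below shows one simple way such a policy could be expressed: each quantum, the thread furthest behind its mean-frequency budget is placed on the fastest core. The policy details are assumptions for illustration, not the TDF algorithm evaluated in the paper.

```python
# Illustrative frequency-aware scheduler for a CMP whose cores run at
# different frequencies because of process variation. Each quantum, the
# thread furthest behind a mean-frequency budget gets the fastest core.
# Simplified stand-in (one runnable thread per core), not the paper's TDF.

def schedule_quantum(threads, core_freqs, quanta_run):
    """threads maps a thread name to its accumulated frequency credit
    (GHz * quanta). Returns {core_index: thread_name} for the next quantum."""
    mean_f = sum(core_freqs) / len(core_freqs)
    # How far each thread lags what running at the mean frequency would give.
    deficit = {t: mean_f * quanta_run - credit for t, credit in threads.items()}
    most_behind = sorted(threads, key=lambda t: deficit[t], reverse=True)
    fast_first = sorted(range(len(core_freqs)),
                        key=lambda c: core_freqs[c], reverse=True)
    assignment = dict(zip(fast_first, most_behind))
    for core, t in assignment.items():
        threads[t] += core_freqs[core]     # credit earned this quantum
    return assignment

core_freqs = [3.2, 3.0, 2.7, 2.5]          # GHz, variation-afflicted cores
threads = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0}
for q in range(8):
    schedule_quantum(threads, core_freqs, q)
print({t: round(credit / 8, 2) for t, credit in threads.items()})  # each ~2.85, the mean
```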
Citations: 40
Beyond block I/O: Rethinking traditional storage primitives
Pub Date: 2011-02-12 DOI: 10.1109/HPCA.2011.5749738
Xiangyong Ouyang, D. Nellans, Robert Wipfel, David Flynn, D. Panda
Over the last twenty years the interfaces for accessing persistent storage within a computer system have remained essentially unchanged. Simply put, seek, read and write have defined the fundamental operations that can be performed against storage devices. These three interfaces have endured because the devices within storage subsystems have not fundamentally changed since the invention of magnetic disks. Non-volatile (flash) memory (NVM) has recently become a viable enterprise grade storage medium. Initial implementations of NVM storage devices have chosen to export these same disk-based seek/read/write interfaces because they provide compatibility for legacy applications. We propose there is a new class of higher order storage primitives beyond simple block I/O that high performance solid state storage should support. One such primitive, atomic-write, batches multiple I/O operations into a single logical group that will be persisted as a whole or rolled back upon failure. By moving write-atomicity down the stack into the storage device, it is possible to significantly reduce the amount of work required at the application, filesystem, or operating system layers to guarantee the consistency and integrity of data. In this work we provide a proof of concept implementation of atomic-write on a modern solid state device that leverages the underlying log-based flash translation layer (FTL). We present an example of how database management systems can benefit from atomic-write by modifying the MySQL InnoDB transactional storage engine. Using this new atomic-write primitive we are able to increase system throughput by 33%, improve the 90th percentile transaction response time by 20%, and reduce the volume of data written from MySQL to the storage subsystem by as much as 43% on industry standard benchmarks, while maintaining ACID transaction semantics.
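Atomic-write groups multiple block writes so that they either all persist or none do, and a log-based FTL makes the rollback side essentially free: uncommitted log entries are simply ignored. The sketch below mimics that contract on a toy log-structured block store; the interface and class names are illustrative, not the device API or the MySQL/InnoDB changes described in the paper.

```python
# Toy log-structured block store with an atomic multi-block write, mimicking
# the contract of an atomic-write primitive on a log-based FTL. Interface
# names are illustrative only.

class LogStructuredStore:
    def __init__(self):
        self.log = []           # append-only log of (block_no, data, group_id)
        self.committed = set()  # groups whose commit record made it to the log

    def atomic_write(self, writes, group_id):
        """writes: list of (block_no, data). The group becomes visible only
        once its single commit record is appended; a crash before that point
        leaves the group uncommitted, i.e. rolled back."""
        for block_no, data in writes:
            self.log.append((block_no, data, group_id))
        self.log.append(("COMMIT", None, group_id))
        self.committed.add(group_id)

    def read(self, block_no):
        """Latest committed version wins; uncommitted entries are ignored."""
        for entry_block, data, group_id in reversed(self.log):
            if entry_block == block_no and group_id in self.committed:
                return data
        return None

store = LogStructuredStore()
store.atomic_write([(7, b"page7-v1"), (8, b"page8-v1")], group_id=1)
# Simulate a crash in the middle of a second group: data appended, no commit.
store.log.append((7, b"page7-torn", 2))
print(store.read(7))   # b'page7-v1' -- the torn, uncommitted write is invisible
```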
Citations: 138