
Latest publications: The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.

Beyond performance: some (other) challenges for systems design
E. Kronstadt
Traditionally we have focused on higher-performance semiconductor technology, higher-performance processors, memory systems, interconnect, and software; that is, until we began focusing on lower-power semiconductor technology, more power-efficient processors, and less power-hungry interconnect, all while trying to reduce the cost of each of these components. The point is that, despite the ever-expanding scope of our system approach, we have historically taken an essentially componentized view. This appears to be changing, as evidenced by recent additions to our technical vocabulary: "systems on a chip," "hardware-software codesign," "autonomic computing." All of these point to the possibility of a more holistic approach. We will examine this phenomenon to see whether there really is something new here, what is driving it, and what the consequences and challenges are.
Citations: 0
Microarchitecture and performance analysis of a SPARC-V9 microprocessor for enterprise server systems
M. Sakamoto, A. Katsuno, Aiichiro Inoue, T. Asakawa, H. Ueno, K. Morita, Yasunori Kimura
We developed a 1.3-GHz SPARC-V9 processor: the SPARC64 V. This processor is designed to address the requirements of enterprise servers and high-performance computing. Processing speed under multiuser interactive workloads is very sensitive to system balance because of the large number of memory requests involved. Drawing on many years of experience with such workloads in mainframe system development, we placed importance on designing a well-balanced communication structure. To accomplish this, a system-level performance study must begin at an early phase. We therefore developed a performance model, consisting of a detailed processor model and a detailed memory model, before hardware design started, and updated it continuously. Once a logic simulator became available, we used it to verify the performance model and improve its accuracy. The model quite effectively enabled us to achieve our performance goals and finish development quickly. This paper describes the SPARC64 V microarchitecture and the performance analyses used for hardware design.
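The abstract does not give the model's internals, but a first-order analytic form is a common starting point for this kind of pre-RTL study. Below is a minimal sketch, assuming a classic CPI decomposition into a core term plus a memory-stall term; the parameter names and values are illustrative, not the SPARC64 V model.

```python
# Illustrative sketch of an early-stage analytic performance model of the
# kind the abstract describes (processor model + memory model). All
# parameters below are assumptions for demonstration, not SPARC64 V data.

def effective_cpi(base_cpi: float,
                  accesses_per_instr: float,
                  l2_miss_rate: float,
                  l2_miss_penalty_cycles: float) -> float:
    """First-order CPI decomposition: core CPI plus memory-stall cycles."""
    memory_stall_cpi = accesses_per_instr * l2_miss_rate * l2_miss_penalty_cycles
    return base_cpi + memory_stall_cpi

# Multiuser interactive workloads issue many memory requests, so throughput
# is dominated by the memory term: here it contributes 1.6 of the 2.4 CPI,
# which is why system balance matters more than raw core speed.
print(effective_cpi(base_cpi=0.8, accesses_per_instr=0.4,
                    l2_miss_rate=0.02, l2_miss_penalty_cycles=200))  # 2.4
```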
Citations: 16
Catching accurate profiles in hardware
S. Narayanasamy, T. Sherwood, S. Sair, B. Calder, G. Varghese
Run-time optimization is one of the most important ways of getting performance out of modern processors. Techniques such as prefetching, trace caching, and memory disambiguation are all based on the principle of observation followed by adaptation, and all make use of some sort of profile information gathered at run-time. Programs are very complex, and the real trick in generating useful run-time profiles is sifting through all the unimportant and infrequently occurring events to find those that are important enough to warrant optimization. In this paper, we present the multi-hash architecture to catch important events even in the presence of extensive noise. Multi-hash uses a small amount of area, between 7 and 16 kilobytes, to accurately capture these important events in hardware, without requiring any software support. This is achieved using multiple hash tables for the filtering, and interval-based profiling to help identify how important an event is relative to all the other events. We evaluate our design for value and edge profiling, and show that over a set of benchmarks we get an average error of less than 1%.
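A software sketch of the core idea follows: several small hash tables updated in parallel, with the minimum counter across tables used as the estimate so that collision noise inflates at most some tables, plus an interval reset. The table sizes, hash function, and interval policy are assumptions for illustration, not the paper's hardware design.

```python
import hashlib

class MultiHashProfiler:
    """Minimal sketch of multi-hash event profiling (illustrative only)."""

    def __init__(self, num_tables: int = 4, table_size: int = 1024):
        self.tables = [[0] * table_size for _ in range(num_tables)]
        self.table_size = table_size

    def _slots(self, event: int):
        # Each table uses a differently salted hash of the event (e.g. a PC).
        for seed, table in enumerate(self.tables):
            h = hashlib.blake2b(event.to_bytes(8, "little"),
                                digest_size=4, salt=bytes([seed] * 16))
            yield table, int.from_bytes(h.digest(), "little") % self.table_size

    def record(self, event: int) -> None:
        for table, slot in self._slots(event):
            table[slot] += 1

    def estimate(self, event: int) -> int:
        # Taking the minimum over tables filters out most collision noise.
        return min(table[slot] for table, slot in self._slots(event))

    def end_interval(self) -> None:
        # Interval-based profiling: clear counters so an event's importance
        # is judged relative to the other events in the same interval.
        for table in self.tables:
            for i in range(len(table)):
                table[i] = 0
```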
Citations: 26
Memory system behavior of Java-based middleware
Martin Karlsson, Kevin E. Moore, Erik Hagersten, D. Wood
In this paper, we present a detailed characterization of the memory system behavior of ECperf and SPECjbb using both commercial server hardware and Simics full-system simulation. We find that the memory footprint and primary working sets of these workloads are small compared to other commercial workloads (e.g. on-line transaction processing), and that a large fraction of the working sets are shared between processors. We observed two key differences between ECperf and SPECjbb that highlight the importance of isolating the behavior of the middle tier. First, ECperf has a larger instruction footprint, resulting in much higher miss rates for intermediate-size instruction caches. Second, SPECjbb's data set size increases linearly as the benchmark scales up, while ECperf's remains roughly constant. This difference can lead to opposite conclusions on the design of multiprocessor memory systems, such as the utility of moderately sized (i.e. 1 MB) shared caches in a chip multiprocessor.
Citations: 63
Power-aware control speculation through selective throttling
Juan L. Aragón, José González, Antonio González
With constant advances in technology driving increases in transistor count and processor frequency, power dissipation is becoming one of the major issues in high-performance processors. These processors increase their clock frequency by lengthening the pipeline, which puts more pressure on the branch prediction engine since branches take longer to resolve. Branch mispredictions are responsible for around 28% of the power dissipated by a typical processor, owing to the useless work performed by instructions that are squashed. This work focuses on reducing the power dissipated by mis-speculated instructions. We propose selective throttling as an effective way of triggering different power-aware techniques (fetch throttling, decode throttling, or disabling the selection logic). The particular set of techniques applied to each branch is chosen dynamically depending on the branch prediction confidence level. For branches with low prediction confidence, the most aggressive throttling mechanism is used, whereas high-confidence branch predictions trigger the least aggressive techniques. Results show that combining fetch bandwidth reduction with select logic disabling provides the best performance in terms of both energy reduction and energy-delay improvement (14% and 9% respectively for 14 stages, and 17% and 12% respectively for 28 stages).
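A minimal sketch of the dispatch decision, assuming a three-level mapping from a branch-confidence estimate to a throttling action; the thresholds and the particular action chosen at each level are illustrative assumptions, whereas the paper selects among its techniques dynamically per branch.

```python
def throttle_level(confidence: float) -> str:
    """Map a branch-prediction confidence estimate to a power-saving action.

    Sketch only: the 0.4/0.8 thresholds and the one-action-per-level mapping
    are assumptions; the paper combines fetch throttling, decode throttling,
    and select-logic disabling per confidence level.
    """
    if confidence < 0.4:
        return "fetch_throttle"   # low confidence: most aggressive throttling
    if confidence < 0.8:
        return "decode_throttle"  # medium confidence: intermediate technique
    return "select_disable"       # high confidence: least aggressive technique

# Example: a weakly predicted branch gates the front end hardest.
assert throttle_level(0.3) == "fetch_throttle"
assert throttle_level(0.95) == "select_disable"
```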
Citations: 53
Cost-sensitive cache replacement algorithms
Jaeheon Jeong, M. Dubois
Cache replacement algorithms originally developed in the context of simple uniprocessor systems aim to reduce the miss count. However, in modern systems, cache misses have different costs. The cost may be latency, penalty, power consumption, bandwidth consumption, or any other ad-hoc numerical property attached to a miss. In many practical situations, it is desirable to inject the cost of a miss into the replacement policy. In this paper, we propose several extensions of LRU which account for nonuniform miss costs. These LRU extensions have simple implementations, yet they are very effective in various situations. We first explore the simple case of two static miss costs using trace-driven simulations to understand when cost-sensitive replacements are effective. We show that very large improvements of the cost function are possible in many practical cases. As an example of their effectiveness, we apply the algorithms to the second-level cache of a multiprocessor with superscalar processors, using the miss latency as the cost function. By applying our simple replacement policies sensitive to the latency of misses we can improve the execution time of some parallel applications by up to 18%.
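One way to inject miss cost into LRU can be sketched as below: search a few positions from the LRU end for a cheap-to-refetch victim, and fall back to strict LRU otherwise. This illustrates the general idea rather than the paper's specific extensions; the search depth and the cost threshold are assumptions.

```python
from collections import OrderedDict

class CostSensitiveLRU:
    """Sketch of a cost-aware LRU variant (fully associative, illustrative)."""

    def __init__(self, capacity: int, search_depth: int = 4,
                 cheap_cost: float = 1.0):
        self.capacity = capacity
        self.search_depth = search_depth  # how far past strict LRU we look
        self.cheap_cost = cheap_cost      # cost at or below which eviction is "free"
        self.blocks = OrderedDict()       # tag -> miss cost (e.g. latency)

    def access(self, tag, miss_cost: float) -> bool:
        """Returns True on a hit. miss_cost is this block's cost if re-fetched."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # promote to MRU position
            return True
        if len(self.blocks) >= self.capacity:
            self._evict()
        self.blocks[tag] = miss_cost
        return False

    def _evict(self) -> None:
        # Walk from the LRU end; evict the first block whose next miss
        # would be cheap, otherwise fall back to plain LRU.
        candidates = list(self.blocks)[: self.search_depth]
        for tag in candidates:
            if self.blocks[tag] <= self.cheap_cost:
                del self.blocks[tag]
                return
        del self.blocks[candidates[0]]
```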
Citations: 70
Dynamic optimization of micro-operations
Brian Slechta, David Crowe, Brian Fahs, M. Fertig, Gregory A. Muthler, Justin Quek, Francesco Spadini, Sanjay J. Patel, S. Lumetta
Inherent within complex instruction set architectures such as x86 are inefficiencies that do not exist in a simpler ISA. Modern x86 implementations decode instructions into one or more micro-operations in order to deal with the complexity of the ISA. Since these micro-operations are not visible to the compiler, the stream of micro-operations can contain redundancies even in statically optimized x86 code. Within a processor implementation, however, barriers at the ISA level do not apply, and these redundancies can be removed by optimizing the micro-operation stream. In this paper we explore the opportunities to optimize code at the micro-operation granularity. We execute these micro-operation optimizations using the rePLay Framework as a microarchitectural substrate. Using a simple set of seven optimizations, including two that aggressively and speculatively attempt to remove redundant load instructions, we examine the effects of dynamic optimization of micro-operations using a trace-driven simulation environment. Simulation reveals that across a sampling of SPECint 2000 and real x86 applications, rePLay is able to reduce micro-operation count by 21% and, in particular, load micro-operation count by 22%. These reductions correspond to a 17% boost in observed instruction-level parallelism on an 8-wide optimizing rePLay processor over a non-optimizing configuration.
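A minimal sketch of the flavor of redundancy targeted here: eliminating loads whose value is already known to be in a register, using exact-address matching over a micro-op trace. The micro-op encoding is an invented assumption, and the sketch omits the speculation and recovery machinery that rePLay provides.

```python
def remove_redundant_loads(uops):
    """Rewrite a micro-op trace, replacing redundant loads with register moves.

    uops are illustrative tuples: ('load', dst, addr), ('store', src, addr),
    ('alu', dst, src1, src2), ('barrier',). Aliasing is handled only by
    exact address equality; anything unrecognized conservatively clears state.
    """
    known = {}   # address -> register currently holding that memory value
    out = []
    for uop in uops:
        kind = uop[0]
        if kind == "store":
            _, src, addr = uop
            known[addr] = src            # memory value now mirrored in src
            out.append(uop)
        elif kind == "load":
            _, dst, addr = uop
            if addr in known:
                out.append(("mov", dst, known[addr]))   # load eliminated
            else:
                out.append(uop)
            # dst now holds addr's value; any older mapping into dst is stale.
            known = {a: r for a, r in known.items() if r != dst}
            known[addr] = dst
        elif kind == "alu":
            _, dst, *_ = uop
            known = {a: r for a, r in known.items() if r != dst}
            out.append(uop)
        else:
            known.clear()                # barriers, calls, unknown memory ops
            out.append(uop)
    return out

# Example: the second load of 0x100 becomes a cheap register move.
trace = [("store", "r1", 0x100), ("load", "r2", 0x100)]
assert remove_redundant_loads(trace)[1] == ("mov", "r2", "r1")
```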
Citations: 25
Front-end policies for improved issue efficiency in SMT processors
A. El-Moursy, D. Albonesi
The performance and power optimization of dynamic superscalar microprocessors requires striking a careful balance between exploiting parallelism and hardware simplification. Hardware structures which are needlessly complex may exacerbate critical timing paths and dissipate extra power. One such structure requiring careful design is the issue queue. In a simultaneous multi-threading (SMT) processor it is particularly challenging to achieve issue queue simplification due to the increased utilization of the queue afforded by multi-threading. In this paper we propose new front-end policies that reduce the required integer and floating point issue queue sizes in SMT processors. We explore both general policies as well as those directed towards alleviating a particular cause of issue queue inefficiency. For the same level of performance, the most effective policies reduce the issue queue occupancy by 33% for an SMT processor with appropriately sized issue queue resources.
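The paper's specific policies are not reproduced here, but the occupancy-driven front-end decision they refine can be sketched simply: fetch from the thread holding the fewest issue-queue entries, and gate any thread whose occupancy passes a threshold. The threshold value and the interface are illustrative assumptions.

```python
def pick_fetch_thread(queue_occupancy: dict[int, int],
                      gate_threshold: int = 24) -> int | None:
    """Sketch of an occupancy-based SMT fetch policy.

    queue_occupancy maps thread id -> issue-queue entries it currently holds.
    Returns the thread to fetch from this cycle, or None if all are gated.
    The gate_threshold of 24 entries is an assumption for illustration.
    """
    eligible = {tid: n for tid, n in queue_occupancy.items()
                if n < gate_threshold}
    if not eligible:
        return None                      # every thread is gated this cycle
    return min(eligible, key=eligible.get)

# Example: thread 1 is hogging the queue, so thread 0 gets the fetch slot.
assert pick_fetch_thread({0: 10, 1: 30}) == 0
```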
Citations: 108
Dynamic data replication: an approach to providing fault-tolerant shared memory clusters
Rosalia Christodoulopoulou, R. Azimi, A. Bilas
A challenging issue in today's server systems is to transparently deal with failures and application-imposed requirements for continuous operation. In this paper we address this problem in shared virtual memory (SVM) clusters at the programming abstraction layer. We design extensions to an existing SVM protocol that has been tuned for low-latency, high-bandwidth interconnects and SMP nodes and we achieve reliability through dynamic replication of application shared data and protocol information. Our extensions allow us to tolerate single (or multiple, but not simultaneous) node failures. We implement our extensions on a state-of-the-art cluster and we evaluate the common, failure-free case. We find that, although the complexity of our protocol is substantially higher than its failure-free counterpart, by taking advantage of architectural features of modern systems our approach imposes low overhead and can be employed for transparently dealing with system failures.
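A toy sketch of the dual-copy invariant: every shared page gets a home node plus one replica on a different node, and both copies are updated together, so a single node failure never destroys the only copy. The placement scheme and update path are assumptions; the real protocol also replicates SVM protocol metadata and handles recovery, which this omits.

```python
class ReplicatedStore:
    """Sketch of home + backup replication of shared pages (illustrative)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.mem = {n: {} for n in self.nodes}   # node -> {page: data}

    def _placement(self, page):
        # Hashed home node, backup on the next node: an assumed scheme.
        home = self.nodes[hash(page) % len(self.nodes)]
        backup = self.nodes[(self.nodes.index(home) + 1) % len(self.nodes)]
        return home, backup

    def write(self, page, data):
        for node in self._placement(page):
            self.mem[node][page] = data          # keep both copies current

    def read(self, page, failed=()):
        for node in self._placement(page):
            if node not in failed:
                return self.mem[node][page]
        # Home and backup both down: simultaneous failures are out of scope.
        raise RuntimeError("simultaneous node failures are not tolerated")

# Example: page survives the failure of its home node.
store = ReplicatedStore(["n0", "n1", "n2"])
store.write("pageA", b"shared data")
home, _ = store._placement("pageA")
assert store.read("pageA", failed=(home,)) == b"shared data"
```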
Citations: 15
Variability in architectural simulations of multi-threaded workloads
Alaa R. Alameldeen, D. Wood
Multi-threaded commercial workloads implement many important Internet services. Consequently, these workloads are increasingly used to evaluate the performance of uniprocessor and multiprocessor system designs. This paper identifies performance variability as a potentially major challenge for architectural simulation studies using these workloads. Variability refers to the differences between multiple estimates of a workload's performance. Time variability occurs when a workload exhibits different characteristics during different phases of a single run. Space variability occurs when small variations in timing cause runs starting from the same initial condition to follow widely different execution paths. Variability is a well-known phenomenon in real systems, but is nearly universally ignored in simulation experiments. In a central result of this paper we show that variability in multi-threaded commercial workloads can lead to incorrect architectural conclusions (e.g., 31% of the time in one experiment). We propose a methodology, based on multiple simulations and standard statistical techniques, to compensate for variability. Our methodology greatly reduces the probability of reaching incorrect conclusions, while enabling simulations to finish within reasonable time limits.
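The proposed methodology maps directly onto standard statistics: run each design point several times with perturbed initial conditions, then compare interval estimates of the mean rather than single-run numbers. A minimal sketch follows, assuming a normal-approximation confidence interval; the sample runtimes are invented for illustration.

```python
from statistics import mean, stdev

def conf_interval(samples, z=1.96):
    """~95% confidence interval for the mean (normal approximation)."""
    m = mean(samples)
    half = z * stdev(samples) / len(samples) ** 0.5
    return m - half, m + half

# Space variability in action: a single run of each design could favor
# either one, but the intervals overlap, so no architectural conclusion
# should be drawn without more runs. Runtimes below are invented.
design_a = [102.1, 98.7, 101.3, 99.9, 100.5]
design_b = [99.0, 101.8, 98.2, 100.9, 99.6]
print(conf_interval(design_a))
print(conf_interval(design_b))
```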
Citations: 274