
Conference Proceedings. The 24th Annual International Symposium on Computer Architecture: Latest Publications

Dynamic Speculation And Synchronization Of Data Dependence
Andreas Moshovos, S. E. Breach, T. N. Vijaykumar, G. Sohi
Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. Modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in a data dependence mis-speculation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented within the context of a Multiscalar processor.
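To make the idea of selective (rather than blind) speculation concrete, here is a minimal sketch of a PC-indexed mis-speculation predictor built from saturating counters. The table size, thresholds, and function names are our assumptions for illustration, not the paper's actual design, which additionally covers the synchronization mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 1024                 /* assumed predictor size */

static uint8_t mispred_ctr[TABLE_SIZE]; /* 2-bit saturating counters */

static unsigned pred_index(uint64_t load_pc) {
    return (unsigned)(load_pc >> 2) & (TABLE_SIZE - 1);
}

/* Speculate a load with unresolved dependences only if it has
 * rarely mis-speculated in the past. */
bool should_speculate(uint64_t load_pc) {
    return mispred_ctr[pred_index(load_pc)] < 2;
}

/* Train the predictor once the dependence is finally resolved. */
void train_predictor(uint64_t load_pc, bool mis_speculated) {
    uint8_t *c = &mispred_ctr[pred_index(load_pc)];
    if (mis_speculated) {
        if (*c < 3) (*c)++;             /* move toward "synchronize" */
    } else {
        if (*c > 0) (*c)--;             /* move toward "speculate" */
    }
}
```

A load would consult should_speculate() at issue time and wait for its predecessors (synchronize) whenever the predictor says past speculation went wrong.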
{"title":"Dynamic Speculation And Synchronization Of Data Dependence","authors":"Andreas Moshovos, S. E. Breach, T. N. Vijaykumar, G. Sohi","doi":"10.1145/264107.264189","DOIUrl":"https://doi.org/10.1145/264107.264189","url":null,"abstract":"Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. The modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in a data dependence mis-specalation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented within the context of a Multiscalar processor.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123075581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 114
Hardware Fault Containment In Scalable Shared-memory Multiprocessors
D. Teodosiu, J. Baxter, Kinshuk Govil, J. Chapin, M. Rosenblum, M. Horowitz
Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size. The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine. Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors.
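As a loose, sequential sketch of the containment step only (the paper's algorithm is distributed and cooperates with the Hive operating system), the code below marks one fault-containment cell as failed and reclaims the shared resources it owned so the surviving cells can resume normal operation. All names, sizes, and structures here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NCELLS 8                        /* assumed cell count */
#define NPAGES 4096

typedef struct { int owner_cell; bool usable; } page_t;

static bool   cell_failed[NCELLS];
static page_t pages[NPAGES];            /* toy shared-resource table */

/* Contain a hardware fault to one cell: mark it dead and invalidate
 * every shared resource that depended on it. In the real system this
 * walk runs as a distributed algorithm across the surviving nodes. */
void contain_fault(int failed_cell) {
    cell_failed[failed_cell] = true;
    for (size_t i = 0; i < NPAGES; i++)
        if (pages[i].owner_cell == failed_cell)
            pages[i].usable = false;
    /* Surviving cells now continue in normal mode with what remains. */
}
```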
{"title":"Hardware Fault Containment In Scalable Shared-memory Multiprocessors","authors":"D. Teodosiu, J. Baxter, Kinshuk Govil, J. Chapin, M. Rosenblum, M. Horowitz","doi":"10.1145/264107.264141","DOIUrl":"https://doi.org/10.1145/264107.264141","url":null,"abstract":"Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size.The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine.Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121476876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 45
Memory-system Design Considerations For Dynamically-scheduled Processors
K. Farkas, P. Chow, N. Jouppi, Z. Vranesic
In this paper, we identify performance trends and design relationships between the following components of the data memory hierarchy in a dynamically-scheduled processor: the register file, the lockup-free data cache, the stream buffers, and the interface between these components and the lower levels of the memory hierarchy. Similar performance was obtained from all systems having support for fewer than four in-flight misses, irrespective of the register-file size, the issue width of the processor, and the memory bandwidth. While providing support for more than four in-flight misses did increase system performance, the improvement was less than that obtained by increasing the number of registers. The addition of stream buffers to the investigated systems led to a significant performance increase, with the larger increases for systems having less in-flight-miss support, greater memory bandwidth, or more instruction issue capability. The performance of these systems was not significantly affected by the inclusion of traffic filters, dynamic-stride calculators, or the inclusion of the per-load non-unity stride-predictor and the incremental-prefetching techniques, which we introduce. However, the incremental prefetching technique reduces the bandwidth consumed by stream buffers by 50% without a significant impact on performance.
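The incremental-prefetching result is easiest to see in code: instead of eagerly filling a stream buffer at allocation, fetch one line on the triggering miss and one more per subsequent hit, so prefetch bandwidth tracks demonstrated usefulness. This sketch is our reading of the technique; the depth, the fields, and the issue_prefetch hook are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEPTH 4                       /* assumed stream-buffer depth */

typedef struct {
    uint64_t head;                    /* line address at the buffer head */
    uint64_t next_fetch;              /* next line address to prefetch */
    int      filled;                  /* lines currently held */
    bool     active;
} stream_buf_t;

/* Hand a prefetch to the memory system (stubbed for this sketch). */
static void issue_prefetch(uint64_t line) { (void)line; }

/* On a cache miss, allocate the buffer but fetch just ONE line,
 * instead of eagerly filling all DEPTH entries. */
void on_cache_miss(stream_buf_t *sb, uint64_t miss_line) {
    sb->active = true;
    sb->head = miss_line + 1;
    sb->next_fetch = sb->head;
    issue_prefetch(sb->next_fetch++);
    sb->filled = 1;
}

/* A hit consumes the head entry and incrementally prefetches one
 * more line, extending only streams that keep proving useful. */
bool on_lookup(stream_buf_t *sb, uint64_t line) {
    if (!sb->active || sb->filled == 0 || line != sb->head)
        return false;
    sb->head++;
    sb->filled--;
    issue_prefetch(sb->next_fetch++);
    sb->filled++;
    return true;
}
```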
{"title":"Memory-system Design Considerations For Dynamically-scheduled Processors","authors":"K. Farkas, P. Chow, N. Jouppi, Z. Vranesic","doi":"10.1145/264107.264156","DOIUrl":"https://doi.org/10.1145/264107.264156","url":null,"abstract":"In this paper, we identify performance trends and design relationships between the following components of the data memory hierarchy in a dynamically-scheduled processor: the register file, the lockup-free data cache, the stream buffers, and the interface between these components and the lower levels of the memory hierarchy. Similar performance was obtained from all systems having support for fewer than four in-flight misses, irrespective of the register-file size, the issue width of the processor, and the memory bandwidth. While providing support for more than four in-flight misses did increase system performance, the improvement was less than that obtained by increasing the number of registers. The addition of stream buffers to the investigated systems led to a significant performance increase, with the larger increases for systems having less in-flight-miss support, greater memory bandwidth, or more instruction issue capability. The performance of these systems was not significantly affected by the inclusion of traffic filters, dynamic-stride calculators, or the inclusion of the per-load non-unity stride-predictor and the incremental-prefetching techniques, which we introduce. However, the incremental prefetching technique reduces the bandwidth consumed by stream buffers by 50% without a significant impact on performance.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116948490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 120
DAISY: Dynamic Compilation For 100% Architectural Compatibility
K. Ebcioglu, E. Altman
Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.
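The translate-once flow can be pictured as a translation cache keyed by old-architecture PCs: the first execution of a fragment invokes the translator, and later executions jump straight to the cached VLIW code. The sketch below is our illustration with a stubbed translator; DAISY's real VMM also handles parallelization, self-modifying code, and cast-outs.

```c
#include <stddef.h>
#include <stdint.h>

#define TCACHE_SIZE 4096

typedef void (*fragment_fn)(void);    /* a translated VLIW fragment */

typedef struct { uint64_t guest_pc; fragment_fn code; } tentry_t;

static tentry_t tcache[TCACHE_SIZE];  /* hidden from the old architecture */

/* Stand-in translator: the real VMM translates the fragment into
 * parallelized VLIW primitives; here we just return a no-op. */
static void nop_fragment(void) {}
static fragment_fn translate_fragment(uint64_t guest_pc) {
    (void)guest_pc;
    return nop_fragment;
}

/* Execute old-architecture code at guest_pc: translate on first use,
 * reuse the cached translation afterwards (unless cast out). */
void execute_at(uint64_t guest_pc) {
    tentry_t *e = &tcache[(guest_pc >> 2) % TCACHE_SIZE];
    if (e->code == NULL || e->guest_pc != guest_pc) {
        e->guest_pc = guest_pc;
        e->code = translate_fragment(guest_pc);
    }
    e->code();
}
```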
{"title":"Daisy: Dynamic Compilation For 10o?40 Architectural Compatibility","authors":"K. Ebcioglu, E. Altman","doi":"10.1145/264107.264126","DOIUrl":"https://doi.org/10.1145/264107.264126","url":null,"abstract":"Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129648481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 400
Efficient Synchronization: Let Them Eat QOLB
A. Kägi, Doug Burger, J. Goodman
Efficient synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of synchronization primitives is to enable exclusive access to shared data and critical sections of code. This paper makes three contributions. (1) We enumerate the five sources of overhead that locking synchronization primitives can incur. (2) We describe four mechanisms (local spinning, queue-based locking, collocation, and synchronized prefetch) that reduce these synchronization overheads. (3) With detailed simulations, we show the extent to which these four mechanisms can improve the performance of shared-memory programs. We evaluate the space of these mechanisms using seventeen synchronization constructs, which are formed from six base types of locks (TEST&SET, TEST&TEST&SET, MCS, LH, M, and QOLB). We show that large performance gains (speedups of more than 1.5 for three of five benchmarks) can be achieved if at least three optimizing mechanisms are used simultaneously. We find that QOLB, which incorporates all four mechanisms, outperforms all other primitives (including reactive synchronization) in all cases. Finally, we demonstrate the superior performance of a low-cost implementation of QOLB, which runs on an unmodified cluster of commodity workstations.
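QOLB itself relies on hardware support in the coherence protocol, but one of the six base locks the paper evaluates, the MCS lock, can be written portably and already exhibits two of the four mechanisms: queue-based locking and local spinning, since each waiter spins on a flag in its own queue node rather than on a shared location. A minimal C11 sketch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node_t;

typedef _Atomic(mcs_node_t *) mcs_lock_t;   /* tail of the waiter queue */

void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node_t *pred = atomic_exchange(lock, me);  /* join queue at tail */
    if (pred != NULL) {
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked))           /* spin locally only */
            ;
    }
}

void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(lock, &expected, NULL))
            return;                                /* queue is now empty */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                      /* successor enqueuing */
    }
    atomic_store(&succ->locked, false);            /* hand the lock off */
}
```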
{"title":"Efficient Synchronization: Let Them Eat QOLB /sup1/","authors":"A. Kägi, Doug Burger, J. Goodman","doi":"10.1145/264107.264166","DOIUrl":"https://doi.org/10.1145/264107.264166","url":null,"abstract":"Efficient synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of synchronization primitives is to enable exclusive access to shared data and critical sections of code. This paper makes three contributions. (1) We enumerate the five sources of overhead that locking synchronization primitives can incur. (2) We describe four mechanisms (local spinning, queue-based locking, collocation, and synchronized prefetch) that reduce these synchronization overheads. (3) With detailed simulations, we show the extent to which these four mechanisms can improve the performance of shared-memory programs. We evaluate the space of these mechanisms using seventeen synchronization constructs, which are formed from six base typed of locks (TEST&SET, TEST&TEST&SET, MCS, LH, M, and QOLB). We show that large performance gains (speedups of more than 1.5 for three of five benchmarks) can be achieved if at least three optimizing mechanisms are used simultaneously. We find that QOLB, which incorporates all four mechanisms, outperforms all other primitives (including reactive synchronization) in all cases. Finally, we demonstrate the superior performance of a low-cost implementation of QOLB, which runs on an unmodified cluster of commodity workstations.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127380543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 42
Designing High Bandwidth On-chip Caches
Kenneth M. Wilson, K. Olukotun
In this paper we evaluate the performance of high bandwidth caches that employ multiple ports, multiple cycle hit times, on-chip DRAM, and a line buffer to find the organization that provides the best processor performance. Processor performance is measured in execution time using a dynamic superscalar processor running realistic benchmarks that include operating system references. The results show that a large dual-ported multi-cycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and the use of a line buffer provide the bandwidth needed by a dynamic superscalar processor. In addition, the line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency.
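The role of the line buffer can be sketched as a one-line filter in front of the cache: repeated accesses to the most recently read line are served without consuming a cache port, which is where the extra bandwidth and hidden latency come from. The code below is illustrative only, with an assumed 32-byte line size and a stubbed cache probe.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SHIFT 5                  /* 32-byte lines (assumed) */

typedef struct { uint64_t tag; bool valid; } line_buf_t;

/* Probe of the pipelined SRAM array (stubbed for this sketch). */
static bool cache_lookup(uint64_t line) { (void)line; return true; }

/* Front-end access path: a hit in the one-line buffer is served
 * without consuming a cache port; only misses in the buffer go to
 * the (dual-ported, multi-cycle pipelined) cache itself. */
bool access_cache(line_buf_t *lb, uint64_t addr) {
    uint64_t line = addr >> LINE_SHIFT;
    if (lb->valid && lb->tag == line)
        return true;                  /* line-buffer hit: no port used */
    bool hit = cache_lookup(line);    /* uses one of the cache ports */
    if (hit) {
        lb->tag = line;               /* latch most recently read line */
        lb->valid = true;
    }
    return hit;
}
```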
{"title":"Designing High Bandwidth On-chip Caches","authors":"Kenneth M. Wilson, K. Olukotun","doi":"10.1145/264107.264153","DOIUrl":"https://doi.org/10.1145/264107.264153","url":null,"abstract":"In this paper we evaluate the performance of high bandwidth caches that employ multiple ports, multiple cycle hit times, on-chip DRAM, and a line buffer to find the organization that provides the best processor performance. Processor performance is measured in execution time using a dynamic superscalar processor running realistic benchmarks that include operating system references. The results show that a large dual-ported multi-cycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and the use of a line buffer provide the bandwidth needed by a dynamic superscalar processor. In addition, the line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127485056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 48
Complexity-Effective Superscalar Processors
Subbarao Palacharla, N. Jouppi, James E. Smith
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8µm, 0.35µm, and 0.18µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster; consequently, overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.
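A rough software model of the dependence-based steering the abstract describes: each renamed instruction is appended to the FIFO whose tail produced one of its source registers, so dependence chains line up in a single queue and the select logic only needs to examine queue heads. The sizes and the steering heuristic below are our simplifications, not the paper's exact rules.

```c
#include <string.h>

#define NQUEUES 8
#define QDEPTH  16
#define NREGS   64

typedef struct { int dest, src1, src2; } insn_t;   /* -1 = unused reg */

typedef struct { insn_t q[QDEPTH]; unsigned head, tail; } fifo_t;

static fifo_t fifos[NQUEUES];
static int producer_queue[NREGS];     /* FIFO whose tail writes reg r */

void steer_init(void) {
    memset(fifos, 0, sizeof fifos);
    memset(producer_queue, -1, sizeof producer_queue);
}

/* Steer an instruction behind its producer so whole dependence
 * chains queue up in order; otherwise take an empty FIFO. */
static int steer(const insn_t *in) {
    if (in->src1 >= 0 && producer_queue[in->src1] >= 0)
        return producer_queue[in->src1];
    if (in->src2 >= 0 && producer_queue[in->src2] >= 0)
        return producer_queue[in->src2];
    for (int i = 0; i < NQUEUES; i++)
        if (fifos[i].head == fifos[i].tail)
            return i;
    return 0;                         /* a real machine would stall here */
}

void dispatch(const insn_t *in) {
    int qi = steer(in);
    fifo_t *f = &fifos[qi];
    f->q[f->tail++ % QDEPTH] = *in;
    if (in->dest >= 0)
        producer_queue[in->dest] = qi;
}
```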
{"title":"Complexity-Effective Superscalar Processors","authors":"Subbarao Palacharla, N. Jouppi, James E. Smith","doi":"10.1145/264107.264201","DOIUrl":"https://doi.org/10.1145/264107.264201","url":null,"abstract":"The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8µm, 0.35µm, and 0.18µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future.A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster --- consequently overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114267300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 921
Run-time Adaptive Cache Hierarchy Via Reference Analysis
Teresa L. Johnson, Wen-mei W. Hwu
Improvements in main memory speeds have not kept pace with increasing processor clock frequency and improved exploitation of instruction-level parallelism. Consequently, the gap between processor and main memory performance is expected to grow, increasing the number of execution cycles spent waiting for memory accesses to complete. One solution to this growing problem is to reduce the number of cache misses by increasing the effectiveness of the cache hierarchy. In this paper we present a technique for dynamic analysis of program data access behavior, which is then used to proactively guide the placement of data within the cache hierarchy in a location-sensitive manner. We introduce the concept of a macroblock, which allows us to feasibly characterize the memory locations accessed by a program, and a Memory Address Table, which performs the dynamic reference analysis. Our technique is fully compatible with existing Instruction Set Architectures. Results from detailed simulations of several integer programs show significant speedups.
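A sketch of how a Memory Address Table (MAT) might guide placement, as we read the abstract: accesses are grouped by macroblock, a per-macroblock saturating counter records whether the region shows reuse, and blocks from low-reuse macroblocks bypass the first-level cache. Macroblock size, table size, and thresholds are our assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define MB_SHIFT   10          /* 1 KB macroblocks (assumed) */
#define MAT_SIZE   1024
#define REUSE_MAX  3

static uint8_t mat[MAT_SIZE];  /* per-macroblock reuse counters */

static unsigned mat_index(uint64_t addr) {
    return (unsigned)(addr >> MB_SHIFT) & (MAT_SIZE - 1);
}

/* Update on every access; l1_hit indicates the macroblock showed reuse. */
void mat_update(uint64_t addr, bool l1_hit) {
    uint8_t *c = &mat[mat_index(addr)];
    if (l1_hit) {
        if (*c < REUSE_MAX) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}

/* Placement decision: allocate in L1 only if the macroblock has a
 * history of reuse; otherwise bypass so it cannot displace useful data. */
bool place_in_l1(uint64_t addr) {
    return mat[mat_index(addr)] >= 2;
}
```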
{"title":"Run-time Adaptive Cache Hierarchy Via Reference Analysis","authors":"Teresa L. Johnson, Wen-mei W. Hwu","doi":"10.1145/264107.264213","DOIUrl":"https://doi.org/10.1145/264107.264213","url":null,"abstract":"Improvements in main memory speeds have not kept pace with increasing processor clock frequency and improved exploitation of instruction-level parallelism. Consequently, the gap between processor and main memory performance is expected to grow, increasing the number of execution cycles spent waiting for memory accesses to complete. One solution to this growing problem is to reduce the number of cache misses by increasing the effectiveness of the cache hierarchy. In this paper we present a technique for dynamic analysis of program data access behavior, which is then used to proactively guide the placement of data within the cache hierarchy in a location-sensitive manner. We introduce the concept of a macroblock, which allows us to feasibly characterize the memory locations accessed by a program, and a Memory Address Table, which performs the dynamic reference analysis. Our technique is fully compatible with existing Instruction Set Architectures. Results from detailed simulations of several integer programs show significant speedups.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117267250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 66
The SGI Origin: A ccNUMA Highly Scalable Server
J. Laudon, D. Lenoski
The SGI Origin 2000 is a cache-coherent non-uniform memory access (ccNUMA) multiprocessor designed and manufactured by Silicon Graphics, Inc. The Origin system was designed from the ground up as a multiprocessor capable of scaling to both small and large processor counts without any bandwidth, latency, or cost cliffs. The Origin system consists of up to 512 nodes interconnected by a scalable Craylink network. Each node consists of one or two R10000 processors, up to 4 GB of coherent memory, and a connection to a portion of the XIO IO subsystem. This paper discusses the motivation for building the Origin 2000 and then describes its architecture and implementation. In addition, performance results are presented for the NAS Parallel Benchmarks V2.2 and the SPLASH2 applications. Finally, the Origin system is compared to other contemporary commercial ccNUMA systems.
{"title":"The SGI Origin: A ccnuma Highly Scalable Server","authors":"J. Laudon, D. Lenoski","doi":"10.1145/264107.264206","DOIUrl":"https://doi.org/10.1145/264107.264206","url":null,"abstract":"The SGI Origin 2000 is a cache-coherent non-uniform memory access (ccNUMA) multiprocessor designed and manufactured by Silicon Graphics, Inc. The Origin system was designed from the ground up as a multiprocessor capable of scaling to both small and large processor counts without any bandwidth, latency, or cost cliffs. The Origin system consists of up to 512 nodes interconnected by a scalable Craylink network. Each node consists of one or two R10000 processors, up to 4 GB of coherent memory, and a connection to a portion of the XIO IO subsystem. This paper discusses the motivation for building the Origin 2000 and then describes its architecture and implementation. In addition, performance results are presented for the NAS Parallel Benchmarks V2.2 and the SPLASH2 applications. Finally, the Origin system is compared to other contemporary commercial ccNUMA systems.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132133315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 864
The Interaction Of Software Prefetching With Ilp Processors In Shared-memory Systems
Parthasarathy Ranganathan, Vijay S. Pai, Hazim Abdel-Shafi, S. Adve
Current microprocessors aggressively exploit instruction-level parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and non-blocking reads. Recent work has shown that memory latency remains a significant performance bottleneck for shared-memory multiprocessor systems built of such processors. This paper provides the first study of the effectiveness of software-controlled non-binding prefetching in shared-memory multiprocessors built of state-of-the-art ILP-based processors. We find that software prefetching results in significant reductions in execution time (12% to 31%) for three out of five applications on an ILP system. However, compared to previous-generation systems, software prefetching is significantly less effective in reducing the memory stall component of execution time on an ILP system. Consequently, even after adding software prefetching, memory stall time accounts for over 30% of the total execution time in four out of five applications on our ILP system. This paper also investigates the interaction of software prefetching with memory consistency models on ILP-based multiprocessors. In particular, we seek to determine whether software prefetching can equalize the performance of sequential consistency (SC) and release consistency (RC). We find that even with software prefetching, for three out of five applications, RC provides a significant reduction in execution time (15% to 40%) compared to SC.
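As a concrete example of software-controlled non-binding prefetching, a compiler (or programmer) can issue prefetch hints a fixed distance ahead of the demand accesses. The GCC/Clang intrinsic __builtin_prefetch is such a non-binding hint; the distance of 16 elements below is an assumed tuning choice, not a value from the paper.

```c
#include <stddef.h>

/* Sum an array while prefetching 16 elements ahead so memory latency
 * overlaps with the additions. The hint is non-binding: the hardware
 * may drop it without affecting correctness. */
double sum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/3);
        s += a[i];
    }
    return s;
}
```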
{"title":"The Interaction Of Software Prefetching With Ilp Processors In Shared-memory Systems","authors":"Parthasarathy Ranganathan, Vijay S. Pai, Hazim Abdel-Shafi, S. Adve","doi":"10.1145/264107.264158","DOIUrl":"https://doi.org/10.1145/264107.264158","url":null,"abstract":"Current microprocessors aggressively exploit instruction-level parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and non-blocking reads. Recent work has shown that memory latency remains a significant performance bottleneck for shared-memory multiprocessor systems built of such processors.This paper provides the first study of the effectiveness of software-controlled non-binding prefetching in shared memory multiprocessors built of state-of-the-art ILP-based processors. We find that software prefetching results in significant reductions in execution time (12% to 31%) for three out of five applications on an ILP system. However, compared to previous-generation system, software prefetching is significantly less effective in reducing the memory stall component of execution time on an ILP system. Consequently, even after adding software prefetching, memory stall time accounts for over 30% of the total execution time in four out of five applications on our ILP system.This paper also investigates the interaction of software prefetching with memory consistency models on ILP-based multiprocessors. In particular, we seek to determine whether software prefetching can equalize the performance of sequential consistency (SC) and release consistency (RC). We find that even with software prefetching, for three out of five applications, RC provides a significant reduction in execution time (15% to 40%) compared to SC.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121818501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 42