Parallel machines have the potential to satisfy the large computational demands of emerging real-time applications. These applications require a predictable communication network, where time-constrained traffic requires bounds on latency or throughput while good average performance suffices for best-effort packets. This paper presents a router architecture that tailors low-level routing, switching, arbitration and flow-control policies to the conflicting demands of each traffic class. The router implements deadline-based scheduling, with packet switching and table-driven multicast routing, to bound end-to-end delay for time-constrained traffic, while allowing best-effort traffic to capitalize on the low-latency routing and switching schemes common in modern parallel machines. To limit the cost of servicing time-constrained traffic, the router shares packet buffers and link-scheduling logic between the multiple output ports. Verilog simulations demonstrate that the design meets the performance goals of both traffic classes in a single-chip solution.
{"title":"A Router Architecture for Real-Time Point-to-Point Networks","authors":"J. Rexford, J. Hall, K. Shin","doi":"10.1145/232973.232998","DOIUrl":"https://doi.org/10.1145/232973.232998","url":null,"abstract":"Parallel machines have the potential to satisfy the large computational demands of emerging real-time applications. These applications require a predictable communication network, where time-constrained traffic requires bounds on latency or throughput while good average performance suffices for best-effort packets. This paper presents a router architecture that tailors low-level routing, switching, arbitration and flow-control policies to the conflicting demands of each traffic class. The router implements deadline-based scheduling, with packet switching and table-driven multicast routing, to bound end-to-end delay for time-constrained traffic, while allowing best-effort traffic to capitalize on the low-latency routing and switching schemes common in modern parallel machines. To limit the cost of servicing time-constrained traffic, the router shares packet buffers and link-scheduling logic between the multiple output ports. Verilog simulations demonstrate that the design meets the performance goals of both traffic classes in a single-chip solution.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130035805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Eickemeyer, Ross E. Johnson, S. Kunkel, M. Squillante, Shiafun Liu
As memory speeds grow at a considerably slower rate than processor speeds, memory accesses are starting to dominate the execution time of processors, and this will likely continue into the future. This trend will be exacerbated by growing miss rates due to commercial applications, object-oriented programming and micro-kernel based operating systems. We examine the use of coarse-grained multithreading to address this important problem in uniprocessor on-line transaction processing environments where there is a natural, coarse-grained parallelism between the tasks resulting from transactions being executed concurrently, with no application software modifications required. Our results suggest that multithreading can provide significant performance improvements for uniprocessor commercial computing environments.
{"title":"Evaluation of Multithreaded Uniprocessors for Commercial Application Environments","authors":"R. Eickemeyer, Ross E. Johnson, S. Kunkel, M. Squillante, Shiafun Liu","doi":"10.1145/232973.232994","DOIUrl":"https://doi.org/10.1145/232973.232994","url":null,"abstract":"As memory speeds grow at a considerably slower rate than processor speeds, memory accesses are starting to dominate the execution time of processors, and this will likely continue into the future. This trend will be exacerbated by growing miss rates due to commercial applications, object-oriented programming and micro-kernel based operating systems. We examine the use of coarse-grained multithreading to address this important problem in uniprocessor on-line transaction processing environments where there is a natural, coarse-grained parallelism between the tasks resulting from transactions being executed concurrently, with no application software modifications required. Our results suggest that multithreading can provide significant performance improvements for uniprocessor commercial computing environments.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126157311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such machines have been well explored. Recently, a number of researchers have presented architectural techniques for scaling a cache-coherent shared address space to much larger processor counts. In this paper, we examine the extent to which applications can achieve reasonable performance on such large-scale, cache-coherent, distributed shared address space machines, by determining the problem sizes needed to achieve a reasonable level of efficiency. We also look at how much programming effort and optimization is needed to achieve high efficiency, beyond that needed at small processor counts. For each application, we discuss the main architectural bottlenecks that prevent smaller problem sizes or less optimized programs from achieving good efficiency. Our results show that while some applications either do not scale or must be heavily optimized to do so, for most of the applications we studied it is not necessary to heavily modify the code or restructure algorithms to scale well up to several hundred processors, once the basic load-balancing and data-locality techniques that are also needed for small-scale systems are applied. Programs written with some care perform well without substantially compromising the ease-of-programming advantage of a shared address space, and the problem sizes required to achieve good performance are surprisingly small. It is important to be careful about how data structures and layouts interact with system granularities, but these optimizations are usually needed for moderate-scale machines as well.
{"title":"Application and Architectural Bottlenecks in Large Scale Distributed Shared Memory Machines","authors":"Chris Holt, Jaswinder Pal Singh, J. Hennessy","doi":"10.1145/232973.232988","DOIUrl":"https://doi.org/10.1145/232973.232988","url":null,"abstract":"Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such machines have been well explored. Recently, a number of researchers have presented architectural techniques for scaling a cache coherent shared address space to much larger processor counts. In this paper, we examine the extent to which applications can achieve reasonable performance on such large-scale, cache-coherent, distributed shared address space machines, by determining the problems sizes needed to achieve a reasonable level of efficiency. We also look at how much programming effort and optimization is needed to achieve high efficiency, beyond that needed at small processor counts. For each application, we discuss the main architectural bottlenecks that prevent smaller problem sizes or less optimized programs from achieving good efficiency. Our results show that while there are some applications that either do not scale or must be heavily optimized to do so, for most of the applications we studied it is not necessary to heavily modify the code or restructure algorithms to scale well upto several hundred processors, once the basic techniques for load balancing and data locality are used that are needed for small-scale systems as well. Programs written with some care perform well without substantially compromising the ease of programming advantage of a shared address space, and the problem sizes required to achieve good performance are surprisingly small. It is important to be careful about how data structures and layouts interact with system granularities, but these optimizations are usually needed for moderate-scale machines as well.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129696582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Horowitz, M. Martonosi, T. Mowry, Michael D. Smith
Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, we propose a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. We describe two different implementations of informing memory operations---one based on a cache-outcome condition code and another based on low-overhead traps---and find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations.
{"title":"Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors","authors":"M. Horowitz, M. Martonosi, T. Mowry, Michael D. Smith","doi":"10.1145/232973.233000","DOIUrl":"https://doi.org/10.1145/232973.233000","url":null,"abstract":"Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, we propose a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. We describe two different implementations of informing memory operations---one based on a cache-outcome condition code and another based on low-overhead traps---and find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130131666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm
Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor, for fetch and issue, those threads that are using the processor most efficiently each cycle, thereby providing the "best" instructions to the processor.
{"title":"Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor","authors":"D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm","doi":"10.1145/232973.232993","DOIUrl":"https://doi.org/10.1145/232973.232993","url":null,"abstract":"Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the \"best\" instructions to the processor.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122302975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we study a hardware-supported, compiler-directed (HSCD) cache coherence scheme, which can be implemented on a large-scale multiprocessor using off-the-shelf microprocessors, such as the Cray T3D. It can be adapted to various cache organizations, including multi-word cache lines and byte-addressable architectures. Several system-related issues, including critical sections, inter-thread communication, and task migration, have also been addressed. The cost of the required hardware support is small and proportional to the cache size. The necessary compiler algorithms, including intra- and interprocedural array data-flow analysis, have been implemented on the Polaris compiler [17]. From our simulation study using the Perfect Club benchmarks, we found that, in spite of the conservative analysis made by the compiler, the performance of the proposed HSCD scheme can be comparable to that of a full-map hardware directory scheme. With its comparable performance and reduced hardware cost, the scheme can be a viable alternative for large-scale multiprocessors, such as the Cray T3D, that rely on users to maintain data coherence.
{"title":"Compiler and Hardware Support for Cache Coherence in Large-Scale Multiprocessors: Design Considerations and Performance Study","authors":"L. Choi, P. Yew","doi":"10.1145/232973.233002","DOIUrl":"https://doi.org/10.1145/232973.233002","url":null,"abstract":"In this paper, we study a hardware-supported, compiler directed (HSCD) cache coherence scheme, which can be implemented on a large-scale multiprocessor using off-the-shelf microprocessors, such as the Cray T3D. It can be adapted to various cache organizations, including multi-word cache lines and byte-addressable architectures. Several system related issues, including critical sections, inter-thread communication, and task migration have also been addressed. The cost of the required hardware support is small and proportional to the cache size. The necessary compiler algorithms, including intra- and interprocedural array data-flow analysis, have been implemented on the Polaris compiler [17].From our simulation study using the Perfect Club benchmarks, we found that, in spite of the conservative analysis made by the compiler, the performance of the proposed HSCD scheme can be comparable to that of a full-map hardware directory scheme. With its comparable performance and reduced hardware cost, the scheme can be a viable alternative for large-scale multiprocessors, such as the Cray T3D, that rely on users to maintain data coherence.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"2017 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132776919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pipeline stalls due to conditional branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined, superscalar processors. Many branch predictors have been proposed to help alleviate this problem, including the Two-Level Adaptive Branch Predictor and, more recently, two-component hybrid branch predictors. In a less idealized environment, such as a time-shared system, the code of interest runs in the presence of context switches. Context switches, even at fairly large intervals, can seriously degrade the performance of many of the most accurate branch prediction schemes. In this paper, we introduce a new hybrid branch predictor and show that it is more accurate (for a given cost) than any previously published scheme, especially if the branch histories are periodically flushed due to the presence of context switches.
{"title":"Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches","authors":"M. Evers, Po-Yung Chang, Y. Patt","doi":"10.1145/232973.232975","DOIUrl":"https://doi.org/10.1145/232973.232975","url":null,"abstract":"Pipeline stalls due to conditional branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined, superscalar processors. Many branch predictors have been proposed to help alleviate this problem, including the Two-Level Adaptive Branch Predictor, and more recently, two-component hybrid branch predictors.In a less idealized environment, such as a time-shared system, code of interest involves context switches. Context switches, even at fairly large intervals, can seriously degrade the performance of many of the most accurate branch prediction schemes. In this paper, we introduce a new hybrid branch predictor and show that it is more accurate (for a given cost) than any previously published scheme, especially if the branch histories are periodically flushed due to the presence of context switches.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123345269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-performing on-chip instruction caches are crucial to keeping fast processors busy. Unfortunately, while on-chip caches are usually successful at intercepting instruction fetches in loop-intensive engineering codes, they are less able to do so in large systems codes. To improve the performance of the latter codes, the compiler can be used to lay out the code in memory for reduced cache conflicts. Interestingly, such an operation leaves the code in a state that can be exploited by a new type of instruction prefetching: guarded sequential prefetching. The idea is that the compiler leaves hints in the code as to how the code was laid out. Then, at run time, the prefetching hardware detects these hints and uses them to prefetch more effectively. This scheme can be implemented very cheaply: one bit encoded in control-transfer instructions and a prefetch module that requires minor extensions to existing next-line sequential prefetchers. Furthermore, the scheme can be turned off and on at run time with the toggling of a bit in the TLB. The scheme is evaluated with simulations using complete traces from a 4-processor machine. Overall, for 16-Kbyte primary instruction caches, guarded sequential prefetching removes, on average, 66% of the instruction misses remaining in an operating system with an optimized layout, speeding up the operating system by 10%. Moreover, the scheme is more cost-effective and robust than existing sequential prefetching techniques.
{"title":"Instruction Prefetching of Systems Codes with Layout Optimized for Reduced Cache Misses","authors":"Chun Xia, J. Torrellas","doi":"10.1145/232973.233001","DOIUrl":"https://doi.org/10.1145/232973.233001","url":null,"abstract":"High-performing on-chip instruction caches are crucial to keep fast processors busy. Unfortunately, while on-chip caches are usually successful at intercepting instruction fetches in loop-intensive engineering codes, they are less able to do so in large systems codes. To improve the performance of the latter codes, the compiler can be used to lay out the code in memory for reduced cache conflicts. Interestingly, such an operation leaves the code in a state that can be exploited by a new type of instruction prefetching: guarded sequential prefetching.The idea is that the compiler leaves hints in the code as to how the code was laid out. Then, at run time, the prefetching hardware detects these hints and uses them to prefetch more effectively. This scheme can be implemented very cheaply: one bit encoded in control transfer instructions and a prefetch module that requires minor extensions to existing next-line sequential prefetchers. Furthermore, the scheme can be turned off and on at run time with the toggling of a bit in the TLB. The scheme is evaluated with simulations using complete traces from a 4-processor machine. Overall, for 16-Kbyte primary instruction caches, guarded sequential prefetching removes, on average, 66% of the instruction misses remaining in an operating system with an optimized layout, speeding up the operating system by 10%. Moreover, the scheme is more cost-effective and robust than existing sequential prefetching techniques.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114984206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the future, advanced integrated circuit processing and packaging technology will allow for several design options for multiprocessor microprocessors. In this paper we consider three architectures: shared-primary cache, shared-secondary cache, and shared-memory. We evaluate these three architectures using a complete system simulation environment which models the CPU, memory hierarchy, and I/O devices in sufficient detail to boot and run a commercial operating system. Within our simulation environment, we measure performance using representative hand- and compiler-generated parallel applications, and a multiprogramming workload. Our results show that when applications exhibit fine-grained sharing, both shared-primary and shared-secondary architectures perform similarly when the full costs of sharing the primary cache are included.
{"title":"Evaluation of Design Alternatives for a Multiprocessor Microprocessor","authors":"B. A. Nayfeh, Lance Hammond, K. Olukotun","doi":"10.1145/232973.232982","DOIUrl":"https://doi.org/10.1145/232973.232982","url":null,"abstract":"In the future, advanced integrated circuit processing and packaging technology will allow for several design options for multiprocessor microprocessors. In this paper we consider three architectures: shared-primary cache, shared-secondary cache, and shared-memory. We evaluate these three architectures using a complete system simulation environment which models the CPU, memory hierarchy and I/O devices in sufficient detail to boot and run a commercial operating system. Within our simulation environment, we measure performance using representative hand and compiler generated parallel applications, and a multiprogramming workload. Our results show that when applications exhibit fine-grained sharing, both shared-primary and shared-secondary architectures perform similarly when the full costs of sharing the primary cache are included.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122166463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
O. Maquelin, G. Gao, H. Hum, K. B. Theobald, Xin-Min Tian
Parallel systems supporting multithreading, or message passing in general, have typically used either polling or interrupts to handle incoming messages. Neither approach is ideal; either may lead to excessive overheads or message-handling latencies, depending on the application. This paper investigates a combined approach---the Polling Watchdog---in which both are used depending on the circumstances. The Polling Watchdog is a simple hardware extension that limits the generation of interrupts to the cases where explicit polling fails to handle the message quickly. As an added benefit, this mechanism also has the potential to simplify the interaction between interrupts and the network accesses performed by the program. We present the resulting performance for the EARTH-MANNA-S system, an implementation of the EARTH (Efficient Architecture for Running THreads) execution model on the MANNA multiprocessor. In contrast to the original EARTH-MANNA system, this system does not use a dedicated communication processor. Rather, synchronization and communication tasks are performed on the same processor as the regular computations. Therefore, an efficient message-handling mechanism is essential to good performance. Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone. In fact, this mechanism allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system.
{"title":"Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling","authors":"O. Maquelin, G. Gao, H. Hum, K. B. Theobald, Xin-Min Tian","doi":"10.1145/232973.232992","DOIUrl":"https://doi.org/10.1145/232973.232992","url":null,"abstract":"Parallel systems supporting multithreading, or message passing in general, have typically used either polling or interrupts to handle incoming messages. Neither approach is ideal; either may lead to excessive overheads or message-handling latencies, depending on the application. This paper investigates a combined approach---Polling Watchdog, where both are used depending on the circumstances. The Polling Watchdog is a simple hardware extension that limits the generation of interrupts to the cases where explicit polling fails to handle the message quickly. As an added benefit, this mechanism also has the potential to simplify the interaction between interrupts and the network accesses performed by the program.We present the resulting performance for the EARTH-MANNA-S system, an implementation of the EARTH (Efficient Architecture for Running THreads) execution model on the MANNA multiprocessor. In contrast to the original EARTH-MANNA system, this system does not use a dedicated communication processor. Rather, synchronization and communication tasks are performed on the same processor as the regular computations. Therefore, an efficient message-handling mechanism is essential to good performance. Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone. In fact, this mechanism allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129279610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}