
Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture: Latest Publications

Fine-grain multi-thread processor architecture for massively parallel processing
Pub Date: 1995-01-22 DOI: 10.1109/HPCA.1995.386532
T. Kawano, S. Kusakabe, R. Taniguchi, M. Amamiya
Latency caused by remote memory access and remote procedure calls is one of the most serious problems in massively parallel computers. To eliminate the processor idle time these latencies cause, processors must switch contexts quickly among fine-grain concurrent processes. In this paper, we propose a processor architecture, called Datarol-II, that supports efficient fine-grain multi-thread execution through such fast context switching. In the Datarol-II processor, an implicit register load/store mechanism is embedded in the execution pipeline to reduce the memory-access overhead of context switching. To reduce local memory access latency, a two-level hierarchical memory system and a load control mechanism are also introduced. We describe the Datarol-II processor architecture and present its evaluation results.
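The motivation for fast context switching can be illustrated with a simple utilization model. The sketch below is not the Datarol-II pipeline; it is a generic multi-context model (all parameters hypothetical) showing how enough fine-grain threads overlap remote-access latency with useful work.

```python
# Illustrative only: why fast switching among fine-grain threads hides
# remote-access latency. Parameters are hypothetical, not Datarol-II's.

def utilization(n_threads, run_len, latency, switch_cost):
    """Fraction of cycles doing useful work when each thread computes for
    run_len cycles and then waits `latency` cycles on a remote access."""
    turn = run_len + switch_cost              # cost of one thread's turn
    busy = n_threads * turn                   # work available per round
    round_len = max(busy, run_len + latency)  # a round cannot end before the
                                              # first thread is ready again
    return (n_threads * run_len) / round_len

for n in (1, 2, 4, 8, 16):
    print(n, round(utilization(n, run_len=10, latency=100, switch_cost=2), 2))
```

With these hypothetical numbers, utilization climbs from about 9% with a single context toward the switching-overhead-limited ceiling of run_len / (run_len + switch_cost) as contexts are added, which is why the paper pushes the per-switch cost down with the implicit register load/store mechanism.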
Citations: 21
Modeling virtual channel flow control in hypercubes
Pub Date: 1995-01-22 DOI: 10.1109/HPCA.1995.386545
Younes M. Boura, C. Das
An analytical model for virtual channel flow control in n-dimensional hypercubes using the e-cube routing algorithm is developed. The model is based on determining the values of the different components that make up the average message latency. These components include the message transfer time, the blocking delay at each dimension, the multiplexing delay at each dimension, and the waiting delay at the source node. The first two components are determined using a probabilistic analysis. The average degree of multiplexing is determined using a Markov model, and the waiting delay at the source node is determined using an M/M/m queueing system. The model is fairly accurate in predicting the average message latency for different message sizes and a varying number of virtual channels per physical channel. It is demonstrated that wormhole switching along with virtual channel flow control makes the average message latency insensitive to the network size when the network is relatively lightly loaded (message arrival rate equal to 40% of channel capacity), and that the average message latency increases linearly with the average message size. The simplicity and accuracy of the analytical model make it an attractive and effective tool for predicting the behavior of n-dimensional hypercubes.
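In the terms the abstract names (symbols are ours, not the authors'), the model decomposes the average message latency in an n-dimensional hypercube as

$$\bar{T}_{\mathrm{msg}} \;=\; T_{\mathrm{transfer}} \;+\; \sum_{i=1}^{n}\left(B_i + M_i\right) \;+\; W_{\mathrm{src}},$$

where $B_i$ and $M_i$ are the blocking and multiplexing delays at dimension $i$ (the former from the probabilistic analysis, the latter from the Markov model of virtual-channel multiplexing), and $W_{\mathrm{src}}$ is the source-node waiting delay obtained from the M/M/m queueing system.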
Citations: 14
An argument for simple COMA
Pub Date: 1995-01-22 DOI: 10.1109/HPCA.1995.386535
Ashley Saulsbury, T. Wilkinson, J. Carter, A. Landin
We present design details and some initial performance results of a novel scalable shared memory multiprocessor architecture. This architecture features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity. A software layer manages cache space allocation at a page granularity, similarly to distributed virtual shared memory (DVSM) systems, leaving simpler hardware to maintain shared memory coherence at a cache line granularity. Reducing the hardware complexity reduces the machine cost and development time. We call the resulting hybrid hardware and software multiprocessor architecture Simple COMA. Preliminary results indicate that the performance of Simple COMA is comparable to that of more complex contemporary all-hardware designs.
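A minimal sketch, not the authors' implementation, of the division of labor the abstract describes: software allocates cache space at page granularity (as a DVSM system would), while a hardware layer, simulated here by a stub, fetches and keeps lines coherent at line granularity. All names are hypothetical.

```python
# Sketch of Simple COMA's software/hardware split. The "hardware" line-fill
# is a stub; real line-grain coherence happens below this interface.

PAGE_SIZE, LINE_SIZE = 4096, 64
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE

class Frame:
    def __init__(self):
        # per-line validity/coherence state is the hardware's job
        self.line_valid = [False] * LINES_PER_PAGE

class SimpleComaNode:
    def __init__(self):
        self.page_table = {}  # virtual page number -> local Frame

    def access(self, vpage, line):
        frame = self.page_table.get(vpage)
        if frame is None:
            # software: page-granularity allocation, no eager data copy
            frame = Frame()
            self.page_table[vpage] = frame
        if not frame.line_valid[line]:
            # hardware: fetch just this line from its current holder
            self.fetch_line_from_remote(vpage, line)
            frame.line_valid[line] = True
        return frame

    def fetch_line_from_remote(self, vpage, line):
        pass  # stands in for the line-grain coherence protocol

node = SimpleComaNode()
node.access(vpage=7, line=3)   # page fault handled in software + line fill
node.access(vpage=7, line=3)   # local hit: no software involvement at all
```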
Citations: 109
Architectural support for inter-stream communication in a MSIMD system
Pub Date: 1995-01-22 DOI: 10.1109/HPCA.1995.386528
V. Garg, D. Schimmel
This paper considers hardware support for the exploitation of control parallelism on data parallel architectures. It is well known that data parallel algorithms may also possess control parallel structure. However, the splitting of control leads to data dependency and synchronization issues that were implicitly handled in conventional SIMD architectures. These include synchronization of access to scalar and parallel variables, and synchronization for parallel communication operations. We propose a sharing mechanism for scalar variables and identify a strategy which allows synchronization of scalar variables between multiple streams. The techniques considered are based on a bit-interleaved register file structure which allows fast copies between register sets. Hardware cost estimates and timing analyses are provided, and a comparison with an alternate scheme is presented. The register file structure has been designed and simulated for the HP 0.8 μm CMOS process, and circuit simulation indicates that access times are less than six nanoseconds. In addition, the impact of this structure on system performance is also studied.
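The abstract does not spell out the synchronization semantics; one common way to realize inter-stream synchronization on a shared scalar is full/empty-bit semantics, sketched below with threads standing in for instruction streams. This is an assumed model for illustration, not the paper's circuit.

```python
# Full/empty-bit style synchronization on a shared scalar (an assumption
# for illustration): consuming streams block until a producer has written.
import threading

class SharedScalar:
    """A scalar shared between streams: readers block until a writer has
    produced the value (full/empty-bit semantics)."""
    def __init__(self):
        self._full = threading.Event()
        self._value = None

    def write(self, v):
        self._value = v
        self._full.set()      # mark full; release any waiting streams

    def read(self):
        self._full.wait()     # block while the scalar is still empty
        return self._value

s = SharedScalar()
threading.Thread(target=lambda: s.write(42)).start()
print(s.read())               # consumer stream blocks until 42 arrives
```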
Citations: 3
Toward high communication performance through compiled communications on a circuit switched interconnection network
Pub Date: 1995-01-22 DOI: 10.1109/HPCA.1995.386556
F. Cappello, C. Germain
This paper discusses a new principle of interconnection network for massively parallel architectures in the field of numerical computation. The principle is motivated by an analysis of application features and the need to design a new kind of communication network combining very high bandwidth, very low latency, performance independent of communication pattern or network load, and performance improvement proportional to hardware performance improvement. Our approach is to associate compiled communications with a circuit switched interconnection network. This paper presents the motivations for this principle, the hardware and software issues, and the design of a first prototype. The expected performance is a sustained aggregate bandwidth of more than 500 GBytes/s and an overall latency of less than 270 ns for a large implementation (4K inputs) with currently available technology.
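To make "compiled communications" concrete, here is an illustrative sketch (phases and names are hypothetical, not the prototype's software): the compiler knows each communication phase in advance, verifies that it is realizable as a single circuit-switched setting, and emits switch configurations instead of routing messages dynamically at run time.

```python
# Illustrative only: turn compile-time-known communication phases into
# circuit-switch settings. A phase must be a partial permutation so that
# every input and output port is used at most once.

phases = [
    {(0, 2), (1, 3)},          # phase 0: e.g. one step of a transpose
    {(2, 0), (3, 1)},          # phase 1: the reverse step
]

def compile_schedule(phases):
    """Return one switch setting (src -> dst map) per phase."""
    schedule = []
    for p in phases:
        srcs = [s for s, _ in p]
        dsts = [d for _, d in p]
        # realizable as one circuit setting only if ports are not reused
        assert len(set(srcs)) == len(srcs) and len(set(dsts)) == len(dsts)
        schedule.append(dict(p))
    return schedule

print(compile_schedule(phases))   # [{0: 2, 1: 3}, {2: 0, 3: 1}]
```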
Citations: 26
Software cache coherence for large scale multiprocessors
Pub Date: 1995-01-22 DOI: 10.1109/HPCA.1995.386534
L. Kontothanassis, M. Scott
Shared memory is an appealing abstraction for parallel programming. To perform well, however, it must be implemented with caches, and caches require a coherence mechanism to ensure that processors reference current data. Hardware coherence mechanisms for large-scale machines are complex and costly, but existing software mechanisms for message-passing machines have not provided a performance-competitive solution. We claim that an intermediate hardware option, memory-mapped network interfaces that support a global physical address space, can provide most of the performance benefits of hardware cache coherence. We present a software coherence protocol that runs on this class of machines and greatly narrows the performance gap between hardware and software coherence. We compare the performance of the protocol to that of existing software and hardware alternatives and evaluate the tradeoffs among various cache-write policies. We also observe that simple program changes can greatly improve performance. For the programs in our test suite, with those changes in place, software coherence is often faster and never more than 55% slower than hardware coherence.
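As a schematic illustration only (this is generic invalidation-based coherence, not the paper's specific protocol), a software coherence layer on such a machine maintains a state machine per coherence unit and uses the global physical address space to copy and invalidate data directly:

```python
# Generic sketch: per-block software coherence over a global physical
# address space. The stubs stand in for direct remote memory operations
# issued through a memory-mapped network interface.

INVALID, SHARED, DIRTY = "I", "S", "D"

def fetch_remote_copy(block):
    pass  # remote read through the global physical address space

def send_invalidate(node, block):
    pass  # ask another node's software layer to drop its copy

def on_read(state, block):
    if state == INVALID:
        fetch_remote_copy(block)   # software brings the data local
        return SHARED
    return state

def on_write(state, block, sharers):
    if state != DIRTY:
        for node in sharers:       # software invalidates other copies
            send_invalidate(node, block)
    return DIRTY
```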
Citations: 37
Program balance and its impact on high performance RISC architectures
Pub Date: 1995-01-22 DOI: 10.1109/HPCA.1995.386526
L. John, V. Reddy, P. T. Hulina, L. D. Coraor
Information on the behavior of programs is essential for deciding the number and nature of functional units in high performance architectures. In this paper, we present studies on the balance of access and computation tasks on a typical RISC architecture, the MIPS. The MIPS programs are analyzed to find the demands they place on the memory system and on the floating point or integer computation units. A balance metric that indicates the match of accessing power to computation power is calculated. It is observed that many of the SPEC floating point programs and kernels from supercomputing applications, typically considered computation-intensive programs, place extensive demands on the memory system in terms of memory bandwidth. Access-related instructions are seen to dominate most instruction streams. We discuss how these instruction stream characteristics can limit instruction issue in superscalar processors. The properties of the dynamic instruction mix are used to alert computer architects to the importance of memory bandwidth: single-instruction-stream parallelism will not be much greater than two if the memory bandwidth is only one access per cycle. A decoupled access/execute architecture with multiple load/store units and queues, which alleviates the balance problem, is presented.
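The abstract does not give the balance metric's formula, so the sketch below uses one plausible form: compare the program's access-to-computation instruction ratio against the ratio the machine can sustain. Values near 1.0 indicate balance; values below 1.0 indicate a memory-bound mix. The instruction categories and port counts are hypothetical.

```python
# One plausible balance metric (an assumption, not the paper's exact
# formula): machine accessing power per compute unit divided by the
# program's access demand per computation.

def balance(mix, mem_ports=1, fp_units=1):
    """mix: dynamic instruction counts, e.g. {'load': n, 'store': n, 'fp': n}."""
    accesses = mix["load"] + mix["store"]
    computes = max(mix["fp"], 1)
    program_ratio = accesses / computes    # accesses demanded per computation
    machine_ratio = mem_ports / fp_units   # accesses deliverable per computation
    return machine_ratio / program_ratio   # 1.0 balanced; <1.0 memory-bound

print(balance({"load": 30, "store": 15, "fp": 25}))
# ~0.56: this mix consumes operands faster than one memory port supplies them
```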
Citations: 29
Thread prioritization: a thread scheduling mechanism for multiple-context parallel processors
Pub Date: 1995-01-22 DOI: 10.1109/HPCA.1995.386541
S. Fiske, W. Dally
Multiple-context processors provide register resources that allow rapid context switching among several threads as a means of tolerating long communication and synchronization latencies. When scheduling threads on such a processor, we must decide, first, which threads should have their state loaded into the multiple contexts and, second, which loaded thread is to execute instructions at any given time. In this paper we show that both decisions are important, and that incorrect choices can lead to serious performance degradation. We propose thread prioritization as a means of guiding both levels of scheduling. Each thread has a priority that can change dynamically and that the scheduler uses to allocate as many computation resources as possible to critical threads. We briefly describe its implementation, and we show simulation performance results for a number of simple benchmarks in which synchronization performance is critical.
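A minimal sketch of the two scheduling levels the abstract names, under an assumed simple priority model (the data structures are ours): priorities pick which threads occupy the hardware contexts, and then which loaded, ready thread issues instructions.

```python
# Two-level priority scheduling sketch (assumed model, not the paper's
# hardware): level 1 loads contexts, level 2 picks the issuing thread.
import heapq

def schedule(threads, n_contexts):
    """threads: list of (priority, thread_id, ready). Higher priority wins."""
    # level 1: the n highest-priority threads occupy the hardware contexts
    loaded = heapq.nlargest(n_contexts, threads)
    # level 2: among loaded threads, issue from the highest-priority ready one
    ready = [t for t in loaded if t[2]]
    return max(ready) if ready else None

threads = [(5, "sync-critical", True),
           (2, "background", True),
           (4, "io-wait", False)]
print(schedule(threads, n_contexts=2))
# ('sync-critical') runs; 'background' never displaces a critical thread
```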
Citations: 8
Access ordering and memory-conscious cache utilization
Pub Date: 1995-01-22 DOI: 10.1109/HPCA.1995.386537
S. Mckee, W. Wulf
As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance factor for many applications. Several approaches to bridging this performance gap have been suggested. This paper examines one approach, access ordering, and pushes its limits to determine bounds on memory performance. We present several access-ordering schemes and compare their performance, developing analytic models and partially validating them with benchmark timings on the Intel i860XR.
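A toy model of why access order matters (timing parameters hypothetical; this illustrates the general page-mode effect, not the paper's specific schemes): grouping stream references by DRAM page amortizes the page-setup cost that a naively interleaved loop order pays on nearly every access.

```python
# Toy DRAM page-mode model: reordering stream accesses by page cuts the
# total access time. HIT/MISS cycle counts are hypothetical.

PAGE = 2048
HIT, MISS = 4, 12   # cycles for a page-mode hit vs. a new-page access

def cycles(addresses):
    total, open_page = 0, None
    for a in addresses:
        page = a // PAGE
        total += HIT if page == open_page else MISS
        open_page = page
    return total

# two unit-stride streams, far apart in memory
s1 = [8 * i for i in range(64)]
s2 = [(1 << 20) + 8 * i for i in range(64)]

interleaved = [x for pair in zip(s1, s2) for x in pair]  # naive loop order
reordered = s1 + s2                                      # grouped by stream
print(cycles(interleaved), cycles(reordered))            # 1536 vs. 528
```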
Citations: 73
Optimizing instruction cache performance for operating system intensive workloads
Pub Date: 1995-01-22 DOI: 10.1109/HPCA.1995.386527
J. Torrellas, Chun Xia, Russell L. Daigle
High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. Therefore, it is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes in detail the locality patterns of the operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely-executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. As a result, interference within popular execution paths dominates instruction cache misses. Based on our observations, we propose an algorithm to expose these localities and reduce interference. For a range of cache sizes, associativities, line sizes, and other organizations, we show that we reduce total instruction miss rates by 31-86% (up to 2.9 absolute points). Using a simple model, this corresponds to execution time reductions on the order of 12-26%. In addition, our optimized operating system combines well with optimized or unoptimized applications.
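The paper's own layout algorithm is described in its body; as a flavor of the approach, here is a standard greedy layout pass in the same spirit: chain the most frequently interacting routines adjacently so popular execution paths stop mapping onto the same cache lines. Routine names and call counts are hypothetical.

```python
# Greedy code-layout sketch (not the paper's algorithm): place hot
# caller/callee pairs adjacently, ordered by profiled call frequency.

def layout(edges):
    """edges: {(caller, callee): call_count}. Returns a placement order."""
    order, placed = [], set()
    for (a, b), _ in sorted(edges.items(), key=lambda e: -e[1]):
        for f in (a, b):
            if f not in placed:
                order.append(f)
                placed.add(f)
    return order

edges = {("syscall_entry", "copy_in"): 900,
         ("copy_in", "page_fault"): 40,
         ("syscall_entry", "audit_log"): 3}
print(layout(edges))
# ['syscall_entry', 'copy_in', 'page_fault', 'audit_log']: the hot pair
# lands contiguously; the rarely-executed special case goes last
```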
Citations: 72