
Latest publications from Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture

Memory access reordering in vector processors
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386525
D. Lee
Interference among multiple vector streams that access memory concurrently is the major source of performance degradation in the main memory of pipelined vector processors. While totally eliminating interference appears to be impossible, little is known about how to design a memory system that can reduce it. In this paper, we introduce a concept called memory access reordering for reducing interference. This technique reduces interference by making the multiple vector streams access memory in an orderly fashion. Effective algorithms for memory access reordering are presented, and their efficient hardware implementations are described.
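The abstract does not spell out the algorithms; as a rough illustration of the idea, the following toy C model issues elements from several strided streams to whichever interleaved memory bank is idle, instead of stalling on whichever stream happens to be at the head of the queue. The bank count, stream parameters, and 4-cycle bank busy time are assumptions for the example, not figures from the paper.

```c
#include <stdio.h>

#define NUM_BANKS   8   /* assumed interleaved-memory configuration */
#define NUM_STREAMS 3
#define ELEMS       16

/* A strided vector stream: base address and stride, in word units. */
typedef struct { unsigned base, stride, next; } stream_t;

static unsigned bank_of(unsigned addr) { return addr % NUM_BANKS; }

int main(void) {
    stream_t s[NUM_STREAMS] = { {0, 1, 0}, {100, 2, 0}, {200, 4, 0} };
    unsigned busy_until[NUM_BANKS] = {0};  /* cycle each bank frees up */
    unsigned cycle = 0, done = 0;

    /* Naive issue services streams in a fixed order and stalls on bank
       conflicts.  The reordering idea: each cycle, issue from any stream
       whose next element maps to a currently idle bank. */
    while (done < NUM_STREAMS * ELEMS) {
        for (int i = 0; i < NUM_STREAMS; i++) {
            if (s[i].next >= ELEMS) continue;
            unsigned addr = s[i].base + s[i].next * s[i].stride;
            unsigned b = bank_of(addr);
            if (busy_until[b] <= cycle) {      /* bank idle: issue now */
                busy_until[b] = cycle + 4;     /* assumed bank busy time */
                printf("cycle %u: stream %d -> addr %u (bank %u)\n",
                       cycle, i, addr, b);
                s[i].next++;
                done++;
            }
        }
        cycle++;
    }
    return 0;
}
```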
Citations: 3
An argument for simple COMA
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386535
Ashley Saulsbury, T. Wilkinson, J. Carter, A. Landin
We present design details and some initial performance results of a novel scalable shared memory multiprocessor architecture. This architecture features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity. A software layer manages cache space allocation at a page granularity, similarly to distributed virtual shared memory (DVSM) systems, leaving simpler hardware to maintain shared memory coherence at a cache line granularity. Reducing the hardware complexity reduces both machine cost and development time. We call the resulting hybrid hardware-and-software multiprocessor architecture Simple COMA. Preliminary results indicate that the performance of Simple COMA is comparable to that of more complex contemporary all-hardware designs.
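A minimal sketch of the division of labour the abstract describes: software allocates backing store at page granularity, while hardware keeps coherence per cache line. The page/line sizes and the protocol states below are illustrative assumptions, not the paper's design.

```c
#include <stdio.h>
#include <string.h>

#define PAGE_LINES 64          /* assumed: 4 KB page / 64 B cache lines */

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state_t;

/* Per-page state: software tracks frame allocation, hardware tracks
   per-line coherence. */
typedef struct {
    int allocated;                 /* software: local frame exists?  */
    line_state_t line[PAGE_LINES]; /* hardware: per-line coherence   */
} page_t;

/* Software side: on first touch of a remote page, allocate a local frame
   but fetch no data; every line starts INVALID. */
static void page_fault_handler(page_t *p) {
    p->allocated = 1;
    for (int i = 0; i < PAGE_LINES; i++) p->line[i] = INVALID;
    printf("sw: local frame allocated, all lines INVALID\n");
}

/* Hardware side: per-line coherence fills lines on demand, as an
   ordinary invalidation-based protocol would. */
static void hw_load(page_t *p, int line) {
    if (!p->allocated) page_fault_handler(p);   /* software trap */
    if (p->line[line] == INVALID) {
        p->line[line] = SHARED;                 /* fetch from home/owner */
        printf("hw: line %d fetched, now SHARED\n", line);
    }
}

int main(void) {
    page_t p; memset(&p, 0, sizeof p);
    hw_load(&p, 0);  /* first touch: sw allocates page, hw fetches line */
    hw_load(&p, 0);  /* subsequent touch: pure hardware hit             */
    hw_load(&p, 7);  /* different line, same page: hw fetch only        */
    return 0;
}
```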
Citations: 109
Modeling virtual channel flow control in hypercubes
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386545
Younes M. Boura, C. Das
An analytical model for virtual channel flow control in n-dimensional hypercubes using the e-cube routing algorithm is developed. The model is based on determining the values of the different components that make up the average message latency. These components include the message transfer time, the blocking delay at each dimension, the multiplexing delay at each dimension, and the waiting delay at the source node. The first two components are determined using a probabilistic analysis. The average degree of multiplexing is determined using a Markov model, and the waiting delay at the source node is determined using an M/M/m queueing system. The model is fairly accurate in predicting the average message latency for different message sizes and a varying number of virtual channels per physical channel. It is demonstrated that wormhole switching, along with virtual channel flow control, makes the average message latency insensitive to the network size when the network is relatively lightly loaded (message arrival rate equal to 40% of channel capacity), and that the average message latency increases linearly with the average message size. The simplicity and accuracy of the analytical model make it an attractive and effective tool for predicting the behavior of n-dimensional hypercubes.
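The abstract gives no formulas, but the source-node waiting delay in an M/M/m model is presumably the standard Erlang-C result, sketched here with assumed notation: λ the message injection rate, μ the service rate of one virtual channel, and m the number of virtual channels.

```latex
% M/M/m waiting delay (Erlang C), with rho = lambda / (m mu) < 1:
% P_W is the probability an arriving message must wait, W_q the mean wait.
\[
  \rho = \frac{\lambda}{m\mu}, \qquad
  P_W = \frac{\dfrac{(m\rho)^m}{m!\,(1-\rho)}}
             {\displaystyle\sum_{k=0}^{m-1}\frac{(m\rho)^k}{k!}
              \;+\; \frac{(m\rho)^m}{m!\,(1-\rho)}}, \qquad
  W_q = \frac{P_W}{m\mu - \lambda}.
\]
```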
Citations: 14
Architectural support for inter-stream communication in a MSIMD system
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386528
V. Garg, D. Schimmel
This paper considers hardware support for the exploitation of control parallelism on data parallel architectures. It is well known that data parallel algorithms may also possess control parallel structure. However, the splitting of control leads to data dependency and synchronization issues that were handled implicitly in conventional SIMD architectures. These include synchronization of access to scalar and parallel variables, and synchronization for parallel communication operations. We propose a sharing mechanism for scalar variables and identify a strategy which allows synchronization of scalar variables between multiple streams. The techniques considered are based on a bit-interleaved register file structure which allows fast copy between register sets. Hardware cost estimates and timing analyses are provided, and a comparison with an alternative scheme is presented. The register file structure has been designed and simulated for the HP 0.8 μm CMOS process, and circuit simulation indicates that access times are less than six nanoseconds. In addition, the impact of this structure on system performance is also studied.
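As a rough illustration of why bit interleaving makes inter-set register copies cheap, the toy C model below interleaves two 32-bit register sets in one 64-bit physical row, so a whole-register copy between sets is a single mask-and-shift. The two-set layout and widths are assumptions for the example, not the paper's design.

```c
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* Model: two register sets bit-interleaved in one 64-bit physical row.
   Bit 2i holds bit i of set 0's register; bit 2i+1 holds bit i of set 1's. */
#define EVEN_BITS 0x5555555555555555ULL

static uint64_t write_reg(uint64_t row, int set, uint32_t val) {
    uint64_t spread = 0;
    for (int i = 0; i < 32; i++)
        spread |= (uint64_t)((val >> i) & 1) << (2 * i);
    if (set == 0) return (row & ~EVEN_BITS) | spread;
    return (row & EVEN_BITS) | (spread << 1);
}

static uint32_t read_reg(uint64_t row, int set) {
    uint32_t val = 0;
    for (int i = 0; i < 32; i++)
        val |= (uint32_t)((row >> (2 * i + set)) & 1u) << i;
    return val;
}

/* Inter-set copy is one mask-and-shift over the physical row, which is
   what the interleaved layout buys: no per-bit routing between columns. */
static uint64_t copy_set0_to_set1(uint64_t row) {
    return (row & EVEN_BITS) | ((row & EVEN_BITS) << 1);
}

int main(void) {
    uint64_t row = 0;
    row = write_reg(row, 0, 0xDEADBEEF);   /* stream 0 writes a scalar */
    row = write_reg(row, 1, 0x00000000);
    row = copy_set0_to_set1(row);          /* share it with stream 1   */
    assert(read_reg(row, 1) == 0xDEADBEEF);
    printf("set1 reg = 0x%08X\n", read_reg(row, 1));
    return 0;
}
```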
Citations: 3
Toward high communication performance through compiled communications on a circuit switched interconnection network
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386556
F. Cappello, C. Germain
This paper discusses a new principle of interconnection network for massively parallel architectures in the field of numerical computation. The principle is motivated by an analysis of application features and by the need to design a new kind of communication network combining very high bandwidth, very low latency, performance independent of the communication pattern or network load, and a performance improvement proportional to hardware performance improvements. Our approach is to combine compiled communications with a circuit-switched interconnection network. This paper presents the motivations for this principle, the hardware and software issues, and the design of a first prototype. The expected performance is a sustained aggregate bandwidth of more than 500 GBytes/s and an overall latency of less than 270 ns for a large implementation (4K inputs) with currently available technology.
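A toy sketch of the compiled-communication idea: the compiler emits a static schedule of circuit configurations and the runtime merely steps through it, setting up each circuit once per communication phase rather than once per message. The phase names and permutations below are invented for illustration.

```c
#include <stdio.h>

#define NODES 4

/* A compiled communication phase: perm[i] is the destination that node
   i's circuit is wired to for the whole phase.  Because the pattern is
   known at compile time, circuits are established once per phase. */
typedef struct { const char *name; int perm[NODES]; } phase_t;

static const phase_t schedule[] = {
    { "shift-by-1", {1, 2, 3, 0} },   /* e.g. nearest-neighbour exchange */
    { "transpose",  {0, 2, 1, 3} },   /* hypothetical pattern            */
};

static void establish_circuits(const phase_t *p) {
    printf("configure switch for phase '%s'\n", p->name);
    for (int src = 0; src < NODES; src++)
        printf("  circuit %d -> %d\n", src, p->perm[src]);
}

int main(void) {
    /* Runtime reduces to stepping through the compiler's schedule: set
       up circuits, let every node stream at link bandwidth, tear down. */
    for (unsigned i = 0; i < sizeof schedule / sizeof *schedule; i++) {
        establish_circuits(&schedule[i]);
        printf("  ... nodes transfer at full link bandwidth ...\n");
    }
    return 0;
}
```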
Citations: 26
Thread prioritization: a thread scheduling mechanism for multiple-context parallel processors
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386541
S. Fiske, W. Dally
Multiple-context processors provide register resources that allow rapid context switching between several threads as a means of tolerating long communication and synchronization latencies. When scheduling threads on such a processor, we must first decide which threads should have their state loaded into the multiple contexts, and second, which loaded thread is to execute instructions at any given time. In this paper we show that both decisions are important, and that incorrect choices can lead to serious performance degradation. We propose thread prioritization as a means of guiding both levels of scheduling. Each thread has a priority that can change dynamically and that the scheduler uses to allocate as many computation resources as possible to critical threads. We briefly describe its implementation, and we show simulation performance results for a number of simple benchmarks in which synchronization performance is critical.
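A minimal sketch of the second scheduling decision the abstract describes: among the thread contexts loaded in hardware, issue from the highest-priority one that is ready. The context count and priority values are illustrative assumptions.

```c
#include <stdio.h>

#define CONTEXTS 4   /* hardware-resident thread contexts (assumed count) */

typedef struct {
    int ready;       /* not blocked on a remote access or synchronization */
    int priority;    /* larger = more critical; may change dynamically    */
} context_t;

/* Each issue slot, run the highest-priority ready context rather than
   rotating round-robin, so critical threads get the compute resources. */
static int pick_context(const context_t ctx[CONTEXTS]) {
    int best = -1;
    for (int i = 0; i < CONTEXTS; i++)
        if (ctx[i].ready && (best < 0 || ctx[i].priority > ctx[best].priority))
            best = i;
    return best;   /* -1: all contexts blocked, pipeline idles */
}

int main(void) {
    context_t ctx[CONTEXTS] = {
        {1, 2}, {1, 9}, {0, 99}, {1, 9},   /* context 2 blocked despite top priority */
    };
    printf("issue from context %d\n", pick_context(ctx));  /* context 1 */
    ctx[1].priority = 1;  /* e.g. it passed its synchronization point */
    printf("issue from context %d\n", pick_context(ctx));  /* context 3 */
    return 0;
}
```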
Citations: 8
Access ordering and memory-conscious cache utilization
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386537
S. Mckee, W. Wulf
As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance factor for many applications. Several approaches to bridging this performance gap have been suggested. This paper examines one approach, access ordering, and pushes its limits to determine bounds on memory performance. We present several access-ordering schemes and compare their performance, developing analytic models and partially validating them with benchmark timings on the Intel i860XR.
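As a rough illustration of what access ordering buys, the toy C model below costs a copy-like access pattern under a one-open-page DRAM model, first in natural interleaved order and then with the two streams' accesses grouped by DRAM page. The page size and hit/miss costs are assumptions, not parameters from the paper.

```c
#include <stdio.h>

#define PAGE_WORDS 256      /* assumed DRAM page size in words        */
#define HIT_COST    1       /* page-mode access                       */
#define MISS_COST   5       /* access that must open a new DRAM page  */

/* Cost of an access sequence under a simple one-open-page DRAM model. */
static unsigned cost(const unsigned *addr, int n) {
    unsigned total = 0, open_page = ~0u;
    for (int i = 0; i < n; i++) {
        unsigned page = addr[i] / PAGE_WORDS;
        total += (page == open_page) ? HIT_COST : MISS_COST;
        open_page = page;
    }
    return total;
}

int main(void) {
    /* Two interleaved streams (read x[i], read y[i]) far apart in
       memory, so natural order ping-pongs between two DRAM pages. */
    unsigned natural[16], grouped[16];
    for (int i = 0; i < 8; i++) {
        natural[2 * i]     = 0    + i;   /* stream x */
        natural[2 * i + 1] = 4096 + i;   /* stream y */
        grouped[i]     = natural[2 * i];
        grouped[8 + i] = natural[2 * i + 1];
    }
    printf("natural order cost: %u\n", cost(natural, 16));  /* 16 page misses */
    printf("grouped order cost: %u\n", cost(grouped, 16));  /*  2 page misses */
    return 0;
}
```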
Citations: 73
Optimizing instruction cache performance for operating system intensive workloads
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386527
J. Torrellas, Chun Xia, Russell L. Daigle
High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. Therefore, it is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes in detail the locality patterns of operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely-executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. As a result, interference within popular execution paths dominates instruction cache misses. Based on our observations, we propose an algorithm to expose these localities and reduce interference. For a range of cache sizes, associativities, line sizes, and other organizations, we show that we reduce total instruction miss rates by 31-86% (up to 2.9 absolute points). Using a simple model, this corresponds to execution time reductions on the order of 12-26%. In addition, our optimized operating system combines well with optimized or unoptimized applications.
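A toy sketch in the spirit of profile-guided code placement (not the paper's exact algorithm): greedily chain each basic block to its hottest unplaced successor, so the popular execution path becomes contiguous and maps to consecutive cache lines. The block graph and frequencies are invented for illustration.

```c
#include <stdio.h>

#define BLOCKS 5

/* Profile: freq[i][j] = how often block i falls through / branches to j. */
static const int freq[BLOCKS][BLOCKS] = {
    {0, 90, 10, 0,  0},
    {0,  0,  0, 85, 5},
    {0,  0,  0, 0, 10},
    {0,  0,  0, 0, 80},
    {0,  0,  0, 0,  0},
};

int main(void) {
    /* Greedy chaining: starting from the entry block, repeatedly place
       the hottest unplaced successor next in memory. */
    int placed[BLOCKS] = {0}, order[BLOCKS], cur = 0, n = 0;
    placed[0] = 1; order[n++] = 0;
    while (n < BLOCKS) {
        int best = -1;
        for (int j = 0; j < BLOCKS; j++)
            if (!placed[j] && (best < 0 || freq[cur][j] > freq[cur][best]))
                best = j;
        placed[best] = 1; order[n++] = best; cur = best;
    }
    printf("layout:");
    for (int i = 0; i < n; i++) printf(" B%d", order[i]);
    printf("\n");   /* expected: B0 B1 B3 B4 B2, the hot path up front */
    return 0;
}
```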
Citations: 72
Implementation of atomic primitives on distributed shared memory multiprocessors
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386540
Maged M. Michael, M. Scott
In this paper we consider several hardware implementations of the general-purpose atomic primitives fetch-and-Φ, compare-and-swap, load-linked, and store-conditional on large-scale shared-memory multiprocessors. These primitives have proven popular on small-scale bus-based machines, but have yet to become widely available on large-scale, distributed shared memory machines. We propose several alternative hardware implementations of these primitives, and then analyze the performance of these implementations for various data sharing patterns. Our results indicate that good overall performance can be obtained by implementing compare-and-swap in the cache controllers, and by providing an additional instruction to load an exclusive copy of a cache line.
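For reference, the classic construction these primitives support: building a fetch-and-Φ operation (here, fetch-and-add) from compare-and-swap. The sketch below uses C11 atomics to stand in for the hardware instructions; it is not the paper's cache-controller implementation.

```c
#include <stdatomic.h>
#include <stdio.h>

/* fetch-and-add built from compare-and-swap.  CAS fails if another
   processor updated *addr since we read it; on failure, 'old' is
   refreshed with the current value and we retry. */
static int fetch_and_add(atomic_int *addr, int delta) {
    int old = atomic_load(addr);
    while (!atomic_compare_exchange_weak(addr, &old, old + delta))
        ;   /* spin: 'old' now holds the value that beat us */
    return old;
}

int main(void) {
    atomic_int counter = 0;
    int prev = fetch_and_add(&counter, 5);
    printf("previous = %d, counter = %d\n", prev, atomic_load(&counter));
    return 0;
}
```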
Citations: 29
DASC cache
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386548
André Seznec
For many microprocessors, cache hit time determines the clock cycle. On the other hand, the cache miss penalty (measured in instruction issue delays) becomes higher and higher. Reconciling a low cache miss ratio with a low cache hit time is an important issue. When caches are virtually indexed, the operating system (or some specific hardware) has to manage the data consistency of caches and memory. Unfortunately, reconciling physical indexing of the cache with a low cache hit time is very difficult. In this paper, we propose the Direct-mapped Access Set-associative Check (DASC) cache for addressing both difficulties. In a DASC cache, the data array is direct-mapped, so the cache hit time is low. However, the tag array is set-associative, and the external miss ratio of a DASC cache is the same as the miss ratio of a set-associative cache. When the size of one way of the tag array is tied to the minimum page size, a virtually indexed but physically tagged DASC cache correctly handles all the difficulties associated with cache consistency. Trace-driven simulations show that, for cache sizes in the range of 16 to 64 Kbytes and page sizes in the range of 4 to 8 Kbytes, a DASC cache is a valuable trade-off, allowing a fast cache hit time and a low cache miss ratio while cache consistency management is performed by hardware.
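A toy model of the lookup flow the abstract describes (not Seznec's exact design): data is read from a single direct-mapped frame while the tags of all ways in that frame's set are checked, so lines that conflict direct-mapped can coexist in the set, and a tag match in another way costs an internal swap rather than an external miss. The sizes and spill policy are assumptions for the example.

```c
#include <stdio.h>

#define FRAMES 8               /* direct-mapped data array: 8 lines       */
#define WAYS   2               /* tag array checked 2-way associatively   */
#define SETS   (FRAMES / WAYS) /* frames s and s+SETS belong to one set   */

typedef struct { unsigned tag; int valid; } tag_t;
static tag_t tags[FRAMES];

static void cache_access(unsigned addr) {
    unsigned frame = addr % FRAMES;  /* the single direct-mapped data index */
    unsigned set   = addr % SETS;
    unsigned tag   = addr / SETS;    /* identifies the line within its set  */

    if (tags[frame].valid && tags[frame].tag == tag) {
        printf("addr %2u: fast hit (frame %u)\n", addr, frame);
        return;
    }
    for (unsigned w = 0; w < WAYS; w++) {   /* set-associative tag check */
        unsigned f = set + w * SETS;
        if (f != frame && tags[f].valid && tags[f].tag == tag) {
            tag_t tmp = tags[f];            /* slow hit: migrate the line */
            tags[f] = tags[frame];          /* into its direct-mapped     */
            tags[frame] = tmp;              /* frame; no memory traffic   */
            printf("addr %2u: slow hit, frames %u <-> %u swapped\n",
                   addr, frame, f);
            return;
        }
    }
    /* External miss: install in the direct-mapped frame and spill the
       previous occupant into the other way of the set. */
    unsigned spill = set + ((frame / SETS + 1) % WAYS) * SETS;
    tags[spill] = tags[frame];
    tags[frame].tag = tag;
    tags[frame].valid = 1;
    printf("addr %2u: external miss (frame %u)\n", addr, frame);
}

int main(void) {
    /* Addresses 1 and 9 conflict in a pure direct-mapped cache (both map
       to frame 1) but coexist here, as in a 2-way set-associative cache. */
    cache_access(1); cache_access(9); cache_access(1); cache_access(9);
    return 0;
}
```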
Citations: 20