
Latest publications from ASPLOS III

The fuzzy barrier: a mechanism for high speed synchronization of processors
Pub Date : 1989-04-01 DOI: 10.1145/70082.68187
Rajiv Gupta
Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using shared variables has two major drawbacks. Firstly, the execution of the barrier may be slow, as it may not only require execution of several instructions but also result in hot-spot accesses. Secondly, processors that are stalled waiting for other processors to reach the barrier are essentially idling and cannot do any useful work. In this paper, the notion of the fuzzy barrier, which avoids the above drawbacks, is presented. The first problem is avoided by implementing the mechanism in hardware. The second problem is solved by extending the barrier concept to include a region of statements that can be executed by a processor while it awaits synchronization. The barrier regions are constructed by a compiler and consist of several instructions such that a processor is ready to synchronize upon reaching the first instruction in this region and must synchronize before exiting the region. When synchronization does occur, the processors could be executing at any point in their respective barrier regions. The larger the barrier region, the more likely it is that none of the processors will have to stall. Preliminary investigations show that barrier regions can be large and that the use of program transformations can significantly increase their size. Examples of situations where such a mechanism can result in improved performance are presented. Results based on a software implementation of the fuzzy barrier on the Encore multiprocessor indicate that the synchronization overhead can be greatly reduced using the mechanism.
Citations: 152
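The split arrive/wait structure of the fuzzy barrier can be sketched in software (a minimal illustration under assumed names such as `FuzzyBarrier`, `arrive`, and `wait`; the paper's mechanism is implemented in hardware):

```python
import threading

class FuzzyBarrier:
    """Split barrier: arrive() marks a thread ready to synchronize,
    wait() blocks until every thread of that phase has arrived.
    Between the two calls a thread can execute barrier-region work
    instead of idling."""
    def __init__(self, nthreads):
        self._cond = threading.Condition()
        self._nthreads = nthreads
        self._arrived = 0
        self._phase = 0

    def arrive(self):
        with self._cond:
            phase = self._phase
            self._arrived += 1
            if self._arrived == self._nthreads:
                self._arrived = 0
                self._phase += 1            # release the whole phase
                self._cond.notify_all()
            return phase

    def wait(self, phase):
        with self._cond:
            while self._phase == phase:
                self._cond.wait()

barrier = FuzzyBarrier(2)
log = []

def worker(tid):
    phase = barrier.arrive()   # first instruction of the barrier region
    log.append(tid)            # barrier-region work while others catch up
    barrier.wait(phase)        # must synchronize before leaving the region

threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The more independent work placed between `arrive` and `wait`, the less time a thread spends blocked, which is the effect the paper measures.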
Limits on multiple instruction issue
Pub Date : 1989-04-01 DOI: 10.1145/70082.68209
Michael D. Smith, Mike Johnson, M. Horowitz
This paper investigates the limitations on designing a processor which can sustain an execution rate of greater than one instruction per cycle on highly-optimized, non-scientific applications. We have used trace-driven simulations to determine that these applications contain enough instruction independence to sustain an instruction rate of about two instructions per cycle. In a straightforward implementation, cost considerations argue strongly against decoding more than two instructions in one cycle. Given this constraint, the efficiency in instruction fetching rather than the complexity of the execution hardware limits the concurrency attainable at the instruction level.
Citations: 188
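A toy dependence-limited scheduler illustrates the kind of trace-driven measurement the abstract describes (illustrative names, unit operation latency assumed; this is not the authors' simulator):

```python
def sustained_ipc(trace, issue_width):
    """Schedule a register trace greedily: each instruction issues in
    the first cycle after all of its source registers are ready,
    subject to the per-cycle issue-width limit. Returns instructions
    per cycle over the whole trace."""
    ready = {}        # register -> cycle its value becomes available
    issued = {}       # cycle -> instructions issued that cycle
    last = 0
    for dests, srcs in trace:
        cycle = max((ready.get(r, 0) for r in srcs), default=0)
        while issued.get(cycle, 0) >= issue_width:
            cycle += 1                      # structural (issue) limit
        issued[cycle] = issued.get(cycle, 0) + 1
        for r in dests:
            ready[r] = cycle + 1            # unit latency
        last = max(last, cycle)
    return len(trace) / (last + 1)

# Four independent instructions sustain the full two-wide issue rate...
independent = [({"r%d" % i}, set()) for i in range(4)]
# ...while a fully dependent chain is limited to one per cycle.
chain = [({"r1"}, set()), ({"r2"}, {"r1"}),
         ({"r3"}, {"r2"}), ({"r4"}, {"r3"})]
```

On real traces, the mix of such dependence patterns is what bounds the sustainable rate near the paper's figure of about two instructions per cycle.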
Sheaved memory: architectural support for state saving and restoration in paged systems
Pub Date : 1989-04-01 DOI: 10.1145/70082.68191
M. E. Staknis
The concept of read-one/write-many paged memory is introduced and given the name sheaved memory. It is shown that sheaved memory is useful for efficiently maintaining checkpoints in main memory and for providing state saving and state restoration for software that includes recovery blocks or similar control structures. The organization of sheaved memory is described in detail, and a design is presented for a prototype sheaved-memory module that can be built easily from inexpensive, off-the-shelf components. The module can be incorporated within many available computers without altering the computers' hardware design. The concept of sheaved memory is simple and appealing, and its potential for use in a number of software contexts is foreseen.
Citations: 19
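The checkpoint-and-restore use case can be modeled behaviorally with overlay pages (a sketch of the use case only; the paper's design is a hardware memory module, and all names here are illustrative):

```python
class CheckpointedMemory:
    """Each checkpoint pushes an empty overlay; writes land in the top
    overlay and reads fall through to the most recent write, so older
    overlays preserve the checkpointed state untouched."""
    def __init__(self):
        self._sheets = [{}]

    def write(self, addr, value):
        self._sheets[-1][addr] = value

    def read(self, addr, default=0):
        for sheet in reversed(self._sheets):
            if addr in sheet:
                return sheet[addr]
        return default

    def checkpoint(self):
        self._sheets.append({})

    def restore(self):
        """Discard all writes since the last checkpoint, as a recovery
        block would on failure."""
        if len(self._sheets) > 1:
            self._sheets.pop()

mem = CheckpointedMemory()
mem.write(0x10, 5)
mem.checkpoint()
mem.write(0x10, 9)     # speculative update inside a recovery block
mem.restore()          # roll back to the checkpointed value
```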
A message driven OR-parallel machine
Pub Date : 1989-04-01 DOI: 10.1145/70082.68203
S. Delgado-Rannauro, T. Reynolds
A message driven architecture for the execution of OR-parallel logic languages is proposed. The computational model is based on well-known compilation techniques for logic languages. We first present the multiple binding mechanism for the OR-parallel Prolog architecture and then describe the corresponding OR-parallel abstract machine. A scheduling algorithm that does not rely upon the availability of global data structures to direct the search for work is discussed. The message driven processor, the processing node of the parallel machine, is designed to interact with a shared global address space and to efficiently process messages from other processing nodes. We discuss some of the results obtained from a high-level functional simulator of the message driven machine.
Citations: 2
Efficient debugging primitives for multiprocessors
Pub Date : 1989-04-01 DOI: 10.1145/70082.68190
Z. Aral, I. Gertner, G. Schaffer
Existing kernel-level debugging primitives are inappropriate for instrumenting complex sequential or parallel programs. These functions incur a heavy overhead in their use of system calls and process switches. Context switches are used to alternately invoke the debugger and the target programs. System calls are used to communicate data between the target and debugger.

None of this is necessary in shared-memory multiprocessors. Multiple processors concurrently run both the debugger and the target. Shared memory is used to implement efficient communication. The target's state is accessed by running both the target and the debugger in the same address space. Finally, instrumentation points, which have largely been implemented as traps to the system, are reimplemented as simple branches to routines of arbitrary complexity maintained by the debugger. Not only are primitives such as conditional breakpoints thus generalized, but their efficiency is improved by several orders of magnitude. In the process, much of the traditional system's kernel support for debugging is reimplemented at user level.

This paper describes the implementation of debugging primitives in Parasight, a parallel programming environment. Parasight has been used to implement conditional breakpoints, an important primitive for both high-level and parallel debugging. Preliminary measurements indicate that Parasight breakpoints are 1000 times faster than the breakpoints in parallel “cdb”, a conventional UNIX debugger. Light-weight conditional breakpoints open up new opportunities for debugging and profiling both parallel and sequential programs.
Citations: 29
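The core idea, an instrumentation point compiled as a cheap in-process branch rather than a kernel trap, can be sketched as follows (hypothetical names; this is not Parasight's actual interface):

```python
_probes = {}   # instrumentation point -> (condition, action)

def set_breakpoint(point, condition, action):
    """Arm a conditional breakpoint; both callables run in the target's
    own address space, so no system call or context switch is needed."""
    _probes[point] = (condition, action)

def probe(point, env):
    """Placed at an instrumentation point: a dictionary lookup and a
    branch when unarmed, a direct call into debugger-maintained code
    when armed."""
    hook = _probes.get(point)
    if hook is not None:
        condition, action = hook
        if condition(env):
            action(env)

hits = []
set_breakpoint("loop-head",
               lambda env: env["i"] == 3,
               lambda env: hits.append(env["i"]))
for i in range(6):
    probe("loop-head", {"i": i})   # inline check; no trap to the kernel
```

Because the condition is evaluated inline in the target, an unsatisfied breakpoint costs only a lookup and a branch, which is the source of the speedup the abstract reports.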
Architectural support for synchronous task communication
Pub Date : 1989-04-01 DOI: 10.1145/70082.68186
F. Burkowski, G. Cormack, G. D. P. Dueck
This paper describes the motivation for a set of intertask communication primitives, the hardware support of these primitives, the architecture used in the Sylvan project which studies these issues, and the experience gained from various experiments conducted in this area. We start by describing how these facilities have been implemented in a multiprocessor configuration that utilizes a shared backplane. This configuration represents a single node in the system. The latter part of the paper discusses a distributed multiple node system and the extension of the primitives that are used in this expanded environment.

This research is funded by a strategic grant from the Natural Sciences and Engineering Research Council of Canada (Grant No. G1581).
Citations: 4
Available instruction-level parallelism for superscalar and superpipelined machines
Pub Date : 1989-04-01 DOI: 10.1145/70082.68207
N. Jouppi, D. W. Wall
Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
Citations: 391
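The claimed rough equivalence of the two techniques can be illustrated with back-of-the-envelope timing for a stream of independent instructions (a simplification assuming ideal pipelining; function names and parameters are illustrative):

```python
import math

def superscalar_time(n_instr, issue_width, cycle_time):
    """n-wide issue: one full-length cycle per group of issued
    instructions."""
    return math.ceil(n_instr / issue_width) * cycle_time

def superpipelined_time(n_instr, degree, cycle_time):
    """One instruction per minor cycle, with the minor cycle a
    1/degree fraction of the base cycle time."""
    return n_instr * cycle_time / degree

# With issue width equal to the degree of superpipelining, both
# machines finish an independent instruction stream in the same time.
```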
Evaluating the performance of software cache coherence
Pub Date : 1989-04-01 DOI: 10.1145/70082.68204
S. Owicki, A. Agarwal
In a shared-memory multiprocessor with private caches, cached copies of a data item must be kept consistent. This is called cache coherence. Both hardware and software coherence schemes have been proposed. Software techniques are attractive because they avoid hardware complexity and can be used with any processor-memory interconnection. This paper presents an analytical model of the performance of two software coherence schemes and, for comparison, snoopy-cache hardware. The model is validated against address traces from a bus-based multiprocessor. The behavior of the coherence schemes under various workloads is compared, and their sensitivity to variations in workload parameters is assessed. The analysis shows that the performance of software schemes is critically determined by certain parameters of the workload: the proportion of data accesses, the fraction of shared references, and the number of times a shared block is accessed before it is purged from the cache. Snoopy caches are more resilient to variations in these parameters. Thus when evaluating a software scheme as a design alternative, it is essential to consider the characteristics of the expected workload. The performance of the two software schemes with a multistage interconnection network is also evaluated, and it is determined that both scale well.
Citations: 55
Tradeoffs in instruction format design for horizontal architectures
Pub Date : 1989-04-01 DOI: 10.1145/70082.68184
G. Sohi, S. Vajapeyam
With recent improvements in software techniques and the enhanced level of fine-grain parallelism made available by such techniques, there has been an increased interest in horizontal architectures and large instruction words that are capable of issuing more than one operation per instruction. This paper investigates some issues in the design of such instruction formats. We study how the choice of an instruction format is influenced by factors such as the degree of pipelining and the instruction's view of the register file. Our results suggest that very large instruction words capable of issuing one operation to each functional unit resource in a horizontal architecture may be overkill. Restricted instruction formats with limited operation-issuing capabilities are capable of providing similar performance (measured by the total number of time steps) with significantly less hardware in many cases.
Citations: 30
A real-time support processor for Ada tasking
Pub Date : 1989-04-01 DOI: 10.1145/70082.68198
J. Roos
Task synchronization in Ada causes excessive run-time overhead due to the complex semantics of the rendezvous. To demonstrate that the speed can be increased by two orders of magnitude by using special-purpose hardware, a single-chip VLSI support processor has been designed. By providing predictable and uniformly low overhead for the entire semantics of a rendezvous, the powerful real-time constructs of Ada can be used freely without performance degradation.

The key to high performance is the set of primitive operations implemented in hardware. Each operation is complex enough to replace a considerable amount of code and is designed to execute with a minimum of communication overhead. Task control blocks are stored on-chip, as well as headers for entry, delay and ready queues. All necessary scheduling is integrated in the operations. Delays are handled completely on-chip using an internal real-time clock.

A multilevel design strategy, based on silicon compilation, made it possible to run actual Ada programs on a functional emulator of the chip and use the results to verify the detailed design. A high degree of parallelism and pipelining together with an elaborate internal addressing scheme has reduced the number of clock cycles needed to perform each operation. Using 2 μm CMOS, the processor can run at 20 MHz. A complex rendezvous, including the calling sequence and all necessary scheduling, can be performed in less than 15 μs.
Cited by: 21