Rajiv Gupta. "The fuzzy barrier: a mechanism for high speed synchronization of processors." ASPLOS III, April 1989. doi:10.1145/70082.68187

Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using shared variables has two major drawbacks. First, execution of the barrier may be slow, as it may not only require the execution of several instructions but may also result in hot-spot accesses. Second, processors that are stalled waiting for other processors to reach the barrier are essentially idling and cannot do any useful work. This paper presents the notion of the fuzzy barrier, which avoids both drawbacks. The first problem is avoided by implementing the mechanism in hardware. The second is solved by extending the barrier concept to include a region of statements that a processor can execute while it awaits synchronization. The barrier regions are constructed by a compiler and consist of several instructions, such that a processor is ready to synchronize upon reaching the first instruction in the region and must synchronize before exiting it. When synchronization does occur, the processors may be executing at any point in their respective barrier regions. The larger the barrier region, the more likely it is that none of the processors will have to stall. Preliminary investigations show that barrier regions can be large and that program transformations can significantly increase their size. Examples of situations where such a mechanism can improve performance are presented. Results based on a software implementation of the fuzzy barrier on the Encore multiprocessor indicate that the mechanism can greatly reduce synchronization overhead.
Michael D. Smith, Mike Johnson, M. Horowitz. "Limits on multiple instruction issue." ASPLOS III, April 1989. doi:10.1145/70082.68209

This paper investigates the limitations on designing a processor that can sustain an execution rate of greater than one instruction per cycle on highly optimized, non-scientific applications. We have used trace-driven simulations to determine that these applications contain enough instruction independence to sustain an instruction rate of about two instructions per cycle. In a straightforward implementation, cost considerations argue strongly against decoding more than two instructions in one cycle. Given this constraint, the efficiency of instruction fetching, rather than the complexity of the execution hardware, limits the concurrency attainable at the instruction level.
M. E. Staknis. "Sheaved memory: architectural support for state saving and restoration in paged systems." ASPLOS III, April 1989. doi:10.1145/70082.68191

The concept of read-one/write-many paged memory is introduced and given the name sheaved memory. It is shown that sheaved memory is useful for efficiently maintaining checkpoints in main memory and for providing state saving and state restoration for software that includes recovery blocks or similar control structures. The organization of sheaved memory is described in detail, and a design is presented for a prototype sheaved-memory module that can be built easily from inexpensive, off-the-shelf components. The module can be incorporated within many available computers without altering the computers' hardware design. The concept of sheaved memory is simple and appealing, and its potential for use in a number of software contexts is foreseen.
S. Delgado-Rannauro, T. Reynolds. "A message driven OR-parallel machine." ASPLOS III, April 1989. doi:10.1145/70082.68203

A message-driven architecture for the execution of OR-parallel logic languages is proposed. The computational model is based on well-known compilation techniques for logic languages. We first present the multiple-binding mechanism for the OR-parallel Prolog architecture and then describe the corresponding OR-parallel abstract machine. A scheduling algorithm that does not rely on the availability of global data structures to direct the search for work is discussed. The message-driven processor, the processing node of the parallel machine, is designed to interact with a shared global address space and to process messages from other processing nodes efficiently. We discuss some of the results obtained from a high-level functional simulator of the message-driven machine.
Z. Aral, I. Gertner, G. Schaffer. "Efficient debugging primitives for multiprocessors." ASPLOS III, April 1989. doi:10.1145/70082.68190

Existing kernel-level debugging primitives are inappropriate for instrumenting complex sequential or parallel programs. These functions incur a heavy overhead in their use of system calls and process switches. Context switches are used to alternately invoke the debugger and the target programs. System calls are used to communicate data between the target and the debugger.

None of this is necessary in shared-memory multiprocessors. Multiple processors concurrently run both the debugger and the target. Shared memory is used to implement efficient communication. The target's state is accessed by running both the target and the debugger in the same address space. Finally, instrumentation points, which have largely been implemented as traps to the system, are reimplemented as simple branches to routines of arbitrary complexity maintained by the debugger. Not only are primitives such as conditional breakpoints thus generalized, but their efficiency is improved by several orders of magnitude. In the process, much of the traditional kernel support for debugging is reimplemented at user level.

This paper describes the implementation of debugging primitives in Parasight, a parallel programming environment. Parasight has been used to implement conditional breakpoints, an important primitive for both high-level and parallel debugging. Preliminary measurements indicate that Parasight breakpoints are 1000 times faster than the breakpoints in parallel "cdb", a conventional UNIX debugger. Lightweight conditional breakpoints open up new opportunities for debugging and profiling both parallel and sequential programs.
F. Burkowski, G. Cormack, G. D. P. Dueck. "Architectural support for synchronous task communication." ASPLOS III, April 1989. doi:10.1145/70082.68186

This paper describes the motivation for a set of intertask communication primitives, the hardware support of these primitives, the architecture used in the Sylvan project which studies these issues, and the experience gained from various experiments conducted in this area. We start by describing how these facilities have been implemented in a multiprocessor configuration that utilizes a shared backplane. This configuration represents a single node in the system. The latter part of the paper discusses a distributed multiple-node system and the extension of the primitives that are used in this expanded environment.

This research is funded by a strategic grant from the Natural Sciences and Engineering Research Council of Canada (Grant No. G1581).
N. Jouppi, D. W. Wall. "Available instruction-level parallelism for superscalar and superpipelined machines." ASPLOS III, April 1989. doi:10.1145/70082.68207

Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
S. Owicki, A. Agarwal. "Evaluating the performance of software cache coherence." ASPLOS III, April 1989. doi:10.1145/70082.68204

In a shared-memory multiprocessor with private caches, cached copies of a data item must be kept consistent. This is called cache coherence. Both hardware and software coherence schemes have been proposed. Software techniques are attractive because they avoid hardware complexity and can be used with any processor-memory interconnection. This paper presents an analytical model of the performance of two software coherence schemes and, for comparison, snoopy-cache hardware. The model is validated against address traces from a bus-based multiprocessor. The behavior of the coherence schemes under various workloads is compared, and their sensitivity to variations in workload parameters is assessed. The analysis shows that the performance of software schemes is critically determined by certain parameters of the workload: the proportion of data accesses, the fraction of shared references, and the number of times a shared block is accessed before it is purged from the cache. Snoopy caches are more resilient to variations in these parameters. Thus when evaluating a software scheme as a design alternative, it is essential to consider the characteristics of the expected workload. The performance of the two software schemes with a multistage interconnection network is also evaluated, and it is determined that both scale well.
G. Sohi, S. Vajapeyam. "Tradeoffs in instruction format design for horizontal architectures." ASPLOS III, April 1989. doi:10.1145/70082.68184

With recent improvements in software techniques and the enhanced level of fine-grain parallelism made available by such techniques, there has been increased interest in horizontal architectures and large instruction words that are capable of issuing more than one operation per instruction. This paper investigates some issues in the design of such instruction formats. We study how the choice of an instruction format is influenced by factors such as the degree of pipelining and the instruction's view of the register file. Our results suggest that very large instruction words capable of issuing one operation to each functional-unit resource in a horizontal architecture may be overkill. Restricted instruction formats with limited operation-issuing capabilities can provide similar performance (measured by the total number of time steps) with significantly less hardware in many cases.
J. Roos. "A real-time support processor for Ada tasking." ASPLOS III, April 1989. doi:10.1145/70082.68198

Task synchronization in Ada causes excessive run-time overhead due to the complex semantics of the rendezvous. To demonstrate that the speed can be increased by two orders of magnitude using special-purpose hardware, a single-chip VLSI support processor has been designed. By providing predictable and uniformly low overhead for the entire semantics of a rendezvous, the powerful real-time constructs of Ada can be used freely without performance degradation.

The key to high performance is the set of primitive operations implemented in hardware. Each operation is complex enough to replace a considerable amount of code, yet was designed to execute with a minimum of communication overhead. Task control blocks are stored on-chip, as are headers for entry, delay, and ready queues. All necessary scheduling is integrated into the operations. Delays are handled completely on-chip using an internal real-time clock.

A multilevel design strategy, based on silicon compilation, made it possible to run actual Ada programs on a functional emulator of the chip and use the results to verify the detailed design. A high degree of parallelism and pipelining, together with an elaborate internal addressing scheme, has reduced the number of clock cycles needed to perform each operation. Using 2 μm CMOS, the processor can run at 20 MHz. A complex rendezvous, including the calling sequence and all necessary scheduling, can be performed in less than 15 μs.