The I/O behavior of some scientific applications, a subset of Perfect benchmarks, executing on a multiprocessor is studied. The aim of this study is to explore the various patterns of I/O access of large scientific applications, and to understand the impact of this observed behavior on the I/O subsystem architecture. I/O behavior of the program is characterized by the demands it imposes on the I/O subsystem. It is observed that implicit I/O or paging is not a major problem for the applications considered and the I/O problem is mainly manifest in the explicit I/O done in the program. Various characteristics of I/O accesses are studied, and their impact on architecture design is discussed.<>
{"title":"A study of I/O behavior of Perfect benchmarks on a multiprocessor","authors":"A. Reddy, P. Banerjee","doi":"10.1145/325164.325157","DOIUrl":"https://doi.org/10.1145/325164.325157","url":null,"abstract":"The I/O behavior of some scientific applications, a subset of Perfect benchmarks, executing on a multiprocessor is studied. The aim of this study is to explore the various patterns of I/O access of large scientific applications, and to understand the impact of this observed behavior on the I/O subsystem architecture. I/O behavior of the program is characterized by the demands it imposes on the I/O subsystem. It is observed that implicit I/O or paging is not a major problem for the applications considered and the I/O problem is mainly manifest in the explicit I/O done in the program. Various characteristics of I/O accesses are studied, and their impact on architecture design is discussed.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122266865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Lenoski, J. Laudon, K. Gharachorloo, Anoop Gupta, J. Hennessy
DASH is a scalable shared-memory multiprocessor whose architecture consists of powerful processing nodes, each with a portion of the shared-memory, connected to a scalable interconnection network. A key feature of DASH is its distributed direction-based cache coherence protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely on broadcast; instead it uses point-to-point messages sent between the processors and memories to keep caches consistent. Furthermore, the DASH system does not contain any single serialization or control point. While these features provide the basis for scalability, they also force a reevaluation of many fundamental issues involved in the design of a protocol. These include the issues of correctness, performance, and protocol complexity. The design of the DASH coherence protocol is presented and discussed from the viewpoint of how it addresses the above issues. Also discussed is a strategy for verifying the correctness of the protocol. A brief comparison of the protocol with the IEEE Scalable Coherent Interface protocol is made.<>
{"title":"The directory-based cache coherence protocol for the DASH multiprocessor","authors":"D. Lenoski, J. Laudon, K. Gharachorloo, Anoop Gupta, J. Hennessy","doi":"10.1145/325164.325132","DOIUrl":"https://doi.org/10.1145/325164.325132","url":null,"abstract":"DASH is a scalable shared-memory multiprocessor whose architecture consists of powerful processing nodes, each with a portion of the shared-memory, connected to a scalable interconnection network. A key feature of DASH is its distributed direction-based cache coherence protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely on broadcast; instead it uses point-to-point messages sent between the processors and memories to keep caches consistent. Furthermore, the DASH system does not contain any single serialization or control point. While these features provide the basis for scalability, they also force a reevaluation of many fundamental issues involved in the design of a protocol. These include the issues of correctness, performance, and protocol complexity. The design of the DASH coherence protocol is presented and discussed from the viewpoint of how it addresses the above issues. Also discussed is a strategy for verifying the correctness of the protocol. A brief comparison of the protocol with the IEEE Scalable Coherent Interface protocol is made.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115896369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The authors present an empirical evaluation of two memory-efficient directory methods for maintaining coherent caches in large shared-memory multiprocessors. Both directory methods are modifications of a scheme proposed by L.M. Censier and P. Feautrier (1978) that does not rely on a specific interconnection network and can be readily distributed across interleaved main memory. The schemes considered here overcome the large amount of memory required for tags in the original scheme in two different ways. In the first scheme each main memory block is sectored into sub-blocks for which the large tag overhead is shared. In the second scheme a limited number of large tags are stored in an associative cache and shared among a much larger number of main memory blocks. Simulations show that in terms of access time and network traffic both directory methods provide significant performance improvements over a memory system in which shared-writable data are not cached. The large block sizes required for the sectored scheme, however, promote sufficient false sharing for its performance to be markedly worse than when a tag cache is used.<>
{"title":"An empirical evaluation of two memory-efficient directory methods","authors":"Brian W. O'Krafka, A. Richard Newton","doi":"10.1145/325164.325130","DOIUrl":"https://doi.org/10.1145/325164.325130","url":null,"abstract":"The authors present an empirical evaluation of two memory-efficient directory methods for maintaining coherent caches in large shared-memory multiprocessors. Both directory methods are modifications of a scheme proposed by L.M. Censier and P. Feautrier (1978) that does not rely on a specific interconnection network and can be readily distributed across interleaved main memory. The schemes considered here overcome the large amount of memory required for tags in the original scheme in two different ways. In the first scheme each main memory block is sectored into sub-blocks for which the large tag overhead is shared. In the second scheme a limited number of large tags are stored in an associative cache and shared among a much larger number of main memory blocks. Simulations show that in terms of access time and network traffic both directory methods provide significant performance improvements over a memory system in which shared-writable data are not cached. The large block sizes required for the sectored scheme, however, promote sufficient false sharing for its performance to be markedly worse than when a tag cache is used.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125034042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two pipeline models, one implementing a load/store architecture, the other a symmetric architecture, are compared under identical simulation environments. The symmetric architecture instructions are more powerful, but also more complex; therefore the pipeline model for the symmetric architecture contains an additional stage with an additional adder, more bypasses, and an extra port to the register file. The authors' simulations show that the path length of the load/store architecture is 1.12 longer than that of the symmetric architecture. Nevertheless, most of this advantage is lost because of various pipeline delays that reduce the speedup factor from 1.12 to 1.0375. The main delaying contribution is due to resource dependency (0.064 CPI) and control dependency (0.048 CPI).<>
{"title":"Performance comparison of load/store and symmetric instruction set architectures","authors":"D. Alpert, A. Averbuch, O. Danieli","doi":"10.1145/325164.325137","DOIUrl":"https://doi.org/10.1145/325164.325137","url":null,"abstract":"Two pipeline models, one implementing a load/store architecture, the other a symmetric architecture, are compared under identical simulation environments. The symmetric architecture instructions are more powerful, but also more complex; therefore the pipeline model for the symmetric architecture contains an additional stage with an additional adder, more bypasses, and an extra port to the register file. The authors' simulations show that the path length of the load/store architecture is 1.12 longer than that of the symmetric architecture. Nevertheless, most of this advantage is lost because of various pipeline delays that reduce the speedup factor from 1.12 to 1.0375. The main delaying contribution is due to resource dependency (0.064 CPI) and control dependency (0.048 CPI).<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132051764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sequent's Symmetry series is a bus-based shared-memory multiprocessor. System performance in an OLTP (online transaction processing) relational database application was investigated using the TP1 benchmark. System performance was tested with fully cached benchmarks and with scaled benchmarks. In fully-cached tests, the entire database fits inside main memory. In scaled tests, the database is larger than available memory. In the fully-cached benchmark, performance was initially limited by bus saturation. The cause was the transfer of process context from processor to processor. This was eliminated by assigning each process to a processor. Processor affinity was combined with reductions in message passing within the database. Throughput was dramatically improved. The scaled tests were I/O bound. This bottleneck can be eliminated by connecting more disk drives or by increasing the main memory size.<>
{"title":"Performance of an OLTP application on Symmetry multiprocessor system","authors":"S. Thakkar, Mark Sweiger","doi":"10.1145/325164.325149","DOIUrl":"https://doi.org/10.1145/325164.325149","url":null,"abstract":"Sequent's Symmetry series is a bus-based shared-memory multiprocessor. System performance in an OLTP (online transaction processing) relational database application was investigated using the TP1 benchmark. System performance was tested with fully cached benchmarks and with scaled benchmarks. In fully-cached tests, the entire database fits inside main memory. In scaled tests, the database is larger than available memory. In the fully-cached benchmark, performance was initially limited by bus saturation. The cause was the transfer of process context from processor to processor. This was eliminated by assigning each process to a processor. Processor affinity was combined with reductions in message passing within the database. Throughput was dramatically improved. The scaled tests were I/O bound. This bottleneck can be eliminated by connecting more disk drives or by increasing the main memory size.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126844368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A performance metric, normalized time, which is closely related to such measures as the area-time product of VLSI theory and the price/performance ratio of advertising literature is introduced. This metric captures the idea of a piece of hardware 'pulling its own weight', that is, contributing as much to performance as it costs in resources. The authors prove general theorems for stating when the size of a given part is in balance with its utilization and give specific formulas for commonly found linear and quadratic devices. They also apply these formulas to an analysis of a specific processor element and discuss the implications for bit-serial-versus-word-parallel, RISC-versus-CISC (reduced-versus complex-instruction-set-computer), and VLIW (very-long-instruction-word) designs.<>
{"title":"Balance in architectural design","authors":"Samuel Ho, L. Snyder","doi":"10.1145/325164.325156","DOIUrl":"https://doi.org/10.1145/325164.325156","url":null,"abstract":"A performance metric, normalized time, which is closely related to such measures as the area-time product of VLSI theory and the price/performance ratio of advertising literature is introduced. This metric captures the idea of a piece of hardware 'pulling its own weight', that is, contributing as much to performance as it costs in resources. The authors prove general theorems for stating when the size of a given part is in balance with its utilization and give specific formulas for commonly found linear and quadratic devices. They also apply these formulas to an analysis of a specific processor element and discuss the implications for bit-serial-versus-word-parallel, RISC-versus-CISC (reduced-versus complex-instruction-set-computer), and VLIW (very-long-instruction-word) designs.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128085581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The performance evaluation, workload characterization, and trace-driven simulation of a hypercube multicomputer running realistic workloads are presented. Six representative parallel applications were selected as benchmarks. Software monitoring techniques were then used to collect execution traces. On the basis of the measurement results, the authors investigated both the computation and communication behavior of these parallel programs, including CPU utilization, computation task granularity, message interarrival distribution, the distribution of waiting times in receiving messages, and message length and destination distributions. The localities in communication were also studied. A trace-driven simulation environment was developed to study the behavior of the communication hardware under real workloads. Simulation results on DMA and link utilizations are reported.<>
{"title":"Performance measurement and trace driven simulation of parallel CAD and numeric applications on a hypercube multicomputer","authors":"Jiun-Ming Hsu, P. Banerjee","doi":"10.1145/325164.325152","DOIUrl":"https://doi.org/10.1145/325164.325152","url":null,"abstract":"The performance evaluation, workload characterization, and trace-driven simulation of a hypercube multicomputer running realistic workloads are presented. Six representative parallel applications were selected as benchmarks. Software monitoring techniques were then used to collect execution traces. On the basis of the measurement results, the authors investigated both the computation and communication behavior of these parallel programs, including CPU utilization, computation task granularity, message interarrival distribution, the distribution of waiting times in receiving messages, and message length and destination distributions. The localities in communication were also studied. A trace-driven simulation environment was developed to study the behavior of the communication hardware under real workloads. Simulation results on DMA and link utilizations are reported.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132623794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Borkar, R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, J. Webb
The iWarp communication system supports two widely used interprocessor communication styles: memory communication and systolic communication. A description is given of the rationale, architecture, and implementation for the iWarp communication system. Memory communication is flexible and well suited for general computing, whereas systolic communication is efficient and well suited for speed-critical applications. The iWarp design is made possible by two important innovations in communication: (1) program access to communication and (2) logical channels. The former allows programs to access data as they are transmitted and to redirect portions of messages to different destinations efficiently. The latter increases the connectivity between the processors and guarantees communication bandwidth for classes of messages. These innovations have provided a focus for the iWarp architecture. The result is a communication system that provides a total bandwidth of 320 MBytes/sec and that is integrated on a single VLSI component with a 20 MFLOPS plus 20 MIPS long instruction work computation engine.<>
{"title":"Supporting systolic and memory communication in iWarp","authors":"S. Borkar, R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, J. Webb","doi":"10.1145/325164.325116","DOIUrl":"https://doi.org/10.1145/325164.325116","url":null,"abstract":"The iWarp communication system supports two widely used interprocessor communication styles: memory communication and systolic communication. A description is given of the rationale, architecture, and implementation for the iWarp communication system. Memory communication is flexible and well suited for general computing, whereas systolic communication is efficient and well suited for speed-critical applications. The iWarp design is made possible by two important innovations in communication: (1) program access to communication and (2) logical channels. The former allows programs to access data as they are transmitted and to redirect portions of messages to different destinations efficiently. The latter increases the connectivity between the processors and guarantees communication bandwidth for classes of messages. These innovations have provided a focus for the iWarp architecture. The result is a communication system that provides a total bandwidth of 320 MBytes/sec and that is integrated on a single VLSI component with a 20 MFLOPS plus 20 MIPS long instruction work computation engine.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127090417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network throughput can be increased by dividing the buffer storage associated with each network channel into several virtual channels. Each physical channel is associated with several small queues, virtual channels, rather than a single deep queue. The virtual channels associated with one physical channel are allocated independently but compete with each other for physical bandwidth. Virtual channels decouple buffer resources from transmission resources. This decoupling allows active messages to pass blocked messages using network bandwidth that would otherwise be left idle. Simulation studies show that, given a fixed amount of buffer storage per link, virtual-channel flow control increases throughput by a factor of 3.5, approaching the capacity of the network.<>
{"title":"Virtual-channel flow control","authors":"W. Dally","doi":"10.1145/325164.325115","DOIUrl":"https://doi.org/10.1145/325164.325115","url":null,"abstract":"Network throughput can be increased by dividing the buffer storage associated with each network channel into several virtual channels. Each physical channel is associated with several small queues, virtual channels, rather than a single deep queue. The virtual channels associated with one physical channel are allocated independently but compete with each other for physical bandwidth. Virtual channels decouple buffer resources from transmission resources. This decoupling allows active messages to pass blocked messages using network bandwidth that would otherwise be left idle. Simulation studies show that, given a fixed amount of buffer storage per link, virtual-channel flow control increases throughput by a factor of 3.5, approaching the capacity of the network.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128095396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A memory model for a shared-memory multiprocessor commonly and often implicitly assumed by programmers is that of sequential consistency, which guarantees that all memory accesses will appear to execute atomically and in program order. An alternative model, weak ordering, offers greater performance potential. The central hypothesis of this work is that programmers prefer to reason about sequentially consistent memory, rather than have to think about weaker memory, or even write buffers. Following this hypothesis, weak ordering is defined as a contract between software and hardware. By this contract, software agrees to some formally specified constraints, and hardware agrees to appear sequentially consistent, at least to the software that obeys those constraints. The authors illustrate the power of the new definition with a set of software constraints that forbid data races and with an implementation for cache-coherent systems that is not allowed by the old definition.<>
{"title":"Weak ordering-a new definition","authors":"S. Adve, M. Hill","doi":"10.1145/285930.285996","DOIUrl":"https://doi.org/10.1145/285930.285996","url":null,"abstract":"A memory model for a shared-memory multiprocessor commonly and often implicitly assumed by programmers is that of sequential consistency, which guarantees that all memory accesses will appear to execute atomically and in program order. An alternative model, weak ordering, offers greater performance potential. The central hypothesis of this work is that programmers prefer to reason about sequentially consistent memory, rather than have to think about weaker memory, or even write buffers. Following this hypothesis, weak ordering is defined as a contract between software and hardware. By this contract, software agrees to some formally specified constraints, and hardware agrees to appear sequentially consistent, at least to the software that obeys those constraints. The authors illustrate the power of the new definition with a set of software constraints that forbid data races and with an implementation for cache-coherent systems that is not allowed by the old definition.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115173988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}