Beyond execution time: expanding the use of performance models
G. D. Peterson, R. Chamberlain (doi: 10.1109/88.311571)
Improved performance is a major motivation for using parallel computation. However, performance models are frequently used only to predict an algorithm's execution time, not to evaluate how the choices of architecture, operating system, interprocessor communication protocol, and programming language also dramatically affect parallel performance. We have developed an analytic model for synchronous iterative algorithms running on distributed-memory MIMD machines, and refined it for discrete-event simulation. The model describes the execution time of a single run in terms of application parameters, such as the number of iterations and the computation required in each, and architectural parameters, such as the number of processors, processor speed, and communication time. Our experience has shown that an analytic model can not only accurately predict an algorithm's performance but also match the algorithm to an appropriate architecture, identify ways to improve the algorithm's performance, quantify the performance effects of algorithmic or architectural changes, and provide a better understanding of how the algorithm works.
A scalable debugger for massively parallel message-passing programs
S. Sistare, Don Allen, R. Bowker, K. Jourdenais, Josh Simons, R. Title (doi: 10.1109/88.311572)
In a message-passing program, there are at least as many threads as processors, and the programmer must deal with large numbers of them on a massively parallel machine. On our target machine, the CM-5, we had previously developed Prism, a programming environment that supports debugging, data visualization, and performance analysis of data-parallel programs. We discuss how our new version, Node Prism, extends Prism's capabilities to message-passing programs. It looks and feels like the data-parallel version, but it uses new methods for user-debugger interaction that promote greater understanding of parallel programs. It offers scalable expression, execution, and interpretation of all debugging operations, making it easier to debug and understand message-passing programs.
A distributed snooping algorithm for pixel merging
M. Cox, P. Hanrahan (doi: 10.1109/88.311570)
Previous pixel-merging algorithms have required special-purpose networks and have used more network bandwidth than necessary. We developed an algorithm that merges pixels on shared-memory bus multiprocessors, using an existing bus. Analysis and simulations suggest that it uses less bus bandwidth than other algorithms. We based our algorithm on the snooping cache-coherency protocols on which a number of shared-memory multiprocessors have been based. In these architectures, each processor keeps its cache consistent with other processors' memories by listening (snooping) on a shared bus over which memory updates are written. Snooping maintains consistent globally shared memory. Our algorithm assists graphics rendering by letting processors compare pixel values and delete those pixels that do not contribute to the final image. This reduces network bandwidth requirements and eliminates the need for a special-purpose network.
Multiapplication support in a parallel-program performance tool
R. Irvin, B. Miller (doi: 10.1109/88.281874)
We added new features for analyzing multiple programs to the IPS-2 parallel-program performance tools and were surprised at the wide range of performance problems for which this modified IPS-2 can be used. With multiapplication IPS-2, programmers can simultaneously run and analyze cooperating or contending applications; combine performance displays and metrics of multiple applications or multiple versions of the same application to directly compare performance; analyze critical paths of execution for individual applications, for a single application and the applications with which it interacts, or for entire workloads; study how the application workload performance affects the hardware, operating system, and network performance; study an application's evolution through multiple versions, hardware platforms, or input sets; study a workload's aggregate behavior, how applications interact, or how individual applications perform in the presence of other applications; and compare the measured performance of a program with predictions made by simulations or analytical models. This modified parallel-program performance tool analyzes multiple applications in a single session, allowing better performance tuning than is possible when programs are run in isolation.
Defining, analyzing, and transforming program constructs
Jingke Li, M. Wolfe (doi: 10.1109/88.281872)
We have developed a framework for analyzing the behavior and relations of various sequential and parallel control constructs, which we can nest in a very general way. A simple yet powerful scheme defines the order of data accesses in a program and provides a well-founded semantic structure for nested constructs. When defining parallel languages or extensions to current languages, designers can use this framework to define how each new feature interacts with the language's other features. Because our approach is based on well-known dependence analysis techniques, it is practical for compiler implementation. It determines which behavior the compiler and system must preserve while allowing aggressive automatic optimization. Instead of being confined to a single programming paradigm, programmers can use the most appropriate constructs for the application, and the compiler can transform and optimize the program for different parallel or sequential architectures.
How to measure, present, and compare parallel performance
L. Crowl (doi: 10.1109/88.281869)
Presentations of parallel performance can be easily and unintentionally distorted. Following some simple guidelines for measuring, presenting, and comparing performance can greatly improve your presentation's accuracy and effectiveness.
Spin-lock synchronization on the Butterfly and KSR1
Xiaodong Zhang, R. Castañeda, E. Chan (doi: 10.1109/88.281875)
The drawbacks of the simple spin-lock limit its effective use to small critical sections. Applications with large critical sections and a large number of processors require more efficient algorithms to minimize processor and network overheads. Variations on the spin-lock have been tested on the Sequent Symmetry, a bus-based shared-memory multiprocessor. Algorithms for scalable synchronization have also been tested on the BBN Butterfly I, a large-scale shared-memory multiprocessor with a multistage interconnection network (MIN). We have extended the investigation to the BBN GP1000 and TC2000, both MIN-based multiprocessors with heavier network contention than the Butterfly I. We have also implemented the algorithms on Kendall Square Research's KSR1, a hierarchical-ring (HR) multiprocessor system, to study the effects of cache coherence. The execution behavior of spin-lock algorithms differs significantly between MIN-based and HR-based architectures. Our tests suggest that HR-based architectures handle network and memory contention more efficiently than MIN-based architectures. However, our results also suggest how spin-locks can be made cost-effective on both.
Compiling functional parallelism on distributed-memory systems
S. Pande, D. Agrawal, J. Mauney (doi: 10.1109/88.281878)
We have developed an automatic compilation method that combines data- and code-based approaches to schedule a program's functional parallelism onto distributed-memory systems. Our method works with Sisal, a parallel functional language, and replaces the back end of the Optimizing Sisal Compiler so that it produces code for distributed-memory systems. Our extensions allow the compiler to generate code for Intel's distributed-memory Touchstone iPSC/860 machines (Gamma, Delta, and Paragon). The modified compiler can generate a partition that minimizes program completion time (for systems with many processors) or the required number of processors (for systems with few processors). To accomplish this, we have developed a heuristic algorithm that uses the new concept of threshold to treat the problem of scheduling as a trade-off between schedule length and the number of required processors. Most compilers for distributed-memory systems force the programmer to partition the data or the program code. This modified version of a Sisal compiler handles both tasks automatically in a unified framework, and lets the programmer compile for a chosen number of processors.
Teraflops and other false goals
J. Gustafson (doi: 10.1109/MCC.1994.10013)
The High-Performance Computing and Communications (HPCC) program has been attacked for having vague and dubious goals. Partly because of this perception, 65 members of the House of Representatives voted against extending it. We need concrete measures of HPCC progress, but without such narrowly defined goals. Measuring performance by high flops rates, speedup, and hardware efficiency can take us further from the solution of scientific problems, not closer. This paradox is especially pronounced for "Grand Challenge" and "teraflops computing". The author argues that we need a practical way to define and communicate the ends-based performance of an application, rather than means-based measures such as teraflops or double precision. Human productivity issues, such as development time and cost and the quality of the knowledge we obtain, should be the basis of our performance metrics.
Distributed computation of wave propagation models using PVM
R. Ewing, R. Sharpley, D. Mitchum, Patrick O'Leary, J. Sochacki (doi: 10.1145/169627.169642)
The Parallel Virtual Machine (PVM) allows researchers to connect workstations, mini-supercomputers, or specialty machines to form a relatively inexpensive, powerful parallel computer. Such hardware is frequently abundant at research locations, so PVM incurs little or no hardware costs. PVM is also flexible: it uses existing communication networks (Ethernet or fiber) and remote procedure libraries; it lets programmers use either C or Fortran; and it can emulate several commercial architectures, including hypercubes, meshes, and rings. The authors believe that PVM can compete effectively with traditional supercomputers, and they have demonstrated its computational power and cost-effectiveness by simulating the propagation of seismic waves on an isolated Ethernet ring comprising an IBM RS/6000 550 as the host and six RS/6000 320H machines as the nodes.