Interprocedural Compilation of Irregular Applications for Distributed Memory Machines
G. Agrawal, J. Saltz. doi:10.1145/224170.224336
Data-parallel languages like High Performance Fortran (HPF) are emerging as the architecture-independent way to program distributed-memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications with irregular data access patterns written in such data-parallel languages. We have developed an Interprocedural Partial Redundancy Elimination (IPRE) algorithm for optimized placement of the runtime preprocessing and collective communication routines inserted to manage communication in such codes. We also present two new interprocedural optimizations: placement of scatter routines and use of coalescing and incremental routines. We then describe how program slicing can be used to apply IPRE in more complex scenarios. We have completed a preliminary implementation of these schemes using the Fortran D compilation system as the underlying infrastructure, and we present experimental results from two codes compiled with our system to demonstrate the efficacy of the presented schemes.
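The placement problem can be pictured with a small sketch. The Python fragment below (hypothetical names, not the Fortran D compiler's actual runtime library) shows an inspector/executor-style irregular loop: the inspector's communication schedule is the kind of runtime preprocessing result whose placement IPRE optimizes, and when the indirection array is unchanged across iterations and procedure calls, the schedule can be computed once and reused.

```python
# Hypothetical inspector/executor sketch; all names are illustrative.

def build_schedule(ia, owner_of):
    """Inspector: group the indices named by ia by the processor that owns them."""
    sched = {}
    for idx in set(ia):
        sched.setdefault(owner_of(idx), []).append(idx)
    return sched

def gather(sched, fetch):
    """Executor communication: fetch remote values, one request per owning processor."""
    remote = {}
    for proc, idxs in sched.items():
        remote.update(fetch(proc, idxs))   # fetch returns {index: value}
    return remote

def irregular_sweeps(x, ia, owner_of, fetch, steps):
    """Repeated irregular update x[i] += y[ia[i]]; ia has one entry per element of x."""
    sched = build_schedule(ia, owner_of)   # preprocessing hoisted out of the loop
    for _ in range(steps):
        yvals = gather(sched, fetch)       # collective gather over the schedule
        x = [x[i] + yvals[ia[i]] for i in range(len(x))]
    return x
```

Naive placement would call build_schedule inside the time-step loop (or inside each called procedure); interprocedural placement hoists it across loop and call boundaries when the indirection data is provably unchanged.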
Parallel Algorithms for Forward and Back Substitution in Direct Solution of Sparse Linear Systems
Anshul Gupta, Vipin Kumar. doi:10.1145/224170.224471
A few parallel algorithms for solving triangular systems resulting from parallel factorization of sparse linear systems have recently been proposed and implemented. We present a detailed analysis of the parallel complexity and scalability of the best of these algorithms, along with results from its implementation on up to 256 processors of the Cray T3D parallel computer. It has been a common belief that parallel sparse triangular solvers are quite unscalable due to their high communication-to-computation ratio. Our analysis and experiments show that, although not as scalable as the best parallel sparse Cholesky factorization algorithms, parallel sparse triangular solvers can yield reasonable speedups on hundreds of processors. We also show that for a wide class of problems, the sparse triangular solvers described in this paper are optimal and are asymptotically as scalable as a dense triangular solver.
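For context, the kernel being parallelized is sparse triangular substitution. A minimal sequential sketch in Python follows, assuming CSR storage with the diagonal entry stored last in each row; the paper's contribution, which this sketch does not capture, is the parallel formulation and its scalability analysis.

```python
# Sequential sparse forward substitution: solve L y = b for lower-triangular L
# in CSR form (row i occupies data[indptr[i]:indptr[i+1]] at columns
# indices[indptr[i]:indptr[i+1]], with the diagonal entry stored last).

def sparse_forward_substitution(indptr, indices, data, b):
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = b[i]
        start, end = indptr[i], indptr[i + 1]
        for k in range(start, end - 1):        # strictly-lower entries of row i
            s -= data[k] * y[indices[k]]
        y[i] = s / data[end - 1]               # divide by the diagonal L[i][i]
    return y

# Example: L = [[2, 0], [3, 4]], b = [2, 10]
#   indptr = [0, 1, 3], indices = [0, 0, 1], data = [2, 3, 4]
#   -> y = [1.0, 1.75]
```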
Relative Debugging and its Application to the Development of Large Numerical Models
D. Abramson, Ian T Foster, J. Michalakes, R. Sosič. doi:10.1145/224170.224350
Because large scientific codes are rarely static objects, developers are often faced with the tedious task of accounting for discrepancies between new and old versions. In this paper, we describe a new technique called relative debugging that addresses this problem by automating the process of comparing a modified code against a correct reference code. We examine the utility of the relative debugging technique by applying a relative debugger called Guard to a range of debugging problems in a large atmospheric circulation model. Our experience confirms the effectiveness of the approach. Using Guard, we are able to validate a new sequential version of the atmospheric model, and to identify the source of a significant discrepancy in a parallel version in a short period of time.
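The core idea can be illustrated with a small, hypothetical sketch (this is not Guard's actual interface): values recorded at user-declared points in a reference run and a modified run are compared automatically, and divergences are reported rather than hunted down by hand.

```python
# Hypothetical illustration of relative debugging: each trace maps
# (point_label, variable_name) -> list of recorded values for one run.

def compare_runs(reference_trace, modified_trace, tol=1e-10):
    discrepancies = []
    for key, ref_vals in reference_trace.items():
        new_vals = modified_trace.get(key)
        if new_vals is None:
            discrepancies.append((key, "missing in modified run"))
            continue
        for i, (a, b) in enumerate(zip(ref_vals, new_vals)):
            if abs(a - b) > tol * max(1.0, abs(a)):   # relative tolerance check
                discrepancies.append((key, f"element {i}: {a} vs {b}"))
    return discrepancies
```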
Quantum Chromodynamics Simulation on NWT
M. Yoshida, A. Nakamura, M. Fukuda, Takashi Nakamura, S. Hioki. doi:10.1145/224170.224403
A portable QCD simulation program running on the NWT with 128 PEs achieves a performance of 0.032 microseconds per link update, about 189 times faster than a highly optimized code on a four-processor CRAY X-MP/48. This performance corresponds to a sustained speed of 178 GFLOPS.
High-Performance Incremental Scheduling on Massively Parallel Computers — A Global Approach
Minyou Wu, W. Shu. doi:10.1145/224170.224358
Runtime incremental parallel scheduling (RIPS) is a new approach to load balancing. In parallel scheduling, all processors cooperate to balance the workload, and the load is balanced accurately by using global load information. In incremental scheduling, the system's scheduling activity alternates with the underlying computation. RIPS produces high-quality load balancing and adapts to applications with nonuniform structures. This paper presents methods for scheduling a single job on a dedicated parallel machine.
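A hedged sketch of the structure described above: bounded computation phases alternate with global scheduling phases in which every processor, using global load information, helps even out the remaining work. The comm interface (allgather, send_tasks, recv_tasks) and the transfer policy below are illustrative assumptions, not the RIPS algorithm itself.

```python
# Hypothetical alternation of computation and global scheduling phases.

def rips_like_loop(local_queue, comm, run_task, phase_len=64):
    while True:
        # Computation phase: execute a bounded amount of local work.
        for _ in range(min(phase_len, len(local_queue))):
            run_task(local_queue.pop())

        # Scheduling phase: all processors exchange load information ...
        loads = comm.allgather(len(local_queue))
        if sum(loads) == 0:
            break                                # no work left anywhere
        average = sum(loads) / len(loads)
        # ... and cooperatively shift surplus work toward underloaded processors.
        if len(local_queue) > average:
            surplus = int(len(local_queue) - average)
            comm.send_tasks([local_queue.pop() for _ in range(surplus)])
        else:
            local_queue.extend(comm.recv_tasks())   # may return an empty list
```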
Lazy Release Consistency for Hardware-Coherent Multiprocessors
L. Kontothanassis, M. Scott, R. Bianchini. doi:10.1145/224170.224398
Release consistency is a widely accepted memory model for distributed shared memory systems. Eager release consistency represents the state of the art in release-consistent protocols for hardware-coherent multiprocessors, while lazy release consistency has been shown to provide better performance for software distributed shared memory (DSM). Several of the optimizations performed by lazy protocols have the potential to improve the performance of hardware-coherent multiprocessors as well, but their complexity has precluded a hardware implementation. With the advent of programmable protocol processors, it may become possible to exploit these optimizations after all. We present and evaluate a lazy release-consistent protocol suitable for machines with dedicated protocol processors. This protocol admits multiple concurrent writers, sends write notices concurrently with computation, and delays invalidations until acquire operations. We also consider a lazier protocol that delays sending write notices until release operations. Our results indicate that the first protocol outperforms eager release consistency by as much as 20% across a variety of applications. The lazier protocol, on the other hand, is unable to recoup its high synchronization overhead. This represents a qualitative shift from the DSM world, where lazier protocols always yield performance improvements. Based on our results, we conclude that machines with flexible hardware support for coherence should use protocols based on lazy release consistency, but in a less "aggressively lazy" form than is appropriate for DSM.
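The behaviour of the first protocol can be sketched as follows (a simplification, not the protocol-processor implementation): writers record and forward write notices while computation continues, and a processor applies the resulting invalidations only when it performs an acquire. The network methods below are assumed names for an asynchronous messaging layer.

```python
# Simplified lazy release-consistency sketch; `network` is a hypothetical layer.

class LazyNode:
    def __init__(self, node_id, network):
        self.id = node_id
        self.network = network
        self.pending_notices = []   # write notices buffered until the next acquire
        self.valid_pages = set()

    def write(self, page):
        # Multiple concurrent writers are admitted; the notice is sent in the
        # background, overlapping communication with computation.
        self.valid_pages.add(page)
        self.network.broadcast_async({"writer": self.id, "page": page})

    def deliver(self, notice):
        # Incoming notices are only buffered here; nothing is invalidated yet.
        self.pending_notices.append(notice)

    def acquire(self, lock):
        self.network.lock(lock)
        # Invalidations are delayed until this acquire.
        for notice in self.pending_notices:
            if notice["writer"] != self.id:
                self.valid_pages.discard(notice["page"])
        self.pending_notices.clear()

    def release(self, lock):
        self.network.unlock(lock)
```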
MONSTER — The Ghost in the Connection Machine: Modularity of Neural Systems in Theoretical Evolutionary Research
Nigel Snoad, T. Bossomaier. doi:10.1145/224170.224226
Both genetic algorithms (GAs) and artificial neural networks (ANNs, or connectionist learning models) are effective generalisations of successful biological techniques to the artificial realm. Both techniques are inherently parallel and seem ideal for implementation on the current generation of parallel supercomputers. We consider how the two techniques complement each other and how combining them (i.e., evolving artificial neural networks with a genetic algorithm) may give insights into the evolution of structure and modularity in biological brains. The incorporation of evolutionary and modularity concepts into artificial systems has the potential to decrease the development time of ANNs for specific image and information processing applications. General considerations in genetically encoding ANNs are discussed, and a new encoding method is developed that has the potential to simplify the generation of complex modular networks. The implementation of this technique on a CM-5 parallel supercomputer raises many practical and theoretical questions about the application and use of evolutionary models with artificial neural networks.
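As a minimal illustration of evolving an ANN with a GA, the sketch below encodes the weights of a tiny fixed-topology network directly in the genome; the paper's contribution is a richer, modular encoding, which this sketch does not reproduce.

```python
# Minimal GA-over-ANN-weights sketch; the 2-2-1 topology and GA parameters are
# illustrative choices, not those used in the paper.
import math
import random

def forward(weights, x):
    """2-2-1 tanh network; weights = [hidden unit 0 (bias, w1, w2),
    hidden unit 1 (bias, w1, w2), output (bias, w1, w2)]."""
    h = [math.tanh(weights[3*j] + weights[3*j+1]*x[0] + weights[3*j+2]*x[1])
         for j in range(2)]
    return math.tanh(weights[6] + weights[7]*h[0] + weights[8]*h[1])

def fitness(weights, samples):
    return -sum((forward(weights, x) - y) ** 2 for x, y in samples)

def evolve(samples, pop_size=40, generations=200, sigma=0.3):
    pop = [[random.uniform(-1, 1) for _ in range(9)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, samples), reverse=True)
        parents = pop[:pop_size // 2]                     # truncation selection
        children = [[w + random.gauss(0, sigma) for w in random.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, samples))
```

For example, samples = [((0, 0), -1.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), -1.0)] evolves a network approximating XOR with outputs in [-1, 1].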
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms
D. Jayasimha, M. Hayder, S. K. Pillay. doi:10.1145/224170.224410
We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time-accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared-memory multiprocessor (the Cray YMP), and distributed-memory multiprocessors with different topologies (the IBM SP and the Cray T3D). We investigate the impact of the various networks connecting the cluster of workstations on the performance of the application, as well as the overheads induced by popular message-passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single-processor performance. By studying the performance of the application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallel Matrix-Vector Product Using Approximate Hierarchical Methods
A. Grama, Vipin Kumar, A. Sameh. doi:10.1145/224170.224487
Matrix-vector products (mat-vecs) form the core of iterative methods used for solving dense linear systems. Such systems often arise in the solution of integral equations used in electromagnetics, heat transfer, and wave propagation. In this paper, we present a parallel approximate method for computing the mat-vecs used in the solution of integral equations, and we use it to compute dense mat-vecs involving hundreds of thousands of elements. The combined speedups obtained from the use of approximate methods and parallel processing represent an improvement of several orders of magnitude over exact mat-vecs on uniprocessors. We demonstrate that our parallel formulation incurs minimal parallel processing overhead and scales up to a large number of processors. We also study the impact of varying the accuracy of the approximate mat-vec on overall time and on parallel efficiency. Experimental results are presented for the 256-processor Cray T3D and Thinking Machines CM-5 parallel computers. We have achieved computation rates in excess of 5 GFLOPS on the T3D.
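The approximation behind hierarchical mat-vecs can be sketched in one dimension: contributions from far-away source clusters are collapsed into a single aggregate term, so each output element costs work proportional to the number of clusters rather than to n. The clustering, acceptance test, and parameter names below are illustrative assumptions; the paper's parallel formulation and accuracy control are not reproduced.

```python
# 1-D hierarchical approximation for a kernel mat-vec: result[i] = sum over j
# of charges[j] * kernel(points[i], points[j]), with far clusters approximated
# by a single term evaluated at the cluster centre.

def hierarchical_matvec(points, charges, kernel, cluster_size=32, theta=2.0):
    # Partition sources into contiguous clusters and precompute their aggregates.
    clusters = []
    for s in range(0, len(points), cluster_size):
        idx = list(range(s, min(s + cluster_size, len(points))))
        total = sum(charges[j] for j in idx)
        center = sum(points[j] for j in idx) / len(idx)
        radius = max(abs(points[j] - center) for j in idx)
        clusters.append((idx, total, center, radius))

    result = []
    for i, xi in enumerate(points):
        acc = 0.0
        for idx, total, center, radius in clusters:
            if radius > 0 and abs(xi - center) > theta * radius:
                acc += total * kernel(xi, center)          # far field: one term
            else:
                acc += sum(charges[j] * kernel(xi, points[j])
                           for j in idx if j != i)         # near field: exact
        result.append(acc)
    return result
```

Lowering theta or raising cluster_size trades accuracy for speed, which is the accuracy-versus-time trade-off the paper studies on parallel machines.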
Index Array Flattening Through Program Transformation
R. Das, P. Havlak, J. Saltz, K. Kennedy. doi:10.1145/224170.224420
This paper presents techniques for compiling loops with complex, indirect array accesses into loops whose array references have at most one level of indirection. The transformation allows prefetching of array indices for more efficient structuring of communication on distributed-memory machines. It can also improve performance on other architectures by enabling prefetching of data between levels of the memory hierarchy or exploitation of hardware support for vectorized gather/scatter. Our techniques are implemented in a compiler for Fortran D, and execution speed improvements are given for multiprocessor and vector machines.
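The transformation can be illustrated on a two-level indirect reference (shown in Python rather than Fortran D): precomputing a flattened index array reduces the reference to a single level of indirection, and the flattened array is exactly what can be prefetched or fed to a vector gather. The example assumes the index arrays are not modified inside the loop.

```python
# Before: two levels of indirection in the array reference x[ib[ia[i]]].
def loop_with_two_level_indirection(x, ia, ib):
    return [x[ib[ia[i]]] for i in range(len(ia))]

# After flattening: the index arrays are combined up front, leaving the main
# loop with at most one level of indirection.
def loop_after_flattening(x, ia, ib):
    flat = [ib[ia[i]] for i in range(len(ia))]   # precomputed, prefetchable indices
    return [x[flat[i]] for i in range(len(ia))]  # single-level gather
```

Both functions return identical results; the payoff is that flat can be communicated or prefetched before the main loop runs.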