A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262814
B. Kumar, Chua-Huang Huang, Rodney W. Johnson, P. Sadayappan
A programming methodology based on tensor products has been used for designing and implementing block recursive algorithms for parallel and vector multiprocessors. A previous tensor product formulation of Strassen's matrix multiplication algorithm requires working arrays of size O(7^n) for multiplying 2^n × 2^n matrices. The authors present a modified tensor product formulation of Strassen's algorithm in which the size of the working arrays is reduced to O(4^n). The modified formulation exhibits sufficient parallel and vector operations for efficient implementation. Performance results on the Cray Y-MP are presented.

{"title":"A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction","authors":"B. Kumar, Chua-Huang Huang, Rodney W. Johnson, P. Sadayappan","doi":"10.1109/IPPS.1993.262814","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262814","url":null,"abstract":"A programming methodology based on tensor products has been used for designing and implementing block recursive algorithms for parallel and vector multiprocessors. A previous tensor product formulation of Strassen's matrix multiplication algorithm requires working arrays of size O(7/sup n/) for multiplying 2/sup n/*2/sup n/ matrices. The authors present a modified tensor product formulation of Strassen's algorithm in which the size of working arrays can be reduced to O(4/sup n/). The modified formulation exhibits sufficient parallel and vector operations for efficient implementation. Performance results on the Cray Y-MP are presented.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132027539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic parallelization of LINPACK routines on distributed memory parallel processors
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262774
M. Neeracher, R. Rühl
Distributed memory parallel processors (DMPPs) have no hardware support for a global address space. However, conventional programs written in a sequential imperative language such as Fortran typically manipulate a small number of large arrays. The Oxygen compiler, developed as part of the K2 project, accepts conventional Fortran code augmented with code- and data-distribution directives. These directives support a global name space through a run-time mechanism called data consistency analysis. Many sequential Fortran programs can be efficiently parallelized with Oxygen directives inserted manually by the user into the sequential code. This work presents an analysis pass, added to the compiler, that suggests which directives to insert into the code. Automatic parallelization of LINPACK routines was attempted and results are given.

{"title":"Automatic parallelization of LINPACK routines on distributed memory parallel processors","authors":"M. Neeracher, R. Rühl","doi":"10.1109/IPPS.1993.262774","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262774","url":null,"abstract":"Distributed memory parallel processors (DMPPs) have no hardware support for a global address space. However, conventional programs written in a sequential imperative language such as Fortran typically manipulate few, large arrays. The Oxygen compiler, developed as part of the K2 project, accepts conventional Fortran code, augmented with code and data distribution directives. These directives support a global name space through a run-time mechanism called data consistency analysis. Many sequential Fortran programs can be efficiently parallelized, with Oxygen directives introduced manually by the user into the sequential code. This work presents an analysis pass added to the compiler that makes suggestions for the directives to be inserted into the code. Automatic parallelization of LINPACK routines was attempted and results are given.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133195735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scheduling in and out forests in the presence of communication delays
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262886
T. Varvarigou, V. Roychowdhury, T. Kailath, E. Lawler
The authors consider the problem of scheduling tasks on multiprocessor architectures in the presence of communication delays. Given a set of dependent tasks, the scheduling problem is to allocate the tasks to processors such that the pre-specified precedence constraints among the tasks are obeyed and certain cost measures (such as computation time) are minimized. Several cases of the scheduling problem have been proven to be NP-complete. Nevertheless, there are polynomial-time algorithms for several interesting special cases of the general scheduling problem. Most of these results, however, do not take into consideration the delays due to message passing among processors. The authors study the increase in time complexity of the scheduling problem due to the introduction of communication delays. In particular, they address the open problem of scheduling out-forests (in-forests) in a multiprocessor system of m identical processors when communication delays are considered. They present the first known polynomial-time algorithms for computing an optimal schedule when the number of available processors is given and bounded, and both computation and communication delays are assumed to take one unit of time.

{"title":"Scheduling in and out forests in the presence of communication delays","authors":"T. Varvarigou, V. Roychowdhury, T. Kailath, E. Lawler","doi":"10.1109/IPPS.1993.262886","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262886","url":null,"abstract":"The authors consider the problem of scheduling tasks on multiprocessor architectures in the presence of communication delays. Given a set of dependent tasks, the scheduling problem is to allocate the tasks to processors such that the pre-specified precedence constraints among the tasks are obeyed and certain cost-measures (such as computation time) are minimized. Several cases of the scheduling problem have been proven to be NP-complete. Nevertheless, there are polynomial time algorithms for several interesting special cases of the general scheduling problem. Most of these results, however, do not take into consideration the delays due to message passing among processors. The authors study the increase in time complexity of the scheduling problem due to the introduction of communication delays. In particular, they address the open problem of scheduling out-forests (in-forests) in a multiprocessor system of m identical processors when communication delays are considered. They present first known polynomial time algorithms for the computation of the optimal schedule when the number of available processors is given and bounded and both computation and communication delays are assumed to take one unit of time.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128982889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explicit parallel structuring for rule-based programming
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262829
Shiow-yang Wu, J. Browne
This paper presents semantically based explicit parallel structuring for rule-based programming systems. Explicit parallel structuring appears to be necessary, since compile-time dependency analysis of sequential programs has not yielded large-scale parallelism and run-time analysis for parallelism is restricted by the execution cost of the analysis. Simple language extensions specifying the semantics of rules are used to define parallel execution behavior at the rule level. Type definitions for working-memory elements are extended to include relationships within and among objects, which define the parallelism allowed on instances of object types. The first result presented is that the algorithms implemented by commonly used benchmark rule-based programs contain scalable parallelism. The second result is that much of that parallelism can be captured by simple and modest extensions of rule-based languages, analogous to the models and constructs used to specify parallel structures in imperative programming languages. A sketch is given of a comprehensive language system that exploits the specification of parallelism-defining semantics in both the object-definition and executable segments of rule-based programs.

{"title":"Explicit parallel structuring for rule-based programming","authors":"Shiow-yang Wu, J. Browne","doi":"10.1109/IPPS.1993.262829","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262829","url":null,"abstract":"This paper presents semantically-based explicit parallel structuring for rule-based programming systems. Explicit parallel structuring appears to be necessary since compile-time dependency analysis of sequential programs has not yielded large scale parallelism and run-time analysis for parallelism is restricted by the execution cost of the analysis. Simple language extensions specifying semantics of rules are used to define parallel execution behavior at the rule level. Type definitions for working memory elements are extended to include relationships within and among objects which define the parallelism allowed on instances of object types. The first result presented is that the algorithms implemented by commonly used benchmark rule-based programs contain scalable parallelism. The second result is that much of that parallelism can be captured by simple and modest extensions of rule-based languages which are analogies of models and constructs used for specification of parallel structures in imperative programming languages. A sketch is given for a comprehensive language system which exploits specification of semantics defining parallel structures in both object-definition and executable segments of rule-based programs.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116285269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A portable parallel algorithm for VLSI circuit extraction
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262922
B. Ramkumar, P. Banerjee
The authors describe a new portable algorithm for parallel circuit extraction. The algorithm is built as part of the ongoing ProperCAD project: a portable object-oriented parallel environment for CAD applications built on top of the CHARM system. The algorithm, unlike prior approaches such as PACE, is asynchronous and is based on a coarse-grained dataflow execution model. Performance of circuit extraction is presented on four parallel machines: an Encore Multimax, a Sequent Symmetry, an nCUBE 2 hypercube, and a network of Sun SPARC workstations. The extractor runs unchanged on all these machines.

{"title":"A portable parallel algorithm for VLSI circuit extraction","authors":"B. Ramkumar, P. Banerjee","doi":"10.1109/IPPS.1993.262922","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262922","url":null,"abstract":"The authors describe a new portable algorithm for parallel circuit extraction. The algorithm is built as part of the ongoing ProperCAD project: a portable object-oriented parallel environment for CAD applications that is built on top of the CHARM system. The algorithm, unlike prior approaches like PACE is asynchronous and is based on a coarse-grained dataflow execution model. Performance of circuit extraction is presented on four parallel machines: an Encore Multimax, a Sequent Symmetry, a NCUBE 2 hypercube, and a network of Sun Sparc workstations. The extractor runs unchanged on all these machines.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"485 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116691718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Why BSP computers? (bulk-synchronous parallel computers)
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262847
L. Valiant
The author gives a summary of some of the arguments favoring the adoption of the bulk-synchronous parallel (BSP) model as a standard for parallel computing. First, he argues that for parallel computing to become a major industry, agreement has to be reached on a standard model at a level intermediate between the language and architecture levels. He goes on to list the factors that make the BSP model attractive as a standard at this intermediate, or bridging, level. Finally, he provides some reasons for favoring it over the shared-memory (PRAM) model, which is an alternative candidate for this role.

{"title":"Why BSP computers? (bulk-synchronous parallel computers)","authors":"L. Valiant","doi":"10.1109/IPPS.1993.262847","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262847","url":null,"abstract":"The author gives a summary of some of the arguments favoring the adoption of the bulk-synchronous parallel (BSP) model as a standard for parallel computing. First, he argues that for parallel computing to become a major industry, agreement has to be reached on a standard model at a level intermediate between the language and architecture levels. He goes on to list the factors that make the BSP model attractive as a standard at this intermediate or bridging level. Finally, he provides some reasons for favoring it over the shared memory or PRAM model which is an alternative candidate for this role.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134520510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mapping onto three classes of parallel machines: a case study using the cyclic reduction algorithm
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262888
G. Saghi, H. Siegel, J. L. Gray
Mapping cyclic reduction, a well-known approach for the parallel solution of tridiagonal systems of equations, onto the MasPar MP-1, nCUBE 2, and PASM parallel machines is discussed. Each of these represents a different mode of parallelism. Issues addressed are SIMD/MIMD trade-offs, the effect on execution time of increasing the number of processors used, the impact of the inter-processor communication network on performance, the importance of predicting algorithm performance as a function of the mapping used, and the advantages of a partitionable system. Analytical results are validated by experimentation on all three machines.

{"title":"Mapping onto three classes of parallel machines: a case study using the cyclic reduction algorithm","authors":"G. Saghi, H. Siegel, J. L. Gray","doi":"10.1109/IPPS.1993.262888","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262888","url":null,"abstract":"Mapping cyclic reduction, a known approach for the parallel solution of tridiagonal systems of equations, onto the MasPar MP-1, nCUBE 2, and PASM parallel machines is discussed. Each of these represents a different mode of parallelism. Issues addressed are SIMD/MIMD trade-offs, the effect on execution time of increasing the number of processors used, the impact of the inter-processor communications network on performance, the importance of predicting algorithm performance as a function of the mapping used, and the advantages of a partitionable system. Analytical results are validated by experimentation on all three machines.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125994287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Load balancing of DOALL loops in the Perfect Club
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262868
G. Elsesser, Viet N. Ngo, S. Bhattacharya, W. Tsai
The speedup achieved by concurrent execution of loop iterations is determined by load balance and several other factors, so no single strategy provides maximum speedup for all classes of programs and all target architectures. Hence, the selection of a load-balancing strategy must be guided by characteristics of both the application domain and the target machine architecture. The authors study loop load balance in the context of the well-known Perfect Club benchmarks. Several static and dynamic characteristics of DOALL loops are observed and interpreted. Late arrival of processors is identified as a significant source of load imbalance. A scheme for processor preallocation is proposed, and the advantages and applicability of this scheme are demonstrated by analytical estimates as well as experimental evaluation on a Cray YMP-8.

{"title":"Load balancing of DOALL loops in the Perfect Club","authors":"G. Elsesser, Viet N. Ngo, S. Bhattacharya, W. Tsai","doi":"10.1109/IPPS.1993.262868","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262868","url":null,"abstract":"The speedup achieved by concurrent execution of loop iterations is determined by load balance and several other factors, so no single strategy provides maximum speedup for all classes of programs and all target architectures. Hence, the selection of a load balancing strategy must be guided by characteristics of both the application domain and the target machine architecture. The authors study loop load balance in the context of the well known Perfect Club benchmark. Several static and dynamic characteristics of DOALL loops are observed and interpreted. Late arrival of processors is identified as a significant source of load imbalance. A scheme for processor preallocation is proposed and the advantages and applicability of this scheme are demonstrated by analytical estimates as well as experimental evaluation on a Cray YMP-8.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127947439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A multi-level hierarchical cache coherence protocol for multiprocessors
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262871
Craig Anderson, J. Baer
In order to meet the computational needs of the next decade, shared-memory multiprocessors must be scalable. Though single shared-bus architectures have been successful in the past, lack of bus bandwidth restricts the number of processors that can be effectively put on a single-bus machine. One architecture that has been proposed to solve the limited-bandwidth problem consists of processors connected via a tree hierarchy of buses. The authors present a tool to study a shared-memory system based on such a hierarchy of buses. They highlight the main features of a hierarchical cache coherence protocol and give some preliminary performance results obtained via an instruction-level simulator.

{"title":"A multi-level hierarchical cache coherence protocol for multiprocessors","authors":"Craig Anderson, J. Baer","doi":"10.1109/IPPS.1993.262871","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262871","url":null,"abstract":"In order to meet the computational needs of the next decade, shared-memory processors must be scalable. Though single shared-bus architectures have been successful in the past, lack of bus bandwidth restricts the number of processors that can be effectively put on a single bus machine. One architecture that has been proposed to solve the limited bandwidth problem consists of processors connected via a tree hierarchy of buses. The authors present a tool to study a hierarchical bus based shared-memory system. They highlight the main features of a hierarchical cache coherence protocol and give some preliminary performance results obtained via an instruction level simulator.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130119150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The data-parallel Ada run-time system, simulation and empirical results
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262808
H. G. Mayer, Stefan Jähnichen
The Parallel Ada Run-Time System (PARTS), developed at TUB, is the target of an experimental translator that maps sequential Ada to a shared-memory multiprocessor. Other modules of the parallel compiler are not covered. The paper summarizes the multiprocessor run-time system; it explains the instructions that activate multiple processors, leading to SPMD execution, and discusses the scheduling policy. Default architectural attributes of PARTS can be custom-tailored for each run without recompilation. The experiments exposed different machine personalities by measuring execution-time profiles of the vector product run on different architectures. The goal is to determine experimentally how well a shared-memory architecture scales with increasing problem size, and how well the problem size scales for a fixed multiprocessor configuration. The measurements expose the advantages of shared-memory multiprocessor architectures in exploiting one dimension of parallelism. However, scalability is limited by the number of memory ports. Therefore another architectural dimension of parallelism, distributed memory, must be combined with shared memory to achieve teraFLOP performance.

{"title":"The data-parallel Ada run-time system, simulation and empirical results","authors":"H. G. Mayer, Stefan Jähnichen","doi":"10.1109/IPPS.1993.262808","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262808","url":null,"abstract":"The Parallel Ada Run-Time System (PARTS), developed at TUB, is the target of an experimental translator that maps sequential Ada to a shared-memory multi-processor. Other modules of the parallel compiler are not explained. The paper summarizes the multi-processor run-time system; it explains those instructions that activate multiple processors leading to SPMD execution and discusses the scheduling policy Default architectural attributes of PARTS can be custom-tailored for each run without re-compile. The experiments exposed different machine personalities by measuring execution time profiles of the vector product run on different architectures. The goal is to find experimentally, how well a shared-memory architecture scales up to an increasing problem size, and how well the problem size scales up for a fixed multi-processor configuration. The measurements expose the advantages of shared-memory multi-processor architectures to exploit one dimension of parallelism. However, scalability is limited to the number of memory ports. Therefore another architectural dimension of parallelism, distributed-memory, must be combined with shared memories to achieve Tera-FLOP performance.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130467785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}