Low crosstalk address encodings for optical message switching systems
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262784
Y. Ben-Asher, A. Schuster
An optical message switching system delivers messages from N sources to N destinations using beams of light. The redirection of the beams involves a vector-matrix multiplication and a threshold operation. The authors consider the design of addresses which are both short (so that the number of threshold devices is reduced) and have low crosstalk (so that the sensitivity gap may grow). They show that addresses of O(log N) bits exist for which the crosstalk is a constant fraction of the number of set bits in each address, hence allowing for a Θ(log N)-sized sensitivity gap. More generally, they give the precise coefficient, which depends on the desired gap. It is established that when using O(log N)-bit addresses, the crosstalk cannot be further reduced. An exact construction of O(log² N)-bit addresses is given, where the involved constant depends on the desired crosstalk. Finally, they briefly describe the basic optical elements that can be used to construct a message switching system which uses these address schemes.
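As a rough illustration of the mechanism described above, the sketch below models destination selection as a binary vector-matrix multiplication followed by thresholding. The random codebook, toy sizes, and the `route` helper are invented for the example; the paper's point is precisely that better-than-random address designs exist.

```python
import numpy as np

# Each destination d stores a binary address row A[d]; a message carries the
# address of its target. The optical system in effect computes the overlap of
# the carried address with every stored address in one vector-matrix multiply;
# a threshold device fires where the overlap reaches the number of set bits,
# ideally only at the intended destination. Crosstalk is the largest overlap
# with a wrong address; the sensitivity gap is the margin between the two.

rng = np.random.default_rng(0)
N, bits = 8, 16                                   # toy sizes, not from the paper
A = (rng.random((N, bits)) < 0.5).astype(int)     # random codebook, illustrative

def route(target):
    overlaps = A @ A[target]                      # overlap with every address
    w = A[target].sum()                           # set bits in the target address
    return (overlaps >= w).astype(int)            # threshold operation

for t in range(N):
    fired = route(t)
    crosstalk = max(A[d] @ A[t] for d in range(N) if d != t)
    print(f"target {t}: fires at {np.flatnonzero(fired)}, worst crosstalk {crosstalk}")
```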
Explicit parallel structuring for rule-based programming
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262829
Shiow-yang Wu, J. Browne
This paper presents semantically based explicit parallel structuring for rule-based programming systems. Explicit parallel structuring appears to be necessary, since compile-time dependency analysis of sequential programs has not yielded large-scale parallelism, and run-time analysis for parallelism is restricted by the execution cost of the analysis. Simple language extensions specifying the semantics of rules are used to define parallel execution behavior at the rule level. Type definitions for working-memory elements are extended to include relationships within and among objects, which define the parallelism allowed on instances of object types. The first result presented is that the algorithms implemented by commonly used benchmark rule-based programs contain scalable parallelism. The second result is that much of that parallelism can be captured by simple and modest extensions of rule-based languages, analogous to the models and constructs used to specify parallel structure in imperative programming languages. A sketch is given of a comprehensive language system which exploits the specification of parallelism-defining semantics in both the object-definition and executable segments of rule-based programs.
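A toy sketch of the rule-level parallelism being targeted; the rule, its condition/action split, and the thread pool are illustrative assumptions, since the paper's actual mechanism is language extensions rather than a Python runtime.

```python
from concurrent.futures import ThreadPoolExecutor

# A rule is a (condition, action) pair over working-memory elements. Instantiations
# declared independent by the rule's semantics can fire concurrently, instead of
# passing one at a time through the usual sequential match-select-act cycle.

working_memory = [{"type": "sensor", "id": i, "value": v}
                  for i, v in enumerate([3, 9, 12, 7])]

def high_reading(wme):            # condition part of the (hypothetical) rule
    return wme["value"] > 8

def flag(wme):                    # action part: annotate the matched element
    wme["flagged"] = True
    return wme["id"]

matches = [w for w in working_memory if high_reading(w)]
with ThreadPoolExecutor() as pool:      # fire independent instantiations in parallel
    fired = list(pool.map(flag, matches))
print("rule fired on elements:", fired)   # -> [1, 2]
```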
A portable parallel algorithm for VLSI circuit extraction
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262922
B. Ramkumar, P. Banerjee
The authors describe a new portable algorithm for parallel circuit extraction. The algorithm is built as part of the ongoing ProperCAD project: a portable object-oriented parallel environment for CAD applications built on top of the CHARM system. Unlike prior approaches such as PACE, the algorithm is asynchronous and based on a coarse-grained dataflow execution model. Performance of circuit extraction is presented on four parallel machines: an Encore Multimax, a Sequent Symmetry, an nCUBE 2 hypercube, and a network of Sun Sparc workstations. The extractor runs unchanged on all of these machines.
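A minimal sketch of the coarse-grained dataflow style referred to above, assuming a hypothetical region-per-task decomposition with a boundary merge step; none of the names below come from ProperCAD or CHARM.

```python
from concurrent.futures import ThreadPoolExecutor

# Dataflow shape: the layout is split into regions, each extracted independently
# as soon as its data is available; a merge task runs once both neighbouring
# extractions complete, unifying nets that cross the region boundary.

regions = {"R0": ["netA", "netB"], "R1": ["netB", "netC"]}   # hypothetical slices

def extract(name):
    # stand-in for local circuit extraction within one region
    return {net: f"{name}.{net}" for net in regions[name]}

with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(extract, name) for name in regions}
    local = {name: f.result() for name, f in futures.items()}

# merge task: fires only after both region extractions have produced results
shared = set(local["R0"]) & set(local["R1"])
print("boundary nets merged:", shared)                        # -> {'netB'}
```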
Why BSP computers? (bulk-synchronous parallel computers)
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262847
L. Valiant
The author gives a summary of some of the arguments favoring the adoption of the bulk-synchronous parallel (BSP) model as a standard for parallel computing. First, he argues that for parallel computing to become a major industry, agreement has to be reached on a standard model at a level intermediate between the language and architecture levels. He goes on to list the factors that make the BSP model attractive as a standard at this intermediate, or bridging, level. Finally, he provides some reasons for favoring it over the shared-memory (PRAM) model, which is an alternative candidate for this role.
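For context, the bridging argument rests on BSP's simple cost model: a superstep costs its maximum local work w, plus g times its maximum communication volume h, plus the barrier cost l. The helper and numbers below are illustrative only.

```python
def bsp_cost(supersteps, g, l):
    """Total BSP cost: per superstep, max local work w plus g*h for
    communication plus the barrier latency l. supersteps is a list of
    (w, h) pairs; g and l are the machine's two parameters, so cost can
    be predicted independently of network topology."""
    return sum(w + g * h + l for w, h in supersteps)

# Hypothetical machine parameters and program profile, purely illustrative:
print(bsp_cost([(1000, 50), (800, 20)], g=4.0, l=100.0))   # -> 2280.0
```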
Mapping onto three classes of parallel machines: a case study using the cyclic reduction algorithm
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262888
G. Saghi, H. Siegel, J. L. Gray
Mapping cyclic reduction, a well-known approach for the parallel solution of tridiagonal systems of equations, onto the MasPar MP-1, nCUBE 2, and PASM parallel machines is discussed. Each of these machines represents a different mode of parallelism. Issues addressed include SIMD/MIMD trade-offs, the effect on execution time of increasing the number of processors used, the impact of the inter-processor communication network on performance, the importance of predicting algorithm performance as a function of the mapping used, and the advantages of a partitionable system. Analytical results are validated by experimentation on all three machines.
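Since the case study centers on cyclic reduction, here is a compact sequential sketch of the algorithm for n = 2^k - 1 unknowns (a standard formulation, assumed rather than taken from the paper). The property that matters for parallel mapping is that every elimination within a level is independent.

```python
import numpy as np

def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system: sub-diagonal a, diagonal b, super-diagonal c,
    right-hand side d, for n = 2**k - 1 unknowns. a[0] and c[-1] are boundary
    entries and are forced to zero. All updates within one level are mutually
    independent, which is what maps onto SIMD or MIMD processors."""
    a, b, c, d = (np.asarray(v, dtype=float).copy() for v in (a, b, c, d))
    n = len(b)
    a[0] = c[-1] = 0.0
    s = 1
    while s < n:                                  # forward elimination levels
        for i in range(2 * s - 1, n, 2 * s):      # independent across i
            al = a[i] / b[i - s]
            a[i] = -al * a[i - s]
            b[i] -= al * c[i - s]
            d[i] -= al * d[i - s]
            if i + s < n:
                be = c[i] / b[i + s]
                c[i] = -be * c[i + s]
                b[i] -= be * a[i + s]
                d[i] -= be * d[i + s]
        s *= 2
    x = np.zeros(n)
    s = (n + 1) // 2
    while s >= 1:                                 # back-substitution levels
        for i in range(s - 1, n, 2 * s):          # independent across i
            lo = a[i] * x[i - s] if i - s >= 0 else 0.0
            hi = c[i] * x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - lo - hi) / b[i]
        s //= 2
    return x

# Check against a dense solve on a random diagonally dominant system:
n = 15                                            # 2**4 - 1
rng = np.random.default_rng(1)
b = 4.0 + rng.random(n); a = rng.random(n); c = rng.random(n); d = rng.random(n)
T = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
assert np.allclose(cyclic_reduction(a, b, c, d), np.linalg.solve(T, d))
```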
Load balancing of DOALL loops in the Perfect Club
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262868
G. Elsesser, Viet N. Ngo, S. Bhattacharya, W. Tsai
The speedup achieved by concurrent execution of loop iterations is determined by load balance and several other factors, so no single strategy provides maximum speedup for all classes of programs and all target architectures. Hence, the selection of a load-balancing strategy must be guided by the characteristics of both the application domain and the target machine architecture. The authors study loop load balance in the context of the well-known Perfect Club benchmarks. Several static and dynamic characteristics of DOALL loops are observed and interpreted. Late arrival of processors is identified as a significant source of load imbalance. A scheme for processor preallocation is proposed, and its advantages and applicability are demonstrated by analytical estimates as well as experimental evaluation on a Cray YMP-8.
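To make the load-balance issue concrete, the simulation below contrasts static blocking with dynamic self-scheduling on a skewed iteration-cost profile. The data and the greedy model are invented for illustration and are not the paper's preallocation scheme.

```python
import heapq

# Hypothetical per-iteration costs: a DOALL loop whose last quarter is expensive.
work = [1] * 48 + [10] * 16

def static_blocks(p):
    """Fixed contiguous blocks: loop time is set by the unluckiest processor."""
    size = len(work) // p
    return max(sum(work[r * size:(r + 1) * size]) for r in range(p))

def self_schedule(p):
    """Greedy dynamic scheduling: the least-loaded processor takes each next
    iteration (an idealized model of grabbing work from a shared counter)."""
    heap = [(0, r) for r in range(p)]
    for cost in work:
        t, r = heapq.heappop(heap)
        heapq.heappush(heap, (t + cost, r))
    return max(t for t, _ in heap)

print("static :", static_blocks(4))   # -> 160: one block holds all heavy iterations
print("dynamic:", self_schedule(4))   # -> 52: near the ideal 208 / 4
```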
A multi-level hierarchical cache coherence protocol for multiprocessors
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262871
Craig Anderson, J. Baer
In order to meet the computational needs of the next decade, shared-memory multiprocessors must be scalable. Though single shared-bus architectures have been successful in the past, lack of bus bandwidth restricts the number of processors that can be effectively put on a single-bus machine. One architecture that has been proposed to solve the limited-bandwidth problem consists of processors connected via a tree hierarchy of buses. The authors present a tool for studying a hierarchical bus-based shared-memory system. They highlight the main features of a hierarchical cache coherence protocol and give some preliminary performance results obtained via an instruction-level simulator.
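The toy model below shows the general shape of coherence on a tree of buses: sharers tracked per bus, with invalidations filtered down only into subtrees that hold copies. It is an assumed simplification for a single memory block, not the authors' protocol.

```python
# Each bus records which of its children cache the (single) toy block. A write
# disturbs only subtrees holding copies, keeping traffic off unrelated buses.

class Bus:
    def __init__(self, children):
        self.children = children          # sub-buses or processor leaf names
        self.sharers = set()              # children currently caching the block

    def read(self, path):                 # path: child at this level, then below
        self.sharers.add(path[0])
        if isinstance(path[0], Bus):
            path[0].read(path[1:])

    def write(self, path):
        for child in self.sharers - {path[0]}:
            invalidate(child)             # only subtrees with copies are visited
        self.sharers = {path[0]}
        if isinstance(path[0], Bus):
            path[0].write(path[1:])

def invalidate(node):
    if isinstance(node, Bus):
        for child in node.sharers:
            invalidate(child)
        node.sharers.clear()
    else:
        print(f"invalidate copy at {node}")

left, right = Bus(["P0", "P1"]), Bus(["P2", "P3"])
root = Bus([left, right])
root.read([left, "P0"]); root.read([right, "P3"])
root.write([left, "P0"])                  # -> invalidate copy at P3
```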
The data-parallel Ada run-time system, simulation and empirical results
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262808
H. G. Mayer, Stefan Jähnichen
The Parallel Ada Run-Time System (PARTS), developed at TUB, is the target of an experimental translator that maps sequential Ada onto a shared-memory multiprocessor. Other modules of the parallel compiler are not explained here. The paper summarizes the multiprocessor run-time system; it explains the instructions that activate multiple processors, leading to SPMD execution, and discusses the scheduling policy. Default architectural attributes of PARTS can be custom-tailored for each run without recompilation. The experiments exposed different machine personalities by measuring execution-time profiles of the vector product run on different architectures. The goal is to find experimentally how well a shared-memory architecture scales with increasing problem size, and how well the problem size scales for a fixed multiprocessor configuration. The measurements expose the ability of shared-memory multiprocessor architectures to exploit one dimension of parallelism. However, scalability is limited by the number of memory ports. Therefore another architectural dimension of parallelism, distributed memory, must be combined with shared memory to achieve teraflop performance.
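A sketch of the measured kernel, the vector product, in SPMD style, assuming the usual block decomposition; this Python stand-in illustrates only the execution shape, not PARTS itself.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def spmd_dot(x, y, p):
    """Each of p 'processors' reduces its own block of the vector product;
    the partial sums are then combined, which is where contention for
    shared-memory ports limits scalability."""
    blocks = np.array_split(np.arange(len(x)), p)
    def local(rank):
        idx = blocks[rank]
        return float(x[idx] @ y[idx])     # local work on one block
    with ThreadPoolExecutor(p) as pool:
        return sum(pool.map(local, range(p)))   # combining step

x = np.arange(1024.0); y = np.ones(1024)
assert spmd_dot(x, y, 8) == float(x @ y)
print(spmd_dot(x, y, 8))                  # -> 523776.0
```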
Dynamic embeddings of trees and quasi-grids into hyper-de Bruijn networks
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262823
Sabine R. Öhring, Sajal K. Das
This paper deals with optimal embeddings of various topologies into the hyper-de Bruijn network, which is a combination of the well-known hypercube and the de Bruijn graph. In particular, the authors develop modular embeddings of complete binary trees and other tree-related graphs, and dynamic task-allocation embeddings of dynamically evolving arbitrary binary trees. Additionally, an optimal embedding of butterflies and a subgraph embedding of cube-connected cycles are presented. They also consider how to dynamically embed evolving grid structures (so-called quasi-grids) into hyper-de Bruijn networks. The results are important for mapping data and algorithm structures onto multiprocessor networks.
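For context, the hyper-de Bruijn network HD(m, n) is commonly defined in the related literature as the product of an m-dimensional hypercube and a binary de Bruijn graph of order n; that definition is assumed here, since the abstract does not spell it out. A node is then a pair (x, y) with neighbors computed as follows.

```python
def neighbors(x, y, m, n):
    """Neighbors of node (x, y) in HD(m, n): x is an m-bit hypercube label,
    y an n-bit de Bruijn label. Edges either flip one bit of x (hypercube
    dimension) or shift y left or right by one bit (de Bruijn dimension)."""
    hyper = [(x ^ (1 << i), y) for i in range(m)]                # cube edges
    mask = (1 << n) - 1
    succ = [(x, ((y << 1) & mask) | b) for b in (0, 1)]          # left shifts
    pred = [(x, (y >> 1) | (b << (n - 1))) for b in (0, 1)]      # right shifts
    return hyper + succ + pred

# Node (x=10, y=011) in HD(2, 3): two cube neighbors, four shift neighbors.
print(neighbors(0b10, 0b011, m=2, n=3))
```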
OCCAM prototyping of massively parallel applications from colored Petri-nets
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262772
F. Breant, Jean-François Peyre
The authors present a technique to build a massively parallel application from a formal description. They use the colored Petri-net formalism to model applications, which allows them to describe parallel applications concisely. Theoretical results on this formalism help to prove the correctness of the description before implementation. Furthermore, they use linear invariants to decompose the model into interacting state machines which are easy to implement. An important feature introduced is the use of color to map the state machines, and to distribute data and communication, onto a formal architecture description.
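A minimal sketch of the underlying formalism: the generic place/transition firing rule plus the role of a linear invariant, stated generically and not taken from the authors' OCCAM generator. A set of places whose total token count is invariant and equal to one behaves as a single state machine, which is what makes the decomposition implementable as one process per invariant.

```python
# A transition is enabled when every input place holds enough tokens; firing
# consumes and produces tokens. The three places below carry exactly one token
# between them at all times (a place invariant), so they form one state machine.

marking = {"idle": 1, "busy": 0, "done": 0}
transitions = {
    "start":  ({"idle": 1}, {"busy": 1}),     # (consumed, produced)
    "finish": ({"busy": 1}, {"done": 1}),
}

def fire(name):
    consumed, produced = transitions[name]
    if all(marking[p] >= k for p, k in consumed.items()):
        for p, k in consumed.items():
            marking[p] -= k
        for p, k in produced.items():
            marking[p] += k
        return True
    return False

fire("start"); fire("finish")
assert sum(marking.values()) == 1             # the invariant holds throughout
print(marking)                                # -> {'idle': 0, 'busy': 0, 'done': 1}
```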