Dynamic embeddings of trees and quasi-grids into hyper-de Bruijn networks
Sabine R. Öhring, Sajal K. Das
[1993] Proceedings Seventh International Parallel Processing Symposium. Pub Date: 1993-04-13. DOI: 10.1109/IPPS.1993.262823
This paper deals with optimal embeddings of various topologies into the hyper-de Bruijn network, a combination of the well-known hypercube and the de Bruijn graph. In particular, the authors develop modular embeddings of complete binary trees and other tree-related graphs, as well as dynamic task-allocation embeddings of dynamically evolving arbitrary binary trees. Additionally, an optimal embedding of butterflies and a subgraph embedding of cube-connected cycles are presented. The authors also show how to dynamically embed evolving grid structures (so-called quasi-grids) into hyper-de Bruijn networks. These results are important for mapping data and algorithm structures onto multiprocessor networks.
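The two constituent topologies are easy to state concretely. As a quick illustration (not taken from the paper), the following sketch shows the adjacency rules being combined: a hypercube neighbor differs in one bit, while a binary de Bruijn neighbor is reached by shifting the node label and appending a bit.

```python
def hypercube_neighbors(v: int, n: int) -> list[int]:
    """Neighbors of node v in the n-dimensional hypercube: flip one bit."""
    return [v ^ (1 << i) for i in range(n)]

def de_bruijn_neighbors(v: int, n: int) -> list[int]:
    """Neighbors of node v in the binary de Bruijn graph on 2**n nodes:
    shift the n-bit label left or right and fill in either bit."""
    mask = (1 << n) - 1
    left = [((v << 1) & mask) | b for b in (0, 1)]        # shuffle edges
    right = [(v >> 1) | (b << (n - 1)) for b in (0, 1)]   # inverse shuffle
    return left + right
```

A hyper-de Bruijn node carries a pair of labels, one per constituent graph, and is adjacent along either kind of edge; this sketch only shows the two edge rules separately.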
OCCAM prototyping of massively parallel applications from colored Petri-nets
F. Breant, Jean-François Peyre
Pub Date: 1993-04-13. DOI: 10.1109/IPPS.1993.262772
The authors present a technique for building a massively parallel application from a formal description. They use the colored Petri-net formalism to model applications, which allows them to describe parallel applications concisely. Theoretical results on this formalism help prove the correctness of the description before implementation. Furthermore, they use linear invariants to decompose the model into interacting state machines that are easy to implement. An important feature is the use of colors to map state machines and to distribute data and communication onto a formal architecture description.
Towards optimal parallel radix sorting
R. Vaidyanathan, C. Hartmann, P. Varshney
Pub Date: 1993-04-13. DOI: 10.1109/IPPS.1993.262880
The authors propose a radix sorting algorithm for n m-bit numbers (where m = Omega(log n) and m is polynomially upper-bounded in n) that runs in O(t(n) log m) time on any PRAM with m·p(n)/(log n · log m) processors of O(log n)-bit word size; here p(n) and t(n) are the number of processors and the time needed by any deterministic algorithm that stably sorts n (log n)-bit numbers (integer sorting) on the same type of PRAM as used by the radix sorting algorithm. The proposed algorithm has the same factor of inefficiency (if any) as the integer sorting algorithm it uses.
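The building block the abstract relies on is a stable integer sort applied digit by digit. As a minimal sequential sketch (the paper's parallel scheme reduces the number of rounds to O(log m); this naive version uses m/digit_bits passes):

```python
def stable_digit_sort(keys: list[int], shift: int, bits: int) -> list[int]:
    """Stable bucket sort on a (bits)-wide digit -- stands in for the
    parallel stable integer-sorting subroutine the paper assumes."""
    buckets = [[] for _ in range(1 << bits)]
    for k in keys:
        buckets[(k >> shift) & ((1 << bits) - 1)].append(k)
    return [k for bucket in buckets for k in bucket]

def radix_sort(keys: list[int], m: int, digit_bits: int) -> list[int]:
    """LSD radix sort of m-bit keys: one stable pass per digit, least
    significant digit first.  Correctness relies on the per-digit sort
    being stable."""
    for shift in range(0, m, digit_bits):
        keys = stable_digit_sort(keys, shift, digit_bits)
    return keys
```

Stability of each pass is what lets earlier (lower-order) passes survive later ones, which is exactly why the abstract insists on *stable* integer sorting as the subroutine.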
Impact of multiple consumption channels on wormhole routed k-ary n-cube networks
S. Balakrishnan, D. Panda
Pub Date: 1993-04-13. DOI: 10.1109/IPPS.1993.262874
This paper presents a performance evaluation of multiple consumption channels in wormhole-routed k-ary n-cube networks. Hotspots produced by non-uniform traffic patterns result in a consumption bottleneck, whose effects are examined. The interplay between the number of consumption channels, the underlying routing algorithm, and the topology is examined from the perspective of overall network performance. Two communication patterns, all-to-one and non-uniform traffic, are used in the study. The authors show that the severity of the consumption bottleneck increases with the degree of adaptiveness in the routing algorithm, i.e., going from oblivious to partially to fully adaptive routing. They conclude that multiple consumption channels (up to 4 for 2D, 3D, and 4D meshes and up to 8 for an 8-cube) are desirable to reduce the severity of this bottleneck and to exploit the advantages of adaptive routing schemes.
Global semigroup operations in faulty SIMD hypercubes
C. Raghavendra, M. Sridhar
Pub Date: 1993-04-13. DOI: 10.1109/IPPS.1993.262794
The authors consider the problem of computing a global semigroup operation (such as addition or multiplication) on a faulty hypercube. In particular, they study the problem of performing such an operation in an n-dimensional SIMD hypercube Q_n with up to n-1 node and/or link faults. In an SIMD hypercube, during a communication step, nodes can exchange information with their neighbors only across a specific dimension. Given a set of at most n-1 faults, they develop an ordering d_1, d_2, ..., d_n of the n dimensions, depending on where the faults are located. An important and useful property of this dimension ordering is the following: if the n-cube is partitioned into k-subcubes using the first k dimensions of this ordering, namely d_1, d_2, ..., d_k for any 1 <= k <= n, then each k-subcube in the partition contains at most k-1 faults. They use this result to develop algorithms for computing a global sum. The ordering can be obtained in the presence of node as well as link faults. They also consider larger fault sizes and show how to extend the dimension-ordering theorem to handle up to (n choose 2) faults. Using this result, it seems possible to obtain even more fault-tolerant algorithms for the semigroup operation problem.
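The dimension-by-dimension exchange pattern the abstract describes is easiest to see in the fault-free case. A minimal sketch (illustrative only; the paper's contribution is choosing the *order* of dimensions so that each k-subcube of the induced partition contains at most k-1 faults, whereas this baseline simply sweeps dimensions in index order):

```python
def hypercube_global_sum(values: list[int]) -> list[int]:
    """Fault-free baseline for a global semigroup operation (here, sum)
    on an SIMD hypercube: one exchange step per dimension.  After
    sweeping all n dimensions, every node holds the global sum."""
    n_nodes = len(values)           # must be a power of two
    n = n_nodes.bit_length() - 1    # hypercube dimension
    vals = list(values)
    for d in range(n):
        # SIMD step: every node exchanges with its neighbor across
        # dimension d and accumulates the neighbor's value.
        vals = [vals[v] + vals[v ^ (1 << d)] for v in range(n_nodes)]
    return vals
```

Any associative, commutative operation can replace `+`; the fault-tolerant algorithms route around faulty nodes while preserving this per-dimension structure.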
A high speed dataflow processing element and its performance compared to a von Neumann mainframe
J. N. Coleman
Pub Date: 1993-04-13. DOI: 10.1109/IPPS.1993.262851
The Event Processor/3 is a dataflow processing element designed for high performance over a range of general computing tasks. Using a multithreading technique, program parallelism is exploited by interleaving threads onto successive pipeline stages. It may also be used as an element in a multiprocessor system. This paper describes the philosophy and design of the machine and presents the results of detailed simulations of the performance of a single processing element. Performance is analysed into three factors: clock period, cycles per instruction, and instructions per program; each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the dataflow processor compares favourably, given a reasonable degree of parallelism in the program.
Permutation on the mesh with reconfigurable bus: algorithms and practical considerations
Yen-Wen Lu, J. Burr, A. Peterson
Pub Date: 1993-04-13. DOI: 10.1109/IPPS.1993.262896
Permutation is a common problem in both computation and communication. The authors add buses to mesh-connected multiprocessors and introduce tokens to control them. They propose using a mesh with a segmented reconfigurable bus to increase the performance of data routing. A segmented reconfigurable bus not only uses the bus token more efficiently than a traditional bus, but also reduces interconnection delay. The authors choose the segment length to balance the latency and throughput of the system for better performance. In simulation, the mesh with segmented reconfigurable bus completes an N*N permutation in 0.6065N steps on average.
Reconfiguration of binary trees in faulty hypercubes
P. Yang, C. Raghavendra
Pub Date: 1993-04-13. DOI: 10.1109/IPPS.1993.262916
The authors present a distributed scheme for the reconfiguration of embedded binary trees in hypercubes. Their scheme can reconfigure around any 3n/2 faulty nodes in O(n) time in an n-dimensional hypercube. Their technique, based on a key concept called degree of occupancy, can be generalized to any task graph.
Experimental evidence for the power of random sampling in practical parallel algorithms
M. R. Ghouse, M. Goodrich
Pub Date: 1993-04-13. DOI: 10.1109/IPPS.1993.262819
Recent results in parallel algorithm theory have shown random sampling to be a powerful technique for achieving efficient bounds on the expected asymptotic running time of parallel algorithms for a number of important problems. The authors show experimentally that randomization is also a powerful practical technique in the design and implementation of parallel algorithms. Random sampling can be used to design parallel algorithms with fast expected running times that meet or beat those of more conventional methods on a variety of benchmark tests. The constant factors of proportionality in the running times are small, and, most importantly, the expected work (and hence running time) avoids worst cases due to input distribution. The approach is justified through experimental results obtained on a Connection Machine CM-2 for a specific problem, namely segment intersection reporting, and the effect of varying the parameters of the method is explored.
Mapping to reduce contention in multiprocessor architectures
L. Schwiebert, D. Jayasimha
Pub Date: 1993-04-13. DOI: 10.1109/IPPS.1993.262889
Reducing communication overhead is widely recognized as a requirement for achieving efficient mappings that substantially reduce the execution time of parallel algorithms. This paper presents an iterative heuristic for the static mapping of parallel algorithms to architectures. Special attention is given to measuring and reducing channel contention. Experimental results show the effects of channel contention for packet-switched networks and the improvement realized by the authors' heuristic. Preliminary results for wormhole-routed networks are also presented.