CMMD I/O: a parallel Unix I/O
Pub Date: 1993-04-13 · DOI: 10.1109/IPPS.1993.262828
Michael L. Best, A. Greenberg, C. Stanfill, L. W. Tucker
The authors propose a library providing Unix file system support for highly parallel distributed-memory computers. CMMD I/O supports Unix I/O commands on the CM-5 supercomputer. The overall objective of the library is to provide the node-level parallel programmer with routines for opening, reading, and writing files, and so forth. The default behavior mimics standard Unix running on each node: individual nodes can independently perform file system operations. New extensions to the standard Unix file descriptor semantics provide for cooperative parallel I/O, and new functions provide access to very large (multi-gigabyte) files.
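As a conceptual illustration of the two access disciplines described above, here is a minimal Python sketch. All names in it are hypothetical: the real CMMD I/O library is a C API, and none of its actual calls are reproduced here.

```python
# Conceptual sketch only: hypothetical names, not the real CMMD I/O calls.
# In "independent" mode every node keeps its own offset and sees the whole
# file, as under standard Unix; in a cooperative mode the nodes coordinate
# so one logical read is spread across them in node-rank order.
def node_read_range(node_id: int, offset: int, size: int, mode: str):
    """Byte range node `node_id` would read for a read(size) call."""
    if mode == "independent":
        return (offset, offset + size)        # each node reads for itself
    if mode == "cooperative":
        start = offset + node_id * size       # file carved across nodes
        return (start, start + size)
    raise ValueError(f"unknown mode: {mode}")

for node in range(4):                         # 4 nodes, 1 KB per node
    print(node, node_read_range(node, 0, 1024, "cooperative"))
```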
{"title":"CMMD I/O: a parallel Unix I/O","authors":"Michael L. Best, A. Greenberg, C. Stanfill, L. W. Tucker","doi":"10.1109/IPPS.1993.262828","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262828","url":null,"abstract":"The authors propose a library providing Unix file system support for highly parallel distributed-memory computers. CMMD I/O supports Unix I/O commands on the CM-5 supercomputer. The overall objective of the library is to provide the node level parallel programmer with routines for opening, reading, writing a file, and so forth. The default behavior mimics standard Unix running on each node; individual nodes can independently perform file system operations. New extensions to the standard Unix file descriptor semantics provide for co-operative parallel I/O. New functions provide access to very large (multi-gigabyte) files.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115161131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Barrier synchronization in distributed-memory multiprocessors using rendezvous primitives
Pub Date: 1993-04-13 · DOI: 10.1109/IPPS.1993.262826
S. Gupta, D. Panda
This paper deals with barrier synchronization in wormhole-routed distributed-memory multiprocessors. New rendezvous and multirendezvous synchronization primitives are proposed to implement a barrier between two processors and among multiple processors, respectively. These primitives reduce the number of communication steps required to implement a barrier, significantly reducing the synchronization overhead for networks with high communication start-up cost. Two algorithms for barrier synchronization on k-ary n-cube networks are presented. The rendezvous primitive allows one to synchronize all processors in n log_2(k) steps. The multirendezvous primitive allows one to synchronize an arbitrary subset of processors in an optimal number of communication steps, depending on the ratio of the communication start-up cost (t_s) to the link-propagation cost (t_p).
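The n log_2(k) step count can be illustrated with a standard dissemination pattern: pairwise signalling with doubling distances completes among k participants in ceil(log_2 k) rounds, repeated once per dimension of the k-ary n-cube. A minimal Python sketch of the counting argument follows; it illustrates the step count only, not the paper's rendezvous protocol.

```python
import math

def dissemination_rounds(k: int) -> int:
    """Count rounds until every one of k nodes has heard that all arrived,
    when in round r node i signals node (i + 2**r) mod k."""
    known = [{i} for i in range(k)]       # who each node knows has arrived
    rounds = 0
    while any(len(s) < k for s in known):
        nxt = [set(s) for s in known]
        for i in range(k):
            nxt[(i + 2 ** rounds) % k] |= known[i]   # one signal per node
        known = nxt
        rounds += 1
    return rounds

for k, n in [(2, 10), (4, 5), (8, 4)]:    # 1024, 1024, 4096 processors
    per_dim = dissemination_rounds(k)
    assert per_dim == math.ceil(math.log2(k))
    print(f"k={k}, n={n}: {n * per_dim} barrier steps for {k**n} processors")
```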
{"title":"Barrier synchronization in distributed-memory multiprocessors using rendezvous primitives","authors":"S. Gupta, D. Panda","doi":"10.1109/IPPS.1993.262826","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262826","url":null,"abstract":"This paper deals with barrier synchronization in wormhole routed distributed-memory multiprocessors. New rendezvous and multirendezvous synchronization primitives are proposed to implement a barrier between two and multiple processors, respectively. These primitives reduce the number of communication steps required to implement a barrier; thus, significantly reducing the synchronization overhead for networks with high communication start-up cost. Two algorithms for barrier synchronization on k-ary n-cube networks are presented. The rendezvous primitive allows one to synchronize all processors in nlog/sub 2/(k) steps. The multirendezvous primitive allows one to synchronize an arbitrary subset of processors in optimal number of communication steps depending on the ratio of the communication start-up (t/sub s/) to the link-propagation (t/sub p/) cost.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132120287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Image processing with the MGAP: a cost effective solution
Pub Date: 1993-04-13 · DOI: 10.1109/IPPS.1993.262835
R. Bajwa, R. Owens, M. J. Irwin
Image processing applications are natural candidates for parallelism and have, at least in part, motivated the design and development of some of the pioneering massively parallel processing systems, including the CLIP family, the DAP, the MPP, and the GAPP. By exploiting design techniques and architectures suited to VLSI technology, one can now build hardware that provides comparable performance at a fraction of the cost of these earlier designs. The authors describe the use of a fine-grained, massively parallel VLSI processor array, the Micro-Grained Array Processor (MGAP), for image processing applications. The array and its support systems, in their current configuration, are designed to be used as a co-processor board in a desktop workstation. The array can also be used for applications other than image processing. The versatility of the array and the single-board design provide a cost-effective solution for a variety of parallelizable tasks.
{"title":"Image processing with the MGAP: a cost effective solution","authors":"R. Bajwa, R. Owens, M. J. Irwin","doi":"10.1109/IPPS.1993.262835","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262835","url":null,"abstract":"Image processing applications are suitable candidates for parallelism and have at least in part motivated the design and development of some of the pioneering massively parallel processing systems including the CLIP family, the DAP, the MPP and the GAPP. By exploiting design techniques and architectures suitable for VLSI technology one can now build hardware which provides comparable performance at a fraction of the cost it took for these earlier designs. The authors describe the use of a fine-grained, massively parallel VLSI processor array, the Micro-Grained Array Processor (MGAP) for image processing applications. The array and its support systems, in their current configuration, are designed to be used as a co-processor board in a desk-top workstation. The array can be used for applications other than image processing as well. The versatility of the array and the single broad design provide a cost effective solution for a variety of parallelizable tasks.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122914100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction
Pub Date: 1993-04-13 · DOI: 10.1109/IPPS.1993.262814
B. Kumar, Chua-Huang Huang, Rodney W. Johnson, P. Sadayappan
A programming methodology based on tensor products has been used for designing and implementing block recursive algorithms for parallel and vector multiprocessors. A previous tensor product formulation of Strassen's matrix multiplication algorithm requires working arrays of size O(7^n) for multiplying 2^n × 2^n matrices. The authors present a modified tensor product formulation of Strassen's algorithm in which the size of the working arrays can be reduced to O(4^n). The modified formulation exhibits sufficient parallel and vector operations for efficient implementation. Performance results on the Cray Y-MP are presented.
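A quick worked reading of the memory claim: n recursion levels of Strassen generate 7^n subproblems, so a formulation that materializes every level needs O(7^n) working storage, while O(4^n) = O((2^n)^2) is proportional to the input matrix itself. A small Python illustration (not the authors' code):

```python
# Worked illustration of the memory claim above (not the authors' code):
# 7**n counts the subproblems spawned by n full Strassen levels, while
# 4**n == (2**n)**2 is just the element count of one input matrix.
for n in (8, 10, 12):
    elems = (2 ** n) ** 2
    print(f"n={n}: one matrix has {elems:,} elements; "
          f"7^n = {7**n:,} vs 4^n = {4**n:,} working elements")
```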
{"title":"A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction","authors":"B. Kumar, Chua-Huang Huang, Rodney W. Johnson, P. Sadayappan","doi":"10.1109/IPPS.1993.262814","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262814","url":null,"abstract":"A programming methodology based on tensor products has been used for designing and implementing block recursive algorithms for parallel and vector multiprocessors. A previous tensor product formulation of Strassen's matrix multiplication algorithm requires working arrays of size O(7/sup n/) for multiplying 2/sup n/*2/sup n/ matrices. The authors present a modified tensor product formulation of Strassen's algorithm in which the size of working arrays can be reduced to O(4/sup n/). The modified formulation exhibits sufficient parallel and vector operations for efficient implementation. Performance results on the Cray Y-MP are presented.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132027539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Complexity of intensive communications on balanced generalized hypercubes
Pub Date: 1993-04-13 · DOI: 10.1109/IPPS.1993.262914
J. Antonio, L. Lin, R. C. Metzger
Lower-bound complexities are derived for three intensive communication patterns assuming a balanced generalized hypercube (BGHC) topology. The BGHC is a generalized hypercube that has exactly w nodes along each of its d dimensions, for a total of w^d nodes. A BGHC is said to be dense if the w nodes along each dimension form a complete directed graph, and sparse if they form a unidirectional ring. It is shown that a dense N-node BGHC with a node degree equal to K log_2 N, where K ≥ 2, can process certain intensive communication patterns K(K-1) times faster than an N-node binary hypercube (which has a node degree equal to log_2 N). Furthermore, a sparse N-node BGHC with a node degree equal to (1/L) log_2 N, where L ≥ 2, is 2^L times slower at processing certain intensive communication patterns than an N-node binary hypercube.
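A numeric reading of the stated trade-off, with the speed factors K(K-1) and 2^L taken from the abstract and the node degrees from the stated formulas (illustration only, not the paper's derivation):

```python
import math

# Degree vs. speed for a fixed machine size; factors quoted from the abstract.
N = 2 ** 12                               # 4096 nodes for every network
hypercube_degree = int(math.log2(N))      # binary hypercube: degree 12

for K in (2, 3):                          # dense BGHC, degree K*log2(N)
    print(f"dense BGHC:  degree {K * hypercube_degree}, "
          f"{K * (K - 1)}x faster than the degree-{hypercube_degree} hypercube")
for L in (2, 3):                          # sparse BGHC, degree log2(N)/L
    print(f"sparse BGHC: degree {hypercube_degree // L}, "
          f"{2 ** L}x slower than the degree-{hypercube_degree} hypercube")
```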
{"title":"Complexity of intensive communications on balanced generalized hypercubes","authors":"J. Antonio, L. Lin, R. C. Metzger","doi":"10.1109/IPPS.1993.262914","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262914","url":null,"abstract":"Lower bound complexities are derived for three intensive communication patterns assuming a balanced generalized hypercube (BGHC) topology. The BGHC is a generalized hypercube that has exactly w nodes along each of the d dimensions for a total of w/sup d/ nodes. A BGHC is said to be dense if the w nodes along each dimension form a complete directed graph. A BGHC is said to be sparse if the w nodes along each dimension form a unidirectional ring. It is shown that a dense N node BGHC with a node degree equal to Klog/sub 2/N, where K>or=2, can process certain intensive communication patterns K(K-1) times faster than an N node binary hypercube (which has a node degree equal to log/sub 2/N). Furthermore, a sparse N node BGHC with a node degree equal to /sup 1///sub L/log/sub 2/N, where L>or=2, is 2/sup L/ times slower at processing certain intensive communication patterns than an N node binary hypercube.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129083736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Automatic parallelization of LINPACK routines on distributed memory parallel processors
Pub Date: 1993-04-13 · DOI: 10.1109/IPPS.1993.262774
M. Neeracher, R. Rühl
Distributed-memory parallel processors (DMPPs) have no hardware support for a global address space. However, conventional programs written in a sequential imperative language such as Fortran typically manipulate a few large arrays. The Oxygen compiler, developed as part of the K2 project, accepts conventional Fortran code augmented with code- and data-distribution directives. These directives support a global name space through a run-time mechanism called data consistency analysis. Many sequential Fortran programs can be efficiently parallelized with Oxygen directives introduced manually by the user into the sequential code. This work presents an analysis pass, added to the compiler, that suggests directives to be inserted into the code. Automatic parallelization of LINPACK routines was attempted, and results are given.
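The core job of such distribution directives is to fix which node owns which array element. Below is a generic Python sketch of the two standard mappings that distribution directives typically choose between; Oxygen's actual directive syntax is not reproduced here.

```python
# Generic illustration (not Oxygen's syntax): a data-distribution directive
# must pin down which node owns which array element. Block and cyclic
# mappings are the standard choices.
def block_owner(i: int, n_elems: int, n_nodes: int) -> int:
    """Contiguous blocks: node p owns elements [p*b, (p+1)*b)."""
    b = -(-n_elems // n_nodes)            # ceil(n_elems / n_nodes)
    return i // b

def cyclic_owner(i: int, n_nodes: int) -> int:
    """Round-robin: element i lives on node i mod n_nodes."""
    return i % n_nodes

print([block_owner(i, 16, 4) for i in range(16)])   # [0,0,0,0,1,1,1,1,...]
print([cyclic_owner(i, 4) for i in range(16)])      # [0,1,2,3,0,1,2,3,...]
```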
{"title":"Automatic parallelization of LINPACK routines on distributed memory parallel processors","authors":"M. Neeracher, R. Rühl","doi":"10.1109/IPPS.1993.262774","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262774","url":null,"abstract":"Distributed memory parallel processors (DMPPs) have no hardware support for a global address space. However, conventional programs written in a sequential imperative language such as Fortran typically manipulate few, large arrays. The Oxygen compiler, developed as part of the K2 project, accepts conventional Fortran code, augmented with code and data distribution directives. These directives support a global name space through a run-time mechanism called data consistency analysis. Many sequential Fortran programs can be efficiently parallelized, with Oxygen directives introduced manually by the user into the sequential code. This work presents an analysis pass added to the compiler that makes suggestions for the directives to be inserted into the code. Automatic parallelization of LINPACK routines was attempted and results are given.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133195735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Delay analysis in synchronous circuit-switched delta networks
Pub Date: 1993-04-13 · DOI: 10.1109/IPPS.1993.262801
A. Bhattacharya, R. R. Rao, Ting-Ting Y. Lin
Multistage interconnection networks (MINs) provide a cost-effective alternative to a full crossbar connection for processor-processor or processor-memory communication in a tightly coupled multiprocessor system. Delta networks, a class of blocking MIN with the unique-path property, have been studied extensively for their self-routing capability. A probabilistic analysis of blocking and its effect on delay is presented here for such a network operated in a synchronous circuit-switched mode. Under the assumption of uniformly distributed access requests generated independently at each unblocked source, an upper bound on the expected latency is established and compared with simulation results.
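For context on how blocking compounds stage by stage in an unbuffered delta network, the classical throughput recurrence due to Patel can be sketched in a few lines; this illustrates the setting only and is not the latency bound derived in the paper.

```python
def delta_acceptance(p: float, a: int, stages: int) -> float:
    """Request rate surviving `stages` stages of a delta network built from
    a x a crossbars, starting from input request rate p per cycle."""
    for _ in range(stages):
        p = 1.0 - (1.0 - p / a) ** a   # rate on each output of one stage
    return p

# 64-input network of 2x2 switches (log2(64) = 6 stages), saturated inputs:
print(f"acceptance probability ~ {delta_acceptance(1.0, 2, 6):.3f}")
```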
{"title":"Delay analysis in synchronous circuit-switched delta networks","authors":"A. Bhattacharya, R. R. Rao, Ting-Ting Y. Lin","doi":"10.1109/IPPS.1993.262801","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262801","url":null,"abstract":"Multistage interconnection networks (MINs) provide a cost-effective alternative to a full crossbar connection for processor-processor or processor-memory communication in a tightly coupled multiprocessor system. Delta networks, a class of blocking type MIN with unique path property, have been studied extensively for their self-routing capability. A probabilistic analysis of the blocking and its effect on the delay is presented here, for such a network operated in a synchronous circuit-switched mode. Under the assumption of uniformly distributed access requests independently generated at each unblocked source, an upper bound on the expected latency has been established. The bound has been compared with simulation results.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"419 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133517066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Supporting insertions and deletions in striped parallel filesystems
Pub Date: 1993-04-13 · DOI: 10.1109/IPPS.1993.262921
T. Johnson
The dramatic improvements in the processing rates of parallel computers are turning many compute-bound jobs into I/O-bound jobs. Parallel file systems have been proposed to better match I/O throughput to processing power. Many parallel file systems stripe files across numerous disks, each with its own controller. A striped file can be appended (or prepended) to while maintaining its structure; however, a block cannot be inserted into or deleted from the middle of the file, since this would destroy the round-robin striping structure. The author presents a distributed file structure that maintains files as indexed striped extents on a message-passing multiprocessor. This approach allows highly parallel random and sequential reads while also allowing insertion and deletion in the middle of the file.
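A minimal sketch of the idea, with a hypothetical structure that is not the author's exact design: keeping the file as an ordered list of extents, each striped round-robin across the disks internally, means a mid-file insertion only splits one extent rather than re-striping everything after it.

```python
# Hypothetical sketch, not the author's exact design: a file as an ordered
# list of extents; striping is round-robin *within* each extent, so reads
# stay parallel and a mid-file insert just splits one extent.
from dataclasses import dataclass

@dataclass
class Extent:
    blocks: list                         # logical blocks held by this extent

    def disk_of(self, i: int, n_disks: int) -> int:
        return i % n_disks               # round-robin within the extent only

class StripedFile:
    def __init__(self):
        self.extents: list = []

    def append(self, blocks):
        self.extents.append(Extent(list(blocks)))

    def insert(self, pos: int, blocks):
        """Insert blocks at logical position pos by splitting one extent."""
        off = pos
        for k, ext in enumerate(self.extents):
            if off <= len(ext.blocks):
                left, right = ext.blocks[:off], ext.blocks[off:]
                self.extents[k:k + 1] = [Extent(left), Extent(list(blocks)),
                                         Extent(right)]
                return
            off -= len(ext.blocks)
        self.append(blocks)              # pos past EOF: behave like append

f = StripedFile()
f.append(range(8))
f.insert(4, ["new"])
print([e.blocks for e in f.extents])     # [[0..3], ['new'], [4..7]]
```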
{"title":"Supporting insertions and deletions in striped parallel filesystems","authors":"T. Johnson","doi":"10.1109/IPPS.1993.262921","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262921","url":null,"abstract":"The dramatic improvements in the processing rates of parallel computers are turning many compute-bound jobs into IO-bound jobs. Parallel file systems have been proposed to better match IO throughput to processing power. Many parallel file systems stripe files across numerous disks; each disk has its own controller. A striped file can be appended (or prepended) to and maintain its structure. However, a block can't be inserted into or deleted from the middle of the file, since this would destroy the round robin striping structure of the file. The author presents a distributed file structure that maintains files in indexed striped extents on a message passing multiprocessor. This approach allows highly parallel random and sequential reads, and also allows insertion and deletion into the middle of the file.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114065302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Efficient off-line routing of permutations on restricted access expanded delta networks
Pub Date: 1993-04-13 · DOI: 10.1109/IPPS.1993.262894
I. Scherson, R. Subramanian
This paper presents an off-line algorithm for routing permutations on expanded delta networks (EDNs) with restricted access. Restricted access means that the number of elements to be permuted may exceed the number of inputs to the EDN. For every N-element permutation on an M-input EDN, the algorithm computes a routing that takes exactly 3N/M passes (assuming M divides N). On a certain class of EDNs, the number of passes can be reduced to 2N/M. For example, for every 16K-element permutation on the 1K-input global network of the MasPar MP-1 and MP-2, the algorithm computes a routing that takes exactly 32 passes. The time complexity of the algorithm is Theta(N log N) sequentially and Theta(log^2 N) on an N-processor PRAM.
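A quick arithmetic check of the MasPar example: N = 16K elements through an M = 1K-input network gives 3N/M = 48 passes in general, and the MP-1/MP-2 global router falls in the class admitting 2N/M = 32 passes, matching the figure quoted.

```python
# The two pass counts quoted above, computed from N and M.
N, M = 16 * 1024, 1024
print("general EDN  :", 3 * N // M, "passes")   # 48
print("reduced class:", 2 * N // M, "passes")   # 32
```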
{"title":"Efficient off-line routing of permutations on restricted access expanded delta networks","authors":"I. Scherson, R. Subramanian","doi":"10.1109/IPPS.1993.262894","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262894","url":null,"abstract":"This paper presents an off-line algorithm for routing permutations on expanded delta networks (EDNs) with restricted access. Restricted access means that the number of elements to be permuted may exceed the number of inputs to the EDN. For every N-element permutation on an M-input EDN, the algorithm computes a routing that takes exactly 3N/M passes (assuming M divides N). On a certain class of EDNs, the number of passes can be reduced to 2N/M. For example, for every 16 K-element permutation on the 1 K-input global network of the MasPar MP-1 and MP-2, the algorithm computes a routing that takes exactly 32 passes. The time complexity of the algorithm is Theta (NlogN) sequentially, and Theta (log/sup 2/N) on an N-processor PRAM.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116075581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

The connection cubes: symmetric, low diameter interconnection networks with low node degree
Pub Date: 1993-04-13 · DOI: 10.1109/IPPS.1993.262892
Nitin K. Singhvi
The enhanced connection cube (ECC) and the minimal connection cube (MCC), proposed in this paper, are regular, symmetric, static interconnection networks for large-scale, loosely coupled systems. The ECC connects 2^(2n+1) processing nodes with only n+2 links per node, almost half the number used in a comparable hypercube, yet its diameter is only n+2, almost half that of the hypercube. The MCC connects 2^(2n+1) nodes using only n+1 links per node, has about the same diameter as a hypercube, and is scalable like the hypercube. The MCC can be converted into the ECC by adding one more link per node. Both networks can emulate all the connections present in a hypercube of the same size with no increase in routing complexity, so typical parallel applications run on both types of connection cubes with the same time complexity as on a hypercube.
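A numeric illustration of the quoted figures beside a binary hypercube on the same node count (a hypercube on 2^(2n+1) nodes has degree and diameter 2n+1):

```python
# Degree/diameter figures quoted above, next to a binary hypercube on the
# same 2**(2n+1) nodes; the MCC diameter is only "about" the hypercube's,
# so no exact value is printed for it.
for n in (4, 6, 8):
    nodes = 2 ** (2 * n + 1)
    h = 2 * n + 1
    print(f"{nodes:>7} nodes | hypercube deg/diam {h:>2} | "
          f"ECC deg/diam {n + 2:>2} | MCC deg {n + 1}")
```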
{"title":"The connection cubes: symmetric, low diameter interconnection networks with low node degree","authors":"Nitin K. Singhvi","doi":"10.1109/IPPS.1993.262892","DOIUrl":"https://doi.org/10.1109/IPPS.1993.262892","url":null,"abstract":"The enhanced connection cube or ECC and the minimal connection cube or MCC, proposed in this paper, are regular and symmetric static interconnection networks for large-scale, loosely coupled systems. The ECC connects 2/sup 2n+1/ processing nodes with only n+2 links per node, almost half the number used in a comparable hypercube. Yet its diameter is only n+2, almost half that of the hypercube. The MCC connects 2/sup 2n+1/ nodes using only n+1 links per node, has about the same diameter as a hypercube and is scalable like the hypercube. The MCC can be converted into the ECC by adding one more link per node. Both networks can emulate all the connections present in a hypercube of the same size, with no increase in routing complexity, so that typical parallel applications run on both types of CCs with the same time complexity as on a hypercube.<<ETX>>","PeriodicalId":248927,"journal":{"name":"[1993] Proceedings Seventh International Parallel Processing Symposium","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1993-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121379734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}