A Symmetrical Communication Interface for Distributed-Memory Computers
Pub Date: 1991-04-28 | DOI: 10.1109/DMCC.1991.633140
Peter Steenkiste
Applications have very diverse communication requirements. Although individual algorithms often use regular communication patterns, there is little regularity across applications or even across different phases of the same application. For this reason, a low-level communication interface should support the unrestricted, reliable exchange of variable-length messages. […] For example, both sends and receives can operate on both local and remote buffers. Although this communication model does not correspond directly to the low-level communication primitives supported by the hardware, it can be implemented efficiently, and it gives users more control over how and when transfers over the network take place. The interface is the lowest-level communication interface for the Nectar multicomputer.
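A minimal runnable sketch of what such a symmetric interface might look like, where one primitive can name local and remote buffers on either side. All names (node_mem, buf_ref, sym_copy) are invented for illustration and simulate two nodes in one address space; this is not the actual Nectar API.

```c
/* Hypothetical symmetric transfer primitive: both source and destination
 * are (node, offset) references, so "send" and "receive" are the same
 * operation viewed from different ends. Simulated in-process. */
#include <stdio.h>
#include <string.h>

#define NODES 2
#define MEM   64

static char node_mem[NODES][MEM];       /* one "memory" per simulated node */

typedef struct { int node; int offset; } buf_ref;

static void sym_copy(buf_ref dst, buf_ref src, int len) {
    memcpy(&node_mem[dst.node][dst.offset],
           &node_mem[src.node][src.offset], len);
}

int main(void) {
    strcpy(node_mem[0], "hello from node 0");
    buf_ref src = {0, 0}, dst = {1, 0};
    sym_copy(dst, src, 18);             /* write into a remote buffer */
    printf("node 1 sees: %s\n", node_mem[1]);
    return 0;
}
```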
{"title":"A Symmetrical Communication Interface for Distributed-Memory Computers","authors":"Peter Steenkiste","doi":"10.1109/DMCC.1991.633140","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633140","url":null,"abstract":"example, both sends and receives can operate on both Applications have very diverse communication local and remote buffers. Although this communication requirements. Although individual algorithms often use model does not correspond directly to the low-level regular communication patterns, there is little regularity communication primitives supported by the hardware, it across applications or even across different phases of the can be implemented efficiently, and it gives the users same application. For this reason, a low-level more control over how and when transfers over the communication interface should support the unrestricted, network takes place. The interface is the lowest-level reliable exchange of variable-length messages. communication interface for the Nectar multicomputer.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131637154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mapping Techniques for Parallel 3D Coronary Arteriography
Pub Date: 1991-04-28 | DOI: 10.1109/DMCC.1991.633344
A. Sarwal, F. Ozguner, J. Ramanathan
The paper investigates schemes for implementing the 3D reconstruction of the coronary arteries on a MIMD system. The performance of the system for calculating the 3D description of the arterial tree is related to the mapping strategy selected. The image processing algorithms can be parallelized to provide favorable performance for the complete computation cycle. Results are provided for two mapping approaches for an X-ray image, and an extension is proposed for the multiview case.
{"title":"Mapping Techniques for Parallel 3D Coronary Arteriography","authors":"A. Sarwal, F. Ozguner, J. Ramanathan","doi":"10.1109/DMCC.1991.633344","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633344","url":null,"abstract":"The paper investigates schemes f o r implementing the 3 0 reconstruction of the Coronary Ar te r i e s o n a! MIMD sys t em. The performance of ihe: s y s t em f o r calculating the 3 0 descript ion of the uri'erial tree i s redated t o the mapping strategy selecte,d. The image processing algorithms can be parallelized t o provide fa-, vorable performance for the complete computat ion cy-, d e . Results are provided for t w o mappzng approaches; for an X r a y image, and an extension is proposed f o r the mult iv iew case.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134078324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic Support for Data Distribution
Pub Date: 1991-04-28 | DOI: 10.1109/DMCC.1991.633085
B. Chapman, H. Herbeck, H. Zima
In current automatic parallelization systems for distributed-memory machines, the user must explicitly specify how the data domain of the sequential program is to be partitioned and mapped to the processors. In this paper, we outline the salient features of a new knowledge-based software tool that provides automatic support for this task. The basic guidelines for the design of the tool are discussed, and its major components are described.
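To make the specification task concrete, here is a generic sketch of the two classic ways a 1-D data domain of n elements is mapped onto p processors, BLOCK and CYCLIC. This is standard terminology, not the tool's actual input language.

```c
/* Owner-computes mappings: which processor owns global element i? */
#include <stdio.h>

static int block_owner(int i, int n, int p)  /* contiguous chunks */
{ int b = (n + p - 1) / p; return i / b; }

static int cyclic_owner(int i, int p)        /* round-robin */
{ return i % p; }

int main(void) {
    int n = 10, p = 4;
    for (int i = 0; i < n; i++)
        printf("elem %2d -> block: P%d  cyclic: P%d\n",
               i, block_owner(i, n, p), cyclic_owner(i, p));
    return 0;
}
```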
{"title":"Automatic Support for Data Distribution","authors":"B. Chapman, H. Herbeck, H. Zima","doi":"10.1109/DMCC.1991.633085","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633085","url":null,"abstract":"A bst rac t I n current automatic parallelizlation systems for distributed-memory machines, the user must explicitly specify how the d a t a domain of the :iequential program is t o be partitioned and mapped to the processors. I n this paper, we outline the salient features of a new knowledge-based software tool that provides automatic support f o r this task. The basic guidelines f o r the design of the tool are discussed, and its major components are described.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134417442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Approximate Analysis of the Binary d-Cube Network
Pub Date: 1991-04-28 | DOI: 10.1109/DMCC.1991.633310
D.S. Holtsinger, E. Gehringer
Distributed memory computers require efficient, high-bandwidth networks to support fine-grain computation. In developing an analytic model for a network, the underlying details of the architecture are often abstracted to simplify the model and to facilitate a comparison with other networks. As a result it becomes difficult to compare the relative merits of the architectural features in a particular network. In this paper we present a detailed analysis of the binary d-cube network. Our model has been shown to provide results that are very similar to those derived from a detailed simulation model. Among other things, our analysis shows that small increases in routing latency can significantly degrade throughput, but do not degrade the probability of acceptance of a message. It also shows that just a few buffers on heavily congested destination links can improve performance greatly, almost as much as buffering on all destination links.
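For orientation, the standard back-of-envelope quantities such cube analyses start from (these are textbook figures, not the paper's model): under uniform random traffic the average path length in a binary d-cube is d/2, which bounds the per-node accepted load for a node with d channels of bandwidth b each.

```latex
\[
\bar{h} = \frac{d}{2}, \qquad
\lambda_{\max} \le \frac{d \, b}{\bar{h}} = 2b .
\]
```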
{"title":"Approximate Analysis of the Binary d-Cube Network","authors":"D.S. Holtsinger, E. Gehringer","doi":"10.1109/DMCC.1991.633310","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633310","url":null,"abstract":"Distributed memory computers require eficient, high-bandwidth networks to support fine-grain computation. In developing an analytic model .for 0 network, the underlying details of the aschitectiure are often abstracted to simplify the model and to facilitate a comparison with other networks. As a rt!sult i t becomes dificult to compare the relative meids of the architectural features in a particularr network. In this paper we present a detailed analysis of the binary dcube network. Our model has been shown to provide results that are very similar to those derived from a detailed simulation model. Among other things, our analysis shows that srnall increases in routing latency can significantly degrade throughput, but does not degrade the probability of acceptance of a mqessage. It atso shows that just a few buffers 0ii heavily congested destination links can improve performance greatly, almost as much as bzqfering on all destinalion links.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130974227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Sounds of Parallel Programs
Pub Date: 1991-04-28 | DOI: 10.1109/DMCC.1991.633316
J. Francioni, J. A. Jackson, L. Albright
Portraying the behavior of parallel programs is useful in program debugging and performance tuning. For the most part, researchers have focused on finding ways to visualize what happens during a program's execution. As an alternative to visualization, auralization can also be used to portray the behavior of parallel programs. This paper investigates whether or not sound can be used effectively to depict different events that take place during a parallel program's execution. In particular, we focus this discussion on distributed-memory parallel programs. Three mappings of execution behavior to sound were studied. The first mapping tracks the load balance of the processors of a system. In the second mapping, the flows-of-control of the parallel processes are mapped to related sounds. The third mapping is related to process communication in a distributed-memory parallel program.
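A toy version of the first kind of mapping, invented for illustration (the paper's actual sound mappings are not specified here): assign each processor a pitch whose frequency rises with its load, so an imbalanced system produces a skewed chord.

```c
/* Map per-processor load in [0,1] onto a two-octave pitch range. */
#include <stdio.h>

static double load_to_hz(double load) {
    return 220.0 + load * (880.0 - 220.0);   /* 220 Hz (A3) .. 880 Hz (A5) */
}

int main(void) {
    double load[4] = {0.10, 0.95, 0.50, 0.12};   /* an imbalanced system */
    for (int p = 0; p < 4; p++)
        printf("P%d load %.2f -> %6.1f Hz\n", p, load[p], load_to_hz(load[p]));
    return 0;
}
```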
{"title":"The Sounds of Parallel Programs","authors":"J. Francioni, J. A. Jackson, L. Albright","doi":"10.1109/DMCC.1991.633316","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633316","url":null,"abstract":"Portraying the behavior of parallel programs is useful in pro8ram debuming and performance tuning. For the most part, researchers have focused on finding ways to visualize what happens during a program's execution. As an alternative to visualization, auralization can also be used to portray the behavior of parallel programs. This paper investigates whether or not sound can be used effectively t o depict dzferent events that take place during a parallel proBram's execution. In particular, we focus this discussion on distributedmemory parallel programs. Three mappings of execution behavior to sound were studied. ?'he first mapping tracks the load balance of the processors of a system. In the second mapping, the jlows-of-control of the parallel processes are mapped to related sounds. The third mapping is related t o process communication in a distributed-memory parallel program.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131090668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hypertasking Support for Dynamically Redistributable and Resizeable Arrays on the iPSC
Pub Date: 1991-04-28 | DOI: 10.1109/DMCC.1991.633086
M. Baber
Static allocations of arrays on multicomputers have two major shortcomings. First, algorithms often employ more than one reference pattern for a given array, resulting in the need for more than one mapping between the array elements and the multicomputer nodes. Secondly, it is desirable to provide easily resizeable arrays, especially for multigrid algorithms. This paper describes extensions to the hypertasking paracompiler which provide both dynamically resizeable and redistributable arrays. Hypertasking is a parallel programming tool that transforms C programs containing comment-directives into SPMD C programs that can be run on any size hypercube without recompilation for each cube size.

Introduction

This paper describes extensions to hypertasking [1], a domain decomposition tool that operates on comment-directives inserted into ordinary sequential C source code. The extensions support run-time redistribution and resizing of arrays. Hypertasking is one of several projects [4,5,6,8] that have proposed or produced source-to-source compilers for parallel architectures. I refer to this class of software tools as paracompilers to distinguish them from the sequential source-to-object compilers they are built upon. A fundamental question for paracompiler designers is whether to make decisions about data and control decomposition at compile-time or at run-time. If decisions are made at compile-time, the logic does not have to be repeated every time the program is executed and it is possible to optimize the code for known parameters. Unfortunately, compile-time decisions are also inflexible. Hypertasking makes all significant decisions about decomposition at run-time. A run-time initialization routine is called by each node to assign values to the members of an array definition structure. The C code generated by the paracompiler references the values in the structure instead of constants chosen at compile-time. The resulting code is surprisingly efficient. Furthermore, because it is relatively straightforward to change the decomposition variables in the array definition structure, run-time decomposition greatly facilitates the implementation of dynamic array resizing and redistribution features such as those described in this paper. This paper will begin with an overview of the hypertasking programming model to provide a framework for the new features. Beginning with redistributable arrays, the purpose and performance of the new features are discussed with reference to example programs. Finally, conclusions and goals for future research are presented.

Hypertasking overview

Hypertasking is designed to make it easy for software developers to port their existing data parallel applications to a m…

* Supported in part by: Defense Advanced Research Projects Agency, Information Science and Technology Office, Research in Concurrent Computing Systems, ARPA Order No. 6402, 6402-1; Program Code No. 8E20 & 9E20. Issued by DARPA/CMO under Contract #MDA-972-89-C-0034.
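A minimal sketch of the run-time decomposition idea described above: generated code indexes through an array-definition structure filled in at start-up, so redistributing or resizing only means refilling the structure, with no recompilation. Field and function names here are hypothetical, not hypertasking's actual ones.

```c
/* Run-time array descriptor for a block-distributed 1-D array. */
#include <stdio.h>

typedef struct {
    int n;        /* global array length                 */
    int nodes;    /* number of hypercube nodes           */
    int me;       /* this node's id                      */
    int block;    /* elements per node = ceil(n / nodes) */
} array_def;

static void init_def(array_def *d, int n, int nodes, int me) {
    d->n = n; d->nodes = nodes; d->me = me;
    d->block = (n + nodes - 1) / nodes;   /* decided at run-time */
}

/* Global index -> (owner node, local offset), always recomputed from
 * the structure rather than from compile-time constants. */
static int owner(const array_def *d, int g) { return g / d->block; }
static int local(const array_def *d, int g) { return g % d->block; }

int main(void) {
    array_def d;
    init_def(&d, 100, 8, 3);   /* 100 elements on an 8-node cube, node 3 */
    printf("global 42 lives on node %d at offset %d\n",
           owner(&d, 42), local(&d, 42));
    return 0;
}
```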
{"title":"Hypertasking Support for Dynamically Redistributable and Resizeable Arrays on the iPSC","authors":"M. Baber","doi":"10.1109/DMCC.1991.633086","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633086","url":null,"abstract":"Static allocations of arrays on multicomputers have two major shortcomings. First, algorithms often employ more than one referencepattern for a given array, resulting in the need for more than one mapping between the array elements and the multicomputer nodes. Secondly, it is desirable to provide easily resizeable arrays, especially for multigrid algorithms. This paper describes extensions to the hypertasking paracompiler which provide both dynamically resizeable and redistributable arrays. Hypertasking is a parallel programming tool that transforms C programs containing comment-directives into SPMD Cprogirams that can be run on any size hypercube without recompilation for each cube size. Introduction This paper describes extensions tc~ hypertasking [ 11, a domain decomposition tool that operates on commentdirectives inserted into ordinary sequential C source code. The extensions support run-time redistribution and resizing of arrays. Hypertasking is one of seveial projects [4,5,6,8] that have proposed or produced sourceto-source compilers for parallel architectures. I refer to this class of software tools as paracompilers to distinguish them from the sequential source-to-object compilers they are built upon. A fundamental question for paracompiler designers is whether to make decisions about data and control decomposition at compile-time or at ruin-time. If decisions are made at compile-time, the logic does not have to be repeated every time the program is executed and it is possible to optimize the code for known parameters. * Supported in part by: Defense Advanced Research Projects Agency Information Science and Technology Office Research in Concurrent Computing Systems ARPA Order No. 6402.6402-1; Program Code No. 8E20 & 9E20 Issued by DARPAKMO under Contract #&IDA-972-89-C-0034 Unfortunately, compile-time decisions are also inflexible. Hypertasking nnakes all significant decisions about decomposition at ]run-time. A run-time initialization routine is called by each node to assign values to the members of an amay definition structure. The C code generated by the paracompiler references the values in the structure instead of constants chosen at compile-time. The resulting code is surprisingly efficient. Furthermore, because it is relatively straightforward to change the decomposition variables in the array definition structure, run -ti me decomposition great 1 y facilitates the implementation of dynamic array resizing and redistribution features such as those described in this paper. This paper will begin with an overview of the Hypertasking programming model to provide a framework for the new features. Beginning with redistributable arrays, the purpose and performance of the new features are discussed with reference to example programs. Finally, conclusions and goals for future research are presented. Hypertasking overview Hypertasking is; designed to make it easy for software developers to port their existing data parallel applications to a m","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. 
Proceedings","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133137719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hypercube Vs Cube-Connected Cycles: A Topological Evaluation
Pub Date: 1991-04-28 | DOI: 10.1109/DMCC.1991.633358
S. Kambhatla
Hypercubes and cube-connected cycles differ in the number of links per node, which has fundamental implications on several issues including performance and ease of implementation. In this paper, we evaluate these networks with respect to a number of parameters including several topological characterizations, fault-tolerance, and various broadcast and point-to-point communication primitives. In the process we also derive several lower bound figures and describe algorithms for communication in cube-connected cycles. We conclude that while having a lower number of links per node in a CCC might not degrade performance drastically (especially for lower dimensions) as compared to a hypercube of a similar size, this feature has several consequences which substantially aid its (VLSI and non-VLSI) implementation.
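The standard topological figures behind this trade-off (these are textbook values for dimension k, not results derived in the paper): the CCC replaces the hypercube's growing node degree with a fixed degree of 3, at the cost of a roughly 2.5x longer diameter.

```latex
\[
\begin{array}{lccc}
                & \text{nodes}  & \text{degree} & \text{diameter} \\
Q_k             & 2^k           & k             & k \\
\mathrm{CCC}_k  & k \cdot 2^k   & 3             & 2k + \lfloor k/2 \rfloor - 2 \;\; (k \ge 4)
\end{array}
\]
```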
{"title":"Hypercube Vs Cube-Connected Cycles: A Topological Evaluation","authors":"S. Kambhatla","doi":"10.1109/DMCC.1991.633358","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633358","url":null,"abstract":"Hypercubes and cube-connected cycles di'er in the number of links per node which has fundamental implications on several issues including performance and ease of implementation. In this paper, we evaluate these networks with respect to a number of parameters including several topological characterizations, fault-tolerance, various broadcast and point-to-point communication primitives. In the process we also derive several lower bound figures and describe algorithms for communication in cube-connected cycles. We conclude that while having lower number of links per node in a CCC might not degrade performance drastically (especially for lowe,r dimensions) as compared to a hypercube of a similar size, this feature has several consequences which substantially aid its (VLSI and non- VLSI) implementation.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122289023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Finite Difference Solution of Two- and Three-Dimensional Semiconductor Problems on the Connection Machine
Pub Date: 1991-04-28 | DOI: 10.1109/DMCC.1991.633216
K. Dalton, E. Hensel, S. Castillo, K. Ng
A study of the finite difference solution of the nonlinear partial differential equations governing two- and three-dimensional semiconductor devices is conducted on a SIMD computer. This nonlinear system is solved using Jacobi iteration and successive under-relaxation. Row scaling and a zero order regularizer are used to aid in convergence. On a 16K CM-2, problems with up to 16.7 million unknowns have been solved. Problems of this size have not previously been reported. The ability to accurately model larger and more realistic three-dimensional devices is necessary to gain a greater physical understanding of their behavior.
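A minimal sketch of the two solver ingredients named above, Jacobi iteration damped by under-relaxation (0 < w < 1), applied here to a small 2-D Laplace grid. This is illustrative only; the paper's semiconductor equations are nonlinear and vastly larger.

```c
/* Jacobi sweep with successive under-relaxation on an N x N grid. */
#include <stdio.h>
#include <math.h>

#define N 8
static double u[N][N], v[N][N];

int main(void) {
    const double w = 0.8;                        /* under-relaxation factor */
    for (int i = 0; i < N; i++) u[i][0] = 1.0;   /* boundary condition */

    for (int it = 0; it < 500; it++) {
        double diff = 0.0;
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                double jac = 0.25 * (u[i-1][j] + u[i+1][j]
                                   + u[i][j-1] + u[i][j+1]);
                v[i][j] = (1.0 - w) * u[i][j] + w * jac;  /* damped update */
                diff = fmax(diff, fabs(v[i][j] - u[i][j]));
            }
        for (int i = 1; i < N - 1; i++)          /* commit the sweep */
            for (int j = 1; j < N - 1; j++) u[i][j] = v[i][j];
        if (diff < 1e-8) { printf("converged after %d sweeps\n", it + 1); break; }
    }
    return 0;
}
```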
{"title":"The Finite Difference Solution of Two- and Three-Dimensional Semiconductor Problems on the Connection Machine","authors":"K. Dalton, E. Hensel, S. Castillo, K. Ng","doi":"10.1109/DMCC.1991.633216","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633216","url":null,"abstract":"A study of the finite difSerence solution of the nonlinear partial differential equations governing twoand three-dimensional semiconductor devices is conducted on a SIMD computer. This nonlinear system is solved using Jacobi iteration and successive-under-relaxation. Row scaling and a zero order regularizer are used to aid in convergence. On a 16K CM-2 problems with up to 16.7 million unknowns have been solved. Problems of this size have not previously been reported. The ability to accurately model larger and more realistic three-dimensional devices is necessary to gain a greater physical understanding of their behavior.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127401838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Parallel Execution of IDA on Shared and Distributed Memory Multiprocessors
Pub Date: 1991-04-28 | DOI: 10.1109/DMCC.1991.633162
V. Saletore, L. Kalé
{"title":"Efficient Parallel Execution of IDA on Shared and Distributed Memory Multiprocessors","authors":"V. Saletore, L. Kalé","doi":"10.1109/DMCC.1991.633162","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633162","url":null,"abstract":"","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116563839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal All-to-All Personalized Communication with Minimum Span on Boolean Cubes
Pub Date: 1991-04-28 | DOI: 10.1109/DMCC.1991.633150
S. Johnsson, Ching-Tien Ho
All-to-all personalized communication is a class of permutations in which each processor sends a unique message to every other processor. We present optimal algorithms for concurrent communication on all channels in Boolean cube networks, both for the case with a single permutation, and the case where multiple permutations shall be performed on the same local data set, but on different sets of processors. For K elements per processor our algorithms give the optimal number of element transfers, K/2. For a succession of all-to-all personalized communications on disjoint subcubes of p dimensions each, our best algorithm yields K/2 + σ - p element exchanges in sequence, where σ is the total number of processor dimensions in the permutation. An implementation on the Connection Machine of one of the algorithms offers a maximum speed-up of 50% compared to the previously best known algorithm.
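To make the communication pattern itself concrete, here is the textbook pairwise schedule for an all-to-all personalized exchange on a d-cube: at step s, node r exchanges with node r XOR s, so every node meets every other exactly once and all channels can be kept busy. This is the standard schedule, not necessarily the paper's optimal algorithm.

```c
/* Print the n-1 exchange steps of a pairwise all-to-all on a d-cube. */
#include <stdio.h>

int main(void) {
    const int d = 3, n = 1 << d;          /* 8-node Boolean cube */
    for (int s = 1; s < n; s++) {
        printf("step %d:", s);
        for (int r = 0; r < n; r++)
            if (r < (r ^ s))              /* print each pair once */
                printf("  %d<->%d", r, r ^ s);
        printf("\n");
    }
    return 0;
}
```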
{"title":"Optimal All-to-All Personalized Communication with Minimum Span on Boolean Cubes","authors":"S. Johnsson, Ching-Tien Ho","doi":"10.1109/DMCC.1991.633150","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633150","url":null,"abstract":"All-to-all personalized communication is a class, of permutations in which each processor sends a unique message to every other processor. We present optimal algorithms for concurrent communication on all channels in Boolean cube networks, both for the case with a single permutation, and the case where multiple permutations shall be performed on the same local data set, but on different sets of processors. For K elements per processor our algorithms give the optimal number of elements transfer, K/2. For a succession of all-to-all personalized communications on disjoint subcubes of p dimensions each, our best algorithm yields $.+c-p element exchanges in sequence, where cr is the total number of processor dimensions in the permutation. An implementation on the Connection Machine of one of the algorithms offers a maximum speed-up of 50% compared to the previously best known algorithm.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128596445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}