Scientific visualization theatre
T. Sterling
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234874
Summary form only given. Discusses the latest results of massively parallel processing (MPP) applications, presented through high-resolution graphics and animation. Three themes are represented, demonstrating the relationship between massively parallel computing and scientific visualization. Results of applications computed on MPPs and visualized on graphics workstations are shown for many of the cases. Examples of result data whose image rendering is performed using parallel algorithms on MPPs are shown, and some performance measurements are given. Finally, graphics presentation of data representing the behavioral dynamics of MPPs is shown, opening the way for scientific visualization to assist in the optimization of MPP computation.

Dynamic precision iterative algorithms
D. Kramer, I. Scherson
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234930
The authors address the use of DP (dynamic precision) in fixed-point iterative numerical algorithms. These algorithms are used in a wide range of numerically intensive scientific applications. One such algorithm, Muller's method, detects complex roots of an arbitrary function. This algorithm was implemented in DP on various architectures, including a MasPar MP-1 massively parallel processor and a Cray Y-MP vector processor. The results show that the use of DP can lead to a significant speedup of iterative algorithms on multiple-range architectures.

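The abstract names Muller's method but gives no implementation details. As an illustrative sketch only (a serial textbook version, not the paper's dynamic-precision MPP implementation), the method fits a parabola through three iterates and steps to the parabola's nearer root; because the step uses complex arithmetic, it can reach complex roots from real starting points:

```python
import cmath

def muller(f, x0, x1, x2, tol=1e-12, max_iter=100):
    """Muller's method: interpolate f by a quadratic through
    (x0, x1, x2) and move to the quadratic's root nearest x2."""
    for _ in range(max_iter):
        h1, h2 = x1 - x0, x2 - x1
        d1 = (f(x1) - f(x0)) / h1          # divided differences
        d2 = (f(x2) - f(x1)) / h2
        a = (d2 - d1) / (h2 + h1)
        b = a * h2 + d2
        c = f(x2)
        disc = cmath.sqrt(b * b - 4 * a * c)
        # pick the denominator of larger magnitude for numerical stability
        denom = b + disc if abs(b + disc) > abs(b - disc) else b - disc
        dx = -2 * c / denom
        x0, x1, x2 = x1, x2, x2 + dx
        if abs(dx) < tol:
            return x2
    return x2

# x^2 + 1 has roots +/- i; real starting points still converge to one of them
root = muller(lambda x: x * x + 1, 0.5, 1.0, 1.5)
```

The dynamic-precision idea in the paper is orthogonal to this sketch: early iterations would run at low precision and later ones at higher precision as the iterate converges.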
Massively parallel sparse LU factorization
S. Kratzer
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234896
The multifrontal algorithm for sparse LU factorization has been expressed as a data parallel program that is suitable for massively parallel computers. A new way of mapping data and computations to processors is used, and good processor utilization is obtained even for unstructured sparse matrices. The sparse problem is decomposed into many smaller, dense subproblems, with low overhead for communications and memory access. Performance results are provided for factorization of regular and irregular finite-element grid matrices on the MasPar MP-1.

Communication overhead on the CM5: an experimental performance evaluation
R. Ponnusamy, A. Choudhary, G. Fox
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234899
The authors present experimental results for communication overhead on the scalable parallel machine CM-5. It is observed that the communication latency of the data network is 88 μs. It was also observed that the communication cost for messages that are a multiple of 16 bytes is much smaller than for messages that are not; therefore, for better performance, a user should pad messages to a multiple of 16 bytes. The authors also studied the communication overhead of three complete exchange algorithms. For small message sizes, the recursive exchange algorithm performs best, especially for large multiprocessors; for large message sizes, the pairwise exchange algorithm is preferable. Finally, the authors studied two algorithms for one-to-all broadcast: the linear broadcast algorithm does not perform well, while the recursive broadcast algorithm does.

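The padding recommendation in the abstract is simple arithmetic. A minimal sketch (the function name and zero-fill choice are illustrative assumptions, not from the paper) of rounding a message up to the next 16-byte boundary:

```python
def pad_to_multiple(payload: bytes, align: int = 16) -> bytes:
    """Pad a message with trailing zero bytes so its length is a
    multiple of `align`. The abstract reports that CM-5 messages
    whose size is a multiple of 16 bytes incur much lower cost."""
    rem = len(payload) % align
    if rem:
        payload += b"\x00" * (align - rem)
    return payload

msg = pad_to_multiple(b"hello world")  # 11 bytes padded up to 16
```

In practice the receiver must know the true payload length (e.g. carried in a header), since the padding bytes are indistinguishable from data.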
Throughput analysis of pipelined multiprocessor modules
S.-Y. Lee
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234926
A feasible form of parallel architecture is one consisting of several pipeline stages, each of which is a multiprocessor module with a large number of processing elements (PEs). In many applications, such as real-time image processing and dynamic control, the optimal computing structure takes this form. In this study, the performance of a parallel processing model of such an organization is analyzed. In particular, the effect of interstage communication on the throughput of the model is investigated to suggest an efficient way of transferring data between stages. The numerical results could serve as a useful guideline for designing a parallel computer system consisting of pipeline stages, each containing a large number of PEs.

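The paper's analytical model is not reproduced in the abstract. As a heavily simplified assumption of my own (not the author's model), steady-state throughput of a linear pipeline is limited by the slowest stage plus whatever interstage transfer time cannot be overlapped with computation:

```python
def pipeline_throughput(stage_times, comm_times):
    """Results per unit time for a linear pipeline in steady state.
    stage_times[i] is the compute time of stage i; comm_times[i] is
    the non-overlapped time to ship stage i's output onward.
    Illustrative first-order model only."""
    beats = [s + c for s, c in zip(stage_times, comm_times)]
    return 1.0 / max(beats)

# beat time is max(2+1, 3+0, 1+1) = 3, so throughput is 1/3
rate = pipeline_throughput([2.0, 3.0, 1.0], [1.0, 0.0, 1.0])
```

Such a model makes the abstract's point concrete: shrinking the communication term on the bottleneck stage is what raises throughput, which is why the transfer scheme between stages matters.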
Hues in control? (massively parallel computers)
D. Schaefer, R. Portee
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234938
A methodology for the description and analysis of massively parallel computers is presented. Massively parallel structures are modeled with a data path graph, a precedence graph, and a control structure. The control structure, specified with colored Petri nets, employs nomenclature that provides a concise representation of thousands of Petri places and transitions.

Traffic analysis of hypercubes and banyan-hypercubes
A. Bellaachia, A. Youssef
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234950
The routing performance of banyan-hypercubes (BHs) is studied and compared with that of hypercubes. To evaluate the routing capabilities of BHs and hypercubes, a communication model is assumed. Based on this model, the traffic intensity of both networks is computed and the saturation probability of each network is determined. To compute the average time delay, the average queue length, the throughput, and the maximum queue size, extensive simulations were conducted for both networks over different network sizes and packet generation rates. The saturation probability obtained from the simulations is very close to that computed theoretically. The simulation results showed that all of the aforementioned measures decrease as the network size grows. BHs with more than two levels are shown to congest faster than a hypercube of the same size and to deliver less throughput. However, a two-level BH performs better than a hypercube of the same size. Although the BH has a better diameter and average distance, it does not necessarily have better communication capabilities than hypercubes.

Automatic data distribution for nearest neighbor networks
M. Philippsen
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234890
An algorithm for mapping an arbitrary, multidimensional array onto an arbitrarily shaped multidimensional nearest-neighbor network of a distributed memory machine is presented. The individual dimensions of the array are labeled with high-level usage descriptors that either can be provided by the programmer or can be derived by sophisticated static compiler analysis. The presented algorithm achieves an appropriate exploitation of nearest-neighbor communication and allows for efficient address calculations. The author describes the integration of this technique into an optimizing compiler for Modula-2 and derives extensions that render efficient translation of nested parallelism possible and that provide support for thread scheduling.

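The abstract does not spell out the mapping; the paper's algorithm is driven by per-dimension usage descriptors. As a much simpler stand-in for intuition only, a plain block distribution maps each array dimension onto one processor-grid dimension, so that neighboring array elements land on the same or adjacent processors, which is what makes nearest-neighbor communication cheap:

```python
def block_owner(index, extents, grid):
    """Map a multidimensional array index to processor-grid coordinates
    under a block distribution: dimension d of the array (extents[d]
    elements) is split into grid[d] contiguous blocks.
    Illustrative sketch; not the descriptor-driven algorithm of the paper."""
    coords = []
    for i, n, p in zip(index, extents, grid):
        block = -(-n // p)        # ceil(n / p): elements per block
        coords.append(i // block)
    return tuple(coords)

# element (7, 2) of an 8x8 array on a 2x2 grid: blocks of 4x4, owner (1, 0)
owner = block_owner((7, 2), (8, 8), (2, 2))
```

Under this scheme a stencil access to index i +/- 1 touches at most the adjacent processor in that grid dimension, and the owner computation is a divide per dimension, consistent with the abstract's claim of efficient address calculations.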
Program transformation in massively parallel systems
T. Al-Marzooq, F. Bastani
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234873
The authors present two problems in mapping highly maintainable, expressive parallel code manipulating multidimensional arrays onto massively parallel computers: bottlenecks due to simultaneous accesses in the EREW model, and interprocessor communication. They present a source code transformation approach to the expressibility-versus-performance problem for multidimensional arrays designed with a four-level hierarchical design of the data types (aggregate, abstract, logical, and physical levels). A systematic method is developed to transform high-level, low-performance parallel code into efficient low-level code. The method is illustrated with matrix multiplication. It is also used to generate high-performance logical-level code for the backpropagation algorithm of neural networks, which makes extensive use of matrices. The transformed code has much higher performance than code with a naive mapping.

Efficient algorithms for locating a core of a tree network with a specified length
S. Peng, W. Lo
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234904
The authors present efficient algorithms for finding a core of a tree with a specified length, for both sequential and parallel computational models. The algorithms can be readily extended to a tree network in which arcs have nonnegative integer lengths. The authors also present a parallel version of the algorithm on an EREW PRAM (parallel random access machine) model. The results presented might provide a basis for the study of other facility shapes, such as trees and forests of fixed sizes.