Scientific visualization theatre
T. Sterling
Pub Date: 1992-10-19 · DOI: 10.1109/FMPC.1992.234874
Summary form only given. Discusses the latest results of massively parallel processing (MPP) applications, presented through high-resolution graphics and animation. Three themes are represented, demonstrating the relationship between massively parallel computing and scientific visualization. For many of the cases, results of applications computed on MPPs and visualized on graphics workstations are shown. Examples of result data whose image rendering is performed using parallel algorithms on MPPs are also shown, and some performance measurements are given. Finally, graphical presentations of data representing the behavioral dynamics of MPPs are shown, opening the way for scientific visualization to assist in the optimization of MPP computation.
{"title":"Scientific visualization theatre","authors":"T. Sterling","doi":"10.1109/FMPC.1992.234874","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234874","url":null,"abstract":"Summary form only given. Discusses the latest in massively parallel processing (MPP) applications' results through high-resolution graphics and animation. Three themes are represented, demonstrating the relationship between massively parallel computing and scientific visualization. Results of applications computed on MPPs and visualized on graphics workstations are shown for many of the cases. Examples of result data whose image rendering are performed using parallel algorithms on MPPs are shown, and some performance measurements are given. Finally, graphics presentation of data representing the behavioral dynamics of MPPs are shown, opening the way for scientific visualization to assist in the optimization of MPP computation.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133031441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

The new frontiers: A workshop on future directions in massively parallel processing
I.D. Scherson
Pub Date: 1992-10-19 · DOI: 10.1109/FMPC.1992.234882
The task of identifying some of the basic research issues facing modern massively parallel processing is addressed. Processing element architecture, interconnection networks, languages and compilers, and software development tools are considered.
{"title":"The new frontiers: A workshop on future directions in massively parallel processing","authors":"I.D. Scherson","doi":"10.1109/FMPC.1992.234882","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234882","url":null,"abstract":"The task of identifying some of the basic research issues facing modern massively parallel processing is addressed. Processing element architecture, interconnection networks, languages and compilers, and software development tools are considered.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131331901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Massively parallel sparse LU factorization
S. Kratzer
Pub Date: 1992-10-19 · DOI: 10.1109/FMPC.1992.234896
The multifrontal algorithm for sparse LU factorization has been expressed as a data-parallel program suitable for massively parallel computers. A new way of mapping data and computations to processors is used, and good processor utilization is obtained even for unstructured sparse matrices. The sparse problem is decomposed into many smaller, dense subproblems, with low overhead for communication and memory access. Performance results are provided for factorization of regular and irregular finite-element grid matrices on the MasPar MP-1.
{"title":"Massively parallel sparse LU factorization","authors":"S. Kratzer","doi":"10.1109/FMPC.1992.234896","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234896","url":null,"abstract":"The multifrontal algorithm for sparse LU factorization has been expressed as a data parallel program that is suitable for massively parallel computers. A new way of mapping data and computations to processors is used, and good processor utilization is obtained even for unstructured sparse matrices. The sparse problem is decomposed into many smaller, dense subproblems, with low overhead for communications and memory access. Performance results are provided for factorization of regular and irregular finite-element grid matrices on the MasPar MP-1.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131496654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Communication overhead on the CM5: an experimental performance evaluation
R. Ponnusamy, A. Choudhary, G. Fox
Pub Date: 1992-10-19 · DOI: 10.1109/FMPC.1992.234899
The authors present experimental results for communication overhead on the scalable parallel machine CM-5. The communication latency of the data network is observed to be 88 μs. It is also observed that the communication cost for messages whose size is a multiple of 16 bytes is much smaller than for messages that are not; for better performance, a user should therefore pad messages to a multiple of 16 bytes. The authors also study the communication overhead of three complete-exchange algorithms. For small message sizes, the recursive exchange algorithm performs best, especially on large multiprocessors; for large message sizes, the pairwise exchange algorithm is preferable. Finally, the authors study two algorithms for one-to-all broadcast: linear broadcast and recursive broadcast. Linear broadcast performs poorly, whereas recursive broadcast performs well.
{"title":"Communication overhead on the CM5: an experimental performance evaluation","authors":"R. Ponnusamy, A. Choudhary, G. Fox","doi":"10.1109/FMPC.1992.234899","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234899","url":null,"abstract":"The authors present experimental results for communication overhead on the scalable parallel machine CM-5. It is observed that the communication latency of the data network is 88 mu s. It was also observed that the communication cost for messages that are a multiple of 16 bytes is much smaller than for messages that are not, and therefore, for better performance, a user should pad messages to make them a multiple of 16 bytes. The authors also studied the communication overhead of three complete exchange algorithms. For small message sizes, the recursive exchange algorithm performs the best, especially for large multiprocessors. However, for large message sizes, the pairwise exchange algorithm is preferable. Finally, the authors studied two algorithms for one-to-all broadcast: the linear broadcast algorithm and the recursive broadcast algorithm. Linear broadcast does not perform well; the recursive broadcast algorithm performs well.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132838116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Parallel holographic image calculation and compression
D. M. Newman, D. Goeckel, R. D. Crawford, S. Abraham
Pub Date: 1992-10-19 · DOI: 10.1109/FMPC.1992.234923
The authors describe the parallel implementation of an algorithm suitable for hologram creation on a 16,384-processor SIMD (single-instruction, multiple-data) MasPar machine. When computing an image of typical complexity, the parallel implementation sacrifices up to 11% efficiency in data compression to gain performance up to 250 times greater than that achieved on a uniprocessor workstation. The MasPar achieves pattern generation more than 750 times faster than fully optimized Sparc C code.
{"title":"Parallel holographic image calculation and compression","authors":"D. M. Newman, D. Goeckel, R. D. Crawford, S. Abraham","doi":"10.1109/FMPC.1992.234923","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234923","url":null,"abstract":"The authors describe the parallel implementation of an algorithm suitable for hologram creation on a 16384 processor SIMD (single-instruction multiple-data) MasPar machine. When computing an image of typical complexity, the parallel implementation sacrifices up to 11% efficiency in data compression to gain a performance up to 250 times greater than that achieved on a uniprocessor workstation. The MasPar can achieve pattern generation more than 750 times faster than the fully optimized Sparc C code.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114330091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Throughput analysis of pipelined multiprocessor modules
S.-Y. Lee
Pub Date: 1992-10-19 · DOI: 10.1109/FMPC.1992.234926
A feasible form of parallel architecture is one consisting of several pipeline stages, each of which is a multiprocessor module with a large number of processing elements (PEs). In many applications, such as real-time image processing and dynamic control, the optimal computing structure takes this form. In the present study, the performance of a parallel processing model with this organization is analyzed. In particular, the effect of interstage communication on the throughput of the model is investigated, to suggest an efficient way of transferring data between stages. The numerical results obtained in this study can serve as a useful guideline for designing a parallel computer system consisting of pipeline stages, each containing a large number of PEs.
{"title":"Throughput analysis of pipelined multiprocessor modules","authors":"S.-Y. Lee","doi":"10.1109/FMPC.1992.234926","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234926","url":null,"abstract":"A feasible form of parallel architecture would be one which consists of several pipeline stages, each of which is a multiprocessor module of a large number of processing elements (PEs). In many applications, such as real-time image processing and dynamic control, the optimized computing structure would be in this form. In the present study, the performance of a parallel processing model of such an organization has been analyzed. In particular, the effect of interstage communication on throughput of the model has been investigated to suggest an efficient way of transferring data between stages. The numerical results obtained in this study could be a useful guideline for designing a parallel computer system consisting of pipeline stages each of which contains a large number of PEs.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"601 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123192724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Traffic analysis of hypercubes and banyan-hypercubes
A. Bellaachia, A. Youssef
Pub Date: 1992-10-19 · DOI: 10.1109/FMPC.1992.234950
The routing performance of banyan-hypercubes (BHs) is studied and compared with that of hypercubes. To evaluate the routing capabilities of BHs and hypercubes, a communication model is assumed. Based on this model, the traffic intensity of both networks is computed and the saturation probability of each network is determined. To compute the average time delay, the average queue length, the throughput, and the maximum queue size, extensive simulations were conducted for both networks at different sizes and different packet generation rates. The saturation probability obtained through simulation is very close to that computed theoretically. The simulation results show that all of the aforementioned measures decrease as the network size grows. BHs with more than two levels are shown to congest faster than a hypercube of the same size and to deliver less throughput; however, a two-level BH performs better than a hypercube of the same size. Thus, although the BH has a better diameter and average distance, it does not necessarily have better communication capabilities than the hypercube.
{"title":"Traffic analysis of hypercubes and banyan-hypercubes","authors":"A. Bellaachia, A. Youssef","doi":"10.1109/FMPC.1992.234950","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234950","url":null,"abstract":"The routing performance of banyan-hypercubes (BHs) is studied and compared with that of hypercubes. To evaluate the routing capabilities of BHs and hypercubes, a communication model is assumed. Based on this model, the traffic intensity of both networks is computed and the saturation probability of each network is determined. To compute the average time delay, the average queue length, the throughput, and the maximum queue size, extensive simulations were conducted for both networks for different sizes and different package generation packet rates. The saturation probability obtained through the simulation results is very close to that computed theoretically. The simulation results showed that all of the aforementioned measures are decreased when the network size gets larger. BHs with more than two levels are shown to congest faster than a hypercube of the same size, and deliver less throughput. However, a two-level BH has better performance than a hypercube of the same size. Although the BH has a better diameter and average distance, it does not necessarily have better communication capabilities than hypercubes.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115852729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Program transformation in massively parallel systems
T. Al-Marzooq, F. Bastani
Pub Date: 1992-10-19 · DOI: 10.1109/FMPC.1992.234873
The authors present two problems in mapping highly maintainable, expressive parallel code that manipulates multidimensional arrays onto massively parallel computers: bottlenecks due to simultaneous accesses in the EREW model, and interprocessor communication. They present a source-code transformation approach that resolves the tension between expressibility and high performance for multidimensional arrays designed with a four-level hierarchy of data types (aggregate, abstract, logical, and physical levels). A systematic method is developed to transform high-level, low-performance parallel code into efficient low-level code. The method is illustrated with matrix multiplication. It is also used to generate high-performance logical-level code for the backpropagation algorithm for neural networks, which makes extensive use of matrices. The transformed code has much higher performance than code with a naive mapping.
{"title":"Program transformation in massively parallel systems","authors":"T. Al-Marzooq, F. Bastani","doi":"10.1109/FMPC.1992.234873","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234873","url":null,"abstract":"The authors present two problems in mapping highly maintainable expressive parallel code manipulating multidimensional arrays in massively parallel computers: bottlenecks due to simultaneous accesses in the EREW model, and interprocessor communication. They present a source code transformation approach to solve the expressibility-high-performance problem for the multidimensional arrays designed with a four-level hierarchical design of the data types (aggregate, abstract, logical, and physical levels). A systematic method is developed to transform parallel high-level low-performance code into parallel low-level efficient ones. The method is illustrated with matrix multiplication. The method is also used to generate high-performance logical-level code for the backpropagation algorithm of neural networks that makes extensive use of matrices. The transformed code has a much higher performance than the code with a naive mapping.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129613076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Quantitative studies of processing element granularity
T. C. Marek, E. Davis
Pub Date: 1992-10-19 · DOI: 10.1109/FMPC.1992.234925
Quantitative results of experiments on PE (processing element) granularities are presented. An architecture simulation workbench has been developed for experiments on PE granularities of 1, 4, 8, and 16 bits; it also supports analysis of the impact of various I/O (input/output) and communication path widths. Overall performance, communication balance, PE utilization, and operand lengths can be monitored to evaluate the merits of various granularities and feature sets. The workbench has been used to run a set of benchmark algorithms covering a range of computation and communication requirements, data sizes, and problem array sizes. The authors report results for two of the algorithms studied by T.C. Marek (1992): image rotation and image resampling. The results are counterintuitive: they indicate that bit-serial machines have performance advantages due to inherent bit-oriented activity, even when using multiple-bit operands, and due to inter-PE communication when paths are narrower than the processor granularity.
{"title":"Quantitative studies of processing element granularity","authors":"T. C. Marek, E. Davis","doi":"10.1109/FMPC.1992.234925","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234925","url":null,"abstract":"Quantitative results of experiments on PE (processing element) granularities are presented. An architecture simulation workbench has been developed for experiments on PE granularities of 1, 4, 8, and 16-b. An analysis of the impact of various I/O (input/output) and communication path widths is also possible. Overall performance, communication balance, PE utilization, and operand lengths can be monitored to evaluate the merits of various granularities and feature sets. This workbench has been used to run a set of benchmark algorithms that cover a range of computation and communication requirements, a range of data sizes, and a range of problem array sizes. The authors report results for two of the algorithms studied by T.C. Marek (1992): image rotation and image resampling. The results obtained are counterintuitive. They indicate that bit-serial machines have performance advantages due to inherent bit-oriented activity, even when using multiple bit operands, and to inter-PE communication when paths are narrower than the processor granularity.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"78 12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129763369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Automatic data distribution for nearest neighbor networks
M. Philippsen
Pub Date: 1992-10-19 · DOI: 10.1109/FMPC.1992.234890
An algorithm is presented for mapping an arbitrary multidimensional array onto an arbitrarily shaped multidimensional nearest-neighbor network of a distributed-memory machine. The individual dimensions of the array are labeled with high-level usage descriptors that can either be provided by the programmer or be derived by sophisticated static compiler analysis. The algorithm achieves an appropriate exploitation of nearest-neighbor communication and allows for efficient address calculation. The author describes the integration of this technique into an optimizing compiler for Modula-2 and derives extensions that make efficient translation of nested parallelism possible and provide support for thread scheduling.
{"title":"Automatic data distribution for nearest neighbor networks","authors":"M. Philippsen","doi":"10.1109/FMPC.1992.234890","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234890","url":null,"abstract":"An algorithm for mapping an arbitrary, multidimensional array onto an arbitrarily shaped multidimensional nearest-neighbor network of a distributed memory machine is presented. The individual dimensions of the array are labeled with high-level usage descriptors that either can be provided by the programmer or can be derived by sophisticated static compiler analysis. The presented algorithm achieves an appropriate exploitation of nearest-neighbor communication and allows for efficient address calculations. The author describes the integration of this technique into an optimizing compiler for Modula-2 and derives extensions that render efficient translation of nested parallelism possible and that provide support for thread scheduling.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116134364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}