Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234919
D. Rover, Xian-He Sun
The scaling of algorithms and machines is essential to achieving the goals of high-performance computing, so scalability has become an important aspect of parallel algorithm and machine design. It is a desirable property, used to describe the demand for proportionate changes in performance as system size is adjusted, and it should guide the choice of an optimal combination of architecture, algorithm, machine size, and problem size. As a performance metric, however, it is not yet well defined or understood. The paper summarizes several scalability metrics, including one that highlights the behavior of algorithm-machine combinations as sizes are varied under an isospeed condition. A scaling relation is presented to facilitate general mathematical and visual techniques for characterizing and comparing the scalability information of these metrics.
{"title":"Representing the scaling behavior of parallel algorithm-machine combinations","authors":"D. Rover, Xian-He Sun","doi":"10.1109/FMPC.1992.234919","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234919","url":null,"abstract":"The scaling of algorithms and machines is essential to achieve the goals of high-performance computing. Thus, scalability has become an important aspect of parallel algorithm and machine design. It is a desirable property that has been used to describe the demand for proportionate changes in performance with adjustments in system size. It should provide guidance toward an optimal choice of an architecture, algorithm, machine size, and problem size combination. However, as a performance metric, it is not yet well defined or understood. The paper summarizes several scalability metrics, including one that highlights the behavior of algorithm-machine combinations as sizes are varied under an isospeed condition. A scaling relation is presented to facilitate general mathematical and visual techniques for characterizing and comparing the scalability information of these metrics.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122312135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
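The isospeed condition referenced in this abstract holds the achieved speed per processor fixed while machine and problem size grow together. A minimal sketch of the resulting scalability ratio, assuming the commonly cited form ψ(p, p') = (p'·W)/(p·W') of the Sun-Rover isospeed metric (the formula and the numbers below are illustrative, not taken from the paper):

```python
def isospeed_scalability(p, W, p2, W2):
    """Isospeed scalability psi(p, p') = (p' * W) / (p * W'),
    where W' is the work needed on p' processors to keep the
    average per-processor speed equal to that on p processors.
    psi == 1 is ideal (work grows linearly with machine size);
    psi < 1 means work must grow faster than the machine.
    """
    return (p2 * W) / (p * W2)

# Ideal case: doubling the processors needs exactly double the work.
print(isospeed_scalability(4, 1000, 8, 2000))   # 1.0
# Work must quadruple to hold speed when processors double.
print(isospeed_scalability(4, 1000, 8, 4000))   # 0.5
```

ψ = 1 means the problem only has to grow linearly with the machine to sustain speed; smaller values quantify how much faster than linearly it must grow.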
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234889
S. Darbha, E. Davis
It is shown that a nearest-neighbor communication network can be complemented with a log-diameter multistage network to handle different communication patterns. This is especially useful when the pattern of data movement is not uniform. The designed network is evaluated for two cases: a dense case with many processing elements communicating, and a sparse case. For 32-b data, the algorithm for computing partial sums of an array improves by a factor of 2.7 with the multistage interconnection network. In a sparse random case, the number of cycles taken to communicate 32 b is 4000 (with 10% of the nodes communicating). It is concluded that a network like a multistage omega network is very useful for SIMD (single-instruction multiple-data) massively parallel machines, especially if the machine is to be used for applications that need long-distance and nonuniform routing patterns.
{"title":"Network design and performance for a massively parallel SIMD system","authors":"S. Darbha, E. Davis","doi":"10.1109/FMPC.1992.234889","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234889","url":null,"abstract":"It is shown that a nearest neighbor communication network can be complimented with a log-diameter multistage network to handle different communications patterns. This is especially useful when the pattern of data movement is not uniform. The designed network is evaluated for two cases: a dense case with many processing elements communicating and a sparse case. For 32-b data, the algorithm for computing partial sums of an array improves by 2.7 times with the multistage interconnection network. In a sparse random case, the number of cycles taken to communicate 32 b is 4000 (with 10% of the nodes communicating). Thus, it is concluded that a network like a multistage omega network is very useful for SIMD (single-instruction multiple-data) massively parallel machines. This is especially true if the machine is to be used for applications where long distance and nonuniform routing patterns are needed.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122876280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
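The gain reported above comes from replacing many nearest-neighbor hops with a few log-diameter network traversals when data movement is global, as in a parallel partial-sum. A rough step-count comparison under an idealized model (one hop or one network traversal per step, p a perfect square; these counts are illustrative, not the paper's measured cycle counts):

```python
import math

def mesh_reduction_steps(p):
    """Nearest-neighbor 2-D mesh of p PEs (p a perfect square):
    a sum needs on the order of 2*(sqrt(p) - 1) shift steps,
    one sweep per dimension."""
    side = math.isqrt(p)
    return 2 * (side - 1)

def multistage_reduction_steps(p):
    """Log-diameter multistage network: recursive doubling finishes
    a reduction in ceil(log2 p) exchange steps."""
    return math.ceil(math.log2(p))

for p in (64, 1024, 16384):
    print(p, mesh_reduction_steps(p), multistage_reduction_steps(p))
```

At 1024 PEs the idealized mesh needs 62 shift steps against 10 network traversals, which is why the advantage grows with machine size and with routing nonuniformity.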
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234868
C.-S. Chang, G. DeTitta, H. Hauptman, R. Miller, M. Poulin, P. Thuman, C. Weeks
The authors have developed a formulation of the phase problem of X-ray crystallography in terms of a minimal function of the phases, together with a new minimization algorithm, called shake-and-bake, for solving this minimal function. The implementation details of the shake-and-bake strategy on the Connection Machine CM-2 are presented. The algorithm has been used to determine the atomic structures of four test structures ranging from 28 to 317 atoms, and the results indicate that shake-and-bake is effective on structures of this size.
{"title":"Solutions to the phase problem of X-ray crystallography on the Connection Machine CM-2","authors":"C.-S. Chang, G. DeTitta, H. Hauptman, R. Miller, M. Poulin, P. Thuman, C. Weeks","doi":"10.1109/FMPC.1992.234868","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234868","url":null,"abstract":"The authors have developed a formulation of the phase problem of X-ray crystallography in terms of a minimal function of phases and a new minimization algorithm called shake-and-bake for solving this minimal function. The implementation details of the shake-and-bake strategy on the Connection Machine CM-2 are presented. The shake-and-bake algorithm has been used to determine the atomic structure of four test structures, ranging from 28 to 317 atoms. These results indicate that shake-and-bake is effective on structures of this size.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129870625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234871
Mohammed Atiquzzaman, M. S. Akhtar
Hot spots in multistage interconnection networks (MSINs) result in degraded network performance. The authors develop an analytical model for the performance evaluation of unbuffered MSINs under a single hot spot, followed by a performance comparison with buffered MSINs. Under uniform traffic, a buffered network performs better than an unbuffered one. Under a nonuniform traffic pattern that causes congestion in the network (for example, tree saturation), an unbuffered network outperforms a buffered one. This leads the authors to suggest a hybrid network capable of switching from buffered to unbuffered mode in the presence of network congestion.
{"title":"Effect of hot spot on the performance of multistage interconnection networks","authors":"Mohammed Atiquzzaman, M. S. Akhtar","doi":"10.1109/FMPC.1992.234871","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234871","url":null,"abstract":"Hot spots in multistage interconnection networks (MSINs) results in performance degradation of the network. The authors develop an analytical model for the performance evaluation of unbuffered MSINs under a single hot spot, followed by a performance comparison with buffered MSINs. For uniform traffic, a buffered network performs better than an unbuffered network. For a nonuniform traffic pattern causing congestion (for example, tree saturation) in the network, an unbuffered network outperforms a buffered network. This leads the authors to suggest a hybrid network which will be capable of switching from the buffered mode to the unbuffered mode in the presence of network congestion.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"168 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124684721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
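A quick way to see why a single hot spot is so damaging is the standard asymptotic bandwidth bound in the spirit of Pfister and Norton's tree-saturation analysis. This back-of-the-envelope model is an assumption for illustration, not the authors' analytical model:

```python
def max_throughput(p, h):
    """Asymptotic aggregate bandwidth of a p-processor network with a
    single hot memory module receiving an extra fraction h of traffic.
    The hot module (serving 1 request/cycle) saturates first:
        per-PE rate r <= 1 / (1 + h*(p - 1)),  total <= p * r.
    """
    return p / (1 + h * (p - 1))

print(max_throughput(1024, 0.0))          # 1024.0, the uniform-traffic ideal
print(max_throughput(1024, 0.01))         # roughly 91 requests/cycle
```

Even a 1% hot-spot fraction caps a 1024-PE network near 1/h = 100 requests per cycle, an order of magnitude below the uniform-traffic ideal, which is the regime where the buffered-versus-unbuffered trade-off studied above matters.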
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234864
S. Dickey, R. Kenner
A pairwise combining switch has been implemented for use in the 16×16 processor/memory interconnection network of the NYU Ultracomputer prototype. The switch design may be extended for use in very large systems by providing greater combining capability, and methods for doing so are discussed.
{"title":"Combining switches for the NYU Ultracomputer","authors":"S. Dickey, R. Kenner","doi":"10.1109/FMPC.1992.234864","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234864","url":null,"abstract":"A pairwise combining switch has been implemented for use in the 16*16 processor/memory interconnection network of the NYU Ultracomputer prototype. The switch design may be extended for use in very large systems by providing greater combining capability. Methods for doing so are discussed.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123897494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
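Pairwise combining merges two fetch-and-add requests to the same address inside a switch, so only one request reaches memory, yet both issuers receive the values a serial execution would have produced. A simplified sketch of the bookkeeping (function names are hypothetical; the real switch does this in hardware per message):

```python
def combine(a, b):
    """Merge two fetch-and-add increments headed for the same address
    into a single request; save a's increment so the returned old
    value can be split on the way back."""
    return a + b, a  # (combined increment, saved residue)

def decombine(old, residue):
    """Memory returns `old` for the combined request; the switch
    replies `old` to the first requester and `old + residue` to the
    second, exactly as if the two ops had executed serially."""
    return old, old + residue

mem = 10
inc, saved = combine(3, 4)     # one request: fetch-and-add(x, 7)
old, mem = mem, mem + inc      # memory performs a single update
r1, r2 = decombine(old, saved)
print(r1, r2, mem)             # 10 13 17
```

The two requesters see 10 and 13, the same results as serial fetch-and-add(x, 3) then fetch-and-add(x, 4), while the memory module handled only one operation — the property that lets combining defuse hot spots on shared counters.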
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234920
V. Ramachandran, R. Raines, J.S. Park, N. Davis
Extends previous research on performance modeling of the fault-tolerant Augmented Shuffle Exchange Network (ASEN). The authors examine the run-time performance characteristics of the ASEN in a packet-switched environment. Network performance is examined under a fault-free but congested operating environment. Performance parameters of time-in-system, queue lengths, and delays, as well as the effects of nonuniform loading of the network, are presented. The implementation cost of an ASEN is compared with previously published metrics for the multistage cube network operating under the same environments. The authors conclude that, for the network and operating assumptions defined, the ASEN provides better performance at lower implementation cost than the multistage cube interconnection network.
{"title":"Performance studies of packet switched augmented shuffle exchange networks","authors":"V. Ramachandran, R. Raines, J.S. Park, N. Davis","doi":"10.1109/FMPC.1992.234920","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234920","url":null,"abstract":"Extends previous research efforts related to the performance modeling of the fault-tolerant Augmented Shuffle Exchange Network (ASEN). The authors examine the ASEN run-time performance characteristics in a packet switched environment. The network performance is examined under a fault-free but congested network operating environment. Network performance parameters of time-in-system, queue lengths and delays, as well as the effects of non-uniform loading of the network are presented. The cost associated with implementation of an ASEN is compared with previously published metrics for the multistage cube network operating under the same environments. The authors conclude that, for the network and operating assumptions defined, the ASEN provides better performance at lower implementation costs than the multistage cube interconnection network.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"165 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120929768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234901
S. Noh, K. Dussa-Zieger
The authors introduce a new type of combined SIMD/MIMD (single-instruction multiple-data/multiple-instruction multiple-data) architecture called a hybrid system. The hybrid system consists of two components. The first is massively parallel and consists of a large number of slow processors organized in an SIMD architecture. The second consists of only a few fast processors (possibly only one) organized in an MIMD architecture. The authors contend that a hybrid system provides a means to adjust to the characteristics of a parallel program, i.e., its changing parallelism. They describe the machine and application model, discuss the performance impact of such a system, and, viewing the CM-2 with its front end as a special case of a hybrid system, substantiate the arguments with measurements for a Gaussian elimination algorithm.
{"title":"Improving massively data parallel system performance with heterogeneity","authors":"S. Noh, K. Dussa-Zieger","doi":"10.1109/FMPC.1992.234901","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234901","url":null,"abstract":"The authors introduce a new type of combined SIMD/MIMD (single-instruction multiple-data/multiple-instruction multiple-data) architecture called a hybrid system. The hybrid system consists of two components. The first component is massively parallel and consists of a large number of slow processors that are organized in an SIMD architecture. The second component consists of only a few fast processors (possibly only one) which are organized in an MIMD architecture. The authors contend that a hybrid system provides a means to adequately adjust to the characteristics of a parallel program, i.e., changing parallelism. They describe the machine and application model, and discuss the performance impact of such a system. Viewing the CM-2 with its front-end as a special case of a hybrid system, they substantiate the arguments and report measurements for a Gaussian elimination algorithm.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"183 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121196196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
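The case for a hybrid system rests on applications whose degree of parallelism changes during execution; Gaussian elimination, used above, is the classic example because the active submatrix shrinks at every step. An illustrative parallelism profile (a simple counting model, not the paper's measurements):

```python
def parallelism_profile(n):
    """Degree of data parallelism at each step of an n x n Gaussian
    elimination: step k updates the (n-k-1) x (n-k) trailing
    submatrix, so the parallel width shrinks toward zero.  Early
    steps fill a wide SIMD array; late steps underuse it and suit
    a few fast MIMD processors instead."""
    return [(n - k - 1) * (n - k) for k in range(n - 1)]

print(parallelism_profile(8))   # [56, 42, 30, 20, 12, 6, 2]
```

The monotone drop from 56 concurrent updates to 2 is the "changing parallelism" the hybrid architecture is designed to track.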
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234894
D. Marinescu
The author analyzes a 3-D FFT (fast Fourier transform) algorithm for a distributed-memory MIMD (multiple-instruction multiple-data) system. It is shown that communication complexity limits the efficiency even under ideal conditions: the efficiency at the optimal speedup is η_opt = 0.5. Actual applications, which experience load imbalance, duplication of work, and blocking, are even less efficient, so the speedup with P processing elements, S(P) = η·P, is disappointingly low. Moreover, the 3-D FFT algorithm does not lend itself to massive parallelization, and the optimal number of PEs is rather low even for large problem sizes and fast communication. A strategy for reducing the communication complexity is presented.
{"title":"The speedup and efficiency of 3-D FFT on distributed memory MIMD systems","authors":"D. Marinescu","doi":"10.1109/FMPC.1992.234894","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234894","url":null,"abstract":"The author analyzes a 3-D FFT (fast Fourier transform) algorithm for a distributed memory MIMD (multiple-instruction multiple-data) system. It is shown that the communication complexity limits the efficiency even under ideal conditions. The efficiency for the optimal speedup is eta /sub opt/=0.5. Actual applications which experience load imbalance, duplication of work, and blocking are even less efficient. Therefore the speedup with P processing elements, S(P)= eta *P, is disappointingly low. Moreover, the 3-D FFT algorithm is not susceptible to massive parallelization, and the optimal number of PEs is rather low even for large problem size and fast communication. A strategy to reduce the communication complexity is presented.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121652632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
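One simple model that reproduces η_opt = 0.5 is a runtime with a 1/P compute term and a communication term that grows with P; at the speedup-maximizing machine size the two terms are equal, so half the time is spent communicating. This is a plausible illustration, not necessarily the author's exact derivation:

```python
import math

def model_time(P, A, B):
    """T(P) = A/P + B*P: computation shrinks with P while the
    communication cost grows with P (illustrative model)."""
    return A / P + B * P

A, B = 1e6, 1.0            # hypothetical work and per-PE comm cost
P_opt = math.sqrt(A / B)   # dT/dP = 0 here; adding PEs past this hurts
S_opt = model_time(1, A, B) / model_time(P_opt, A, B)
print(P_opt, S_opt / P_opt)   # P_opt = 1000, efficiency close to 0.5
```

Under this model the optimal machine size grows only as the square root of the problem's work, matching the abstract's observation that the optimal number of PEs stays low even for large problems.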
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234869
G. Cabodi, S. Gai, M. Reorda
A new algorithm for implementing the basic operations on BDDs (binary decision diagrams) on a massively parallel computer is presented. Each node is associated with a processor, and nodes belonging to the same level are evaluated together. The algorithm has been implemented on a Connection Machine CM-2, and the prototype is being tested on a set of benchmark applications. Experimental results, showing the time required to perform the apply operation on BDDs of growing size, confirm the complexity analysis and demonstrate the effectiveness of the approach.
{"title":"Boolean function manipulation on massively parallel computers","authors":"G. Cabodi, S. Gai, M. Reorda","doi":"10.1109/FMPC.1992.234869","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234869","url":null,"abstract":"A new algorithm for implementing the basic operations on BDDs (binary decision diagrams) on a massively parallel computer is presented. Each node is associated with a processor, and nodes belonging to the same level are evaluated together. An implementation of the algorithm on a Connection Machine CM2 has been done, and the prototype is being tested on a set of benchmark applications. Experimental results, showing the time required to perform the apply operation on BDDs of growing size demonstrate the exactness of the complexity analysis and the effectiveness of the approach.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127775475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
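The operation being parallelized here is the standard BDD apply, which combines two diagrams under a Boolean operator. The CM-2 algorithm evaluates all nodes of a level simultaneously, one processor per node; the sequential sketch below shows only what apply computes, not the parallel schedule:

```python
def var_of(n):
    """Level index of a node; terminals sort after every variable."""
    return n[0] if isinstance(n, tuple) else float('inf')

def apply_op(op, f, g, memo=None):
    """The BDD 'apply' operation: combine diagrams f and g under the
    Boolean operator `op`.  Nodes are (var, low, high) tuples with
    smaller var closer to the root; terminals are False/True."""
    if memo is None:
        memo = {}
    if isinstance(f, bool) and isinstance(g, bool):
        return op(f, g)
    key = (f, g)
    if key in memo:
        return memo[key]
    v = min(var_of(f), var_of(g))
    f0, f1 = (f[1], f[2]) if var_of(f) == v else (f, f)
    g0, g1 = (g[1], g[2]) if var_of(g) == v else (g, g)
    low = apply_op(op, f0, g0, memo)
    high = apply_op(op, f1, g1, memo)
    result = low if low == high else (v, low, high)  # suppress redundant tests
    memo[key] = result
    return result

def evaluate(n, env):
    """Follow high/low edges according to the assignment `env`."""
    while isinstance(n, tuple):
        n = n[2] if env[n[0]] else n[1]
    return n

x0, x1 = (0, False, True), (1, False, True)    # single-variable BDDs
conj = apply_op(lambda a, b: a and b, x0, x1)  # BDD for x0 AND x1
print(conj)   # (0, False, (1, False, True))
```

The memo table plays the role of the per-level deduplication that the parallel version performs across processors: each distinct (f, g) pair is expanded once.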
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234921
S. Otto, M. Wolfe
The authors are researching techniques for programming large-scale parallel machines for scientific computation. They use an intermediate-level language, MetaMP, that sits between High Performance Fortran (HPF) and low-level message passing. They are developing an efficient set of primitives in the intermediate language and are investigating compilation methods that can semi-automatically reason about parallel programs. The focus is on distributed-memory hardware. The work has many similarities with HPF efforts, although their approach is aimed at shorter-term solutions. They plan to keep the programmer centrally involved in the development and optimization of the parallel program.
{"title":"The MetaMP approach to parallel programming","authors":"S. Otto, M. Wolfe","doi":"10.1109/FMPC.1992.234921","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234921","url":null,"abstract":"The authors are researching techniques for the programming of large-scale parallel machines for scientific computation. They use an intermediate-level language, MetaMP, that sits between High Performance Fortran (HPF) and low-level message passing. They are developing an efficient set of primitives in the intermediate language and are investigating compilation methods that can semi-automatically reason about parallel programs. The focus is on distributed memory hardware. The work has many similarities with HPF efforts although their approach is aimed at shorter-term solutions. They plan to keep the programmer centrally involved in the development and optimization of the parallel program.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126346196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}