An overview of the nCUBE 3 supercomputer
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234880
B. Duzett, R. Buck
nCUBE is developing a new family of massively parallel products, the nCUBE 3 systems. These next-generation supercomputers will be the industry's first implementable multi-TeraFLOPS platforms and will be 100% compatible with previous-generation nCUBE systems. The nCUBE 3 family will carry nCUBE's philosophy of high integration and scalability to new, industry-leading levels, offering systems that scale from low-end, entry-level products to high-end, grand-challenge machines. After introducing the nCUBE 3 system, the authors describe its implementation.
{"title":"An overview of the nCUBE 3 supercomputer","authors":"B. Duzett, R. Buck","doi":"10.1109/FMPC.1992.234880","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234880","url":null,"abstract":"nCUBE is developing a new family of massively parallel products-the nCUBE 3 systems. These next-generation supercomputers will be the industry's first implementable multi-TeraFLOPS platforms and will be 100% compatible with previous-generation nCUBE systems. The nCUBE 3 family will carry nCUBE's philosophy of high integration and scalability to new, industry-leading levels, offering systems that scale from low-end, entry-level products to high-end, grand challenge machines. After introducing the nCUBE 3 system, the authors describe the nCUBE 3 system implementation.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114386414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Petri net modeling and analysis of centralized timeout and batching arbitration units
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234937
P. Garske, V. Narasimhan
The authors consider two novel arbitration techniques, timeout and batching arbitration, and establish the validity of their operation using generalized and deterministic Petri net models. After a brief review of Petri net theory and the fundamentals of generalized and deterministic models, Petri net models for the timeout and batching arbitration schemes are presented, followed by a discussion of the simulation results for both schemes. Both arbitration schemes are found to provide a degree of fairness by reducing the resource allocation time, though at the cost of less than complete resource utilization. A hybrid scheme that combines the key features of the batching and timeout schemes is then presented and proven to operate correctly. Simulation of this scheme suggests that, by varying the arbiter parameters in conjunction with the priority of the processors, efficient allocation of system resources can be achieved.
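The paper validates these schemes with Petri net models, which are not reproduced here. As a rough intuition only, the toy Python sketch below mimics the informal behaviour the abstract describes for a single shared resource: a timeout arbiter revokes a grant after a fixed number of cycles so no request can monopolize the resource, while a batching arbiter serves a closed batch containing at most one request per processor before admitting new work. The request mix, the timeout value, and the one-cycle grant overheads are invented assumptions, not the authors' parameters.

```python
# Toy, hypothetical sketch of timeout vs. batching arbitration (not the
# authors' Petri net models). All numeric parameters are made-up assumptions.
import random

def make_requests(n_procs=4, n_reqs=100, seed=1):
    rng = random.Random(seed)
    # (processor id, cycles of service needed), in FIFO arrival order
    return [(rng.randrange(n_procs), rng.randint(1, 8)) for _ in range(n_reqs)]

def timeout_arbiter(reqs, timeout=3, grant_cost=1):
    """FIFO grants, revoked after `timeout` cycles so no request can hog the
    resource; each grant decision costs `grant_cost` idle cycles."""
    queue, clock, busy, finish = list(reqs), 0, 0, []
    while queue:
        proc, need = queue.pop(0)
        clock += grant_cost
        used = min(need, timeout)
        clock += used
        busy += used
        if need > used:
            queue.append((proc, need - used))   # preempted: requeue the remainder
        else:
            finish.append(clock)                # request fully served
    return max(finish), busy / clock            # worst finish time, utilization

def batching_arbiter(reqs, grant_cost=1):
    """Closed batches with at most one request per processor; a single grant
    decision covers the whole batch, and new work waits for the batch to drain."""
    queue, clock, busy, finish = list(reqs), 0, 0, []
    while queue:
        batch, rest, seen = [], [], set()
        for proc, need in queue:
            if proc in seen:
                rest.append((proc, need))
            else:
                batch.append((proc, need))
                seen.add(proc)
        clock += grant_cost
        for proc, need in batch:
            clock += need
            busy += need
            finish.append(clock)
        queue = rest
    return max(finish), busy / clock

reqs = make_requests()
print("timeout :", timeout_arbiter(reqs))   # bounded holding time, more arbitration overhead
print("batching:", batching_arbiter(reqs))  # per-processor fairness, one decision per batch
```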
{"title":"Petri net modeling and analysis of centralized timeout and batching arbitration units","authors":"P. Garske, V. Narasimhan","doi":"10.1109/FMPC.1992.234937","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234937","url":null,"abstract":"The authors consider two novel arbitration techniques, timeout and batching arbitration, and establish the validity of their operations by using generalized and deterministic Petri net models. After a brief review of Petri net theory and the fundamentals of generalized and deterministic models, Petri net models for the timeout and batching arbitration schemes are presented, followed by a discussion of the simulation results of both of these schemes. It is found that both arbitration schemes provide a degree of fairness in that they reduce the resource allocation time but with the lack of complete resource utilization. A hybrid scheme which combines the key features of batching and timeout schemes is then presented and proven to operate correctly. Simulation of this scheme suggests that, by varying the arbiter parameters in conjunction with the priority of the processors, efficient allocation of system resources can be achieved.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114621502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A hyper-pyramid network topology for image processing
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234955
É. Dujardin, M. Akil
The authors describe a novel network topology for image processing, called the hyper-pyramid network topology. The structure is hierarchical, providing local, inside-region communication at each level as well as upward and downward communication through the whole structure. Intraregion communication is illustrated through the study of an image processing algorithm: the authors present the implementation of a component-labeling algorithm on a hyper-pyramid network with a computational complexity of O(log^2 n), the same complexity as on a hypercube network. It is also shown that the wiring complexity is lower than that of the hypercube network.
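For background only, the sketch below spells out the addressing of a standard quad pyramid, the kind of hierarchical structure a hyper-pyramid builds on: level k is a 2^k x 2^k grid, each node has in-level (in-region) mesh neighbours, one parent above and four children below. The hyper-pyramid's additional inter-region links and its specific wiring are not modelled; this is an assumption-laden illustration, not the authors' topology.

```python
# Minimal sketch of standard quad-pyramid addressing (an assumption; the
# paper's hyper-pyramid adds further links on top of this kind of structure).

def neighbours(level, i, j):
    """Intra-level mesh neighbours of node (i, j) on level `level`."""
    side = 2 ** level
    cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(level, a, b) for a, b in cand if 0 <= a < side and 0 <= b < side]

def parent(level, i, j):
    """Upward link: one node on level `level - 1` covers a 2x2 region below."""
    return None if level == 0 else (level - 1, i // 2, j // 2)

def children(level, i, j):
    """Downward links: the 2x2 region of level `level + 1` nodes covered."""
    return [(level + 1, 2 * i + a, 2 * j + b) for a in (0, 1) for b in (0, 1)]

print(parent(3, 5, 6))      # (2, 2, 3)
print(children(2, 2, 3))    # the four level-3 nodes it covers
print(neighbours(3, 0, 7))  # corner and edge nodes have fewer in-region links
```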
{"title":"A hyper-pyramid network topology for image processing","authors":"É. Dujardin, M. Akil","doi":"10.1109/FMPC.1992.234955","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234955","url":null,"abstract":"The authors describe a novel network topology for image processing, called the hyper-pyramid network topology. This structure is hierarchical and implements local, inside-region communications at each level, and upward/downward communications in the whole structure. Intraregion communications are shown by an image processing algorithm study. The authors display the implementation of a component labeling algorithm onto a hyper-pyramid network with a computational complexity of O(log/sup 2/(n)). This complexity is the same as that of the hypercube network. It is also demonstrated that the wiring complexity is less than that of the hypercube network.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123327800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A rank-two divide and conquer method for the symmetric tridiagonal eigenproblem
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234887
K. Gates
A rank-two divide and conquer algorithm is developed for computing the eigensystem of a symmetric tridiagonal matrix. The algorithm is compared with the LAPACK-recommended path for this problem and with the rank-one divide and conquer algorithm. Timing results on a Sequent Symmetry S81b show that the algorithm has potential as a parallel alternative to the QR algorithm.
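The rank-two splitting itself is not reproduced here. For background, the NumPy sketch below checks the classic Cuppen-style rank-one splitting that such divide-and-conquer eigensolvers start from: the tridiagonal matrix is written as two independent tridiagonal blocks plus a rank-one correction, so the halves can be solved in parallel and the results glued back together. The split index and matrix sizes are arbitrary choices for illustration.

```python
# Minimal sketch (NumPy), assuming the classic Cuppen-style rank-one splitting;
# the paper's rank-two variant is a different correction and is not shown here.
import numpy as np

def tridiag(d, e):
    """Build a symmetric tridiagonal matrix from diagonal d and off-diagonal e."""
    return np.diag(d) + np.diag(e, 1) + np.diag(e, -1)

def rank_one_split(d, e, k):
    """Split T into block-diagonal halves plus a rank-one correction:
    T = diag(T1, T2) + b * v v^T, with v = e_k + e_{k+1} and b = e[k]."""
    b = e[k]
    d1, e1 = d[:k + 1].copy(), e[:k]
    d2, e2 = d[k + 1:].copy(), e[k + 1:]
    d1[-1] -= b          # compensate the two diagonal entries touched by b*v*v^T
    d2[0] -= b
    v = np.zeros(len(d))
    v[k] = v[k + 1] = 1.0
    return (d1, e1), (d2, e2), b, v

# Check that the splitting reproduces T exactly.
rng = np.random.default_rng(0)
d, e = rng.standard_normal(8), rng.standard_normal(7)
(d1, e1), (d2, e2), b, v = rank_one_split(d, e, 3)
T_rebuilt = np.block([
    [tridiag(d1, e1), np.zeros((len(d1), len(d2)))],
    [np.zeros((len(d2), len(d1))), tridiag(d2, e2)],
]) + b * np.outer(v, v)
assert np.allclose(T_rebuilt, tridiag(d, e))
```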
{"title":"A rank-two divide and conquer method for the symmetric tridiagonal eigenproblem","authors":"K. Gates","doi":"10.1109/FMPC.1992.234887","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234887","url":null,"abstract":"A rank-two divide and conquer algorithm is developed for calculating the eigensystem of a symmetric tridiagonal matrix. This algorithm is compared to the LAPACK recommended path for this problem and the rank-one divide and conquer algorithm. The timing results on a Sequent Symmetry S81b show that this algorithm has potential as a parallel alternative to the QR algorithm.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126061217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The LINPACK benchmark on the Fujitsu AP 1000
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234897
Richard P. Brent
The author describes an implementation of the LINPACK benchmark on the Fujitsu AP 1000. Design considerations include communication primitives, data distribution, use of blocking to reduce memory references, and effective use of the cache. The LINPACK benchmark results show that the AP 1000 is a good machine for numerical linear algebra, and that one can consistently achieve close to 80 percent of its theoretical peak performance on moderate to large problems. The main reason for this is the high ratio of communication speed to floating-point speed compared to machines such as the Intel Delta and nCUBE 2. The high-bandwidth hardware row/column broadcast capability of the T-net (xbrd, ybrd) and the low latency of the synchronous communication routines are significant.
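The AP 1000 implementation is not reproduced here. As a rough illustration of the blocking idea mentioned above, the NumPy sketch below performs a right-looking blocked LU factorization without pivoting, keeping most of the flops in a cache-friendly rank-nb update; the benchmark proper uses partial pivoting and distributes the blocks across the cells, so treat this purely as a sequential sketch of why blocking reduces memory references.

```python
# Minimal sketch: right-looking blocked LU without pivoting (an illustrative
# simplification; LINPACK itself requires partial pivoting).
import numpy as np

def blocked_lu(A, nb=32):
    """Blocked LU: returns F with unit-lower L below the diagonal and U above.
    Most of the work lands in the rank-nb trailing update, a matrix multiply."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # factor the current panel, column by column (unblocked)
        for j in range(k, k + b):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:k + b] -= np.outer(A[j + 1:, j], A[j, j + 1:k + b])
        # block row of U: solve L11 * U12 = A12
        L11 = np.tril(A[k:k + b, k:k + b], -1) + np.eye(b)
        A[k:k + b, k + b:] = np.linalg.solve(L11, A[k:k + b, k + b:])
        # rank-b update of the trailing submatrix (the dominant, cache-friendly step)
        A[k + b:, k + b:] -= A[k + b:, k:k + b] @ A[k:k + b, k + b:]
    return A

n = 256
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant: safe without pivoting
F = blocked_lu(M)
L, U = np.tril(F, -1) + np.eye(n), np.triu(F)
assert np.allclose(L @ U, M)
```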
{"title":"The LINPACK benchmark on the Fujitsu FAP 1000","authors":"Richard P. Brent","doi":"10.1109/FMPC.1992.234897","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234897","url":null,"abstract":"The author describes an implementation of the LINPACK benchmark on the Fujitsu AP 1000. Design considerations include communication primitives, data distribution, use of blocking to reduce memory references, and effective use of the cache. The LINPACK benchmark results show that the AP 1000 is a good machine for numerical linear algebra, and that one can consistently achieve close to 80 percent of its theoretical peak performance on moderate to large problems. The main reason for this is the high ratio of communication speed to floating-point speed compared to machines such as the Intel Delta and nCUBE 2. The high-bandwidth hardware row/column broadcast capability of the T-net (xbrd, ybrd) and the low latency of the synchronous communication routines are significant.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129414465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel parsing of spoken language
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234933
R. A. Helzerman, M. Harper, C. Zoltowski
The authors extended H. Maruyama's (1990) constraint dependency grammar (CDG) to process a lattice of sentence hypotheses instead of separate test strings. A postprocessor to a speech recognizer producing N-best hypotheses generates the word lattice representation, which is then augmented with information required for parsing. The authors summarize the CDG parsing algorithm and describe how the algorithm is extended to process the lattice on a single-processor machine. They outline the CRCW P-RAM algorithm for parsing the word lattice, which requires O(n^4) processors to parse in O(k+n) time.
{"title":"Parallel parsing of spoken language","authors":"R. A. Helzerman, M. Harper, C. Zoltowski","doi":"10.1109/FMPC.1992.234933","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234933","url":null,"abstract":"The authors extended H. Maruyama's (1990) constraint dependency grammar (CDG) to process a lattice of sentence hypotheses instead of separate test strings. A postprocessor to a speech recognizer producing N-best hypotheses generates the word lattice representation, which is then augmented with information required for parsing. The authors summarize the CDG parsing algorithm and describe how the algorithm is extended to process the lattice on a single processor machine. They outline the CRCW P-RAM algorithm for parsing the word lattice, which requires O(n/sup 4/) processors to parse in O(k+n) time.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121975474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Massively parallel solution of quantum transport problems
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234870
P. Balasingam, V. Roychowdhury
A numerically intensive program for the simulation of quantum transport in small structures has been implemented on a MasPar MP-1. The high degree of parallelism inherent in numerically intensive sections of the problem has been exploited, and devices with realistic dimensions and operating conditions have been investigated.
{"title":"Massively parallel solution of quantum transport problems","authors":"P. Balasingam, V. Roychowdhury","doi":"10.1109/FMPC.1992.234870","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234870","url":null,"abstract":"A numerically intensive program for the simulation of quantum transport in small structures has been implemented on a MasPar MP-1. The high degree of parallelism inherent in numerically intensive sections of the problem has been exploited, and devices with realistic dimensions and operating conditions have been investigated.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122416199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-routing least common ancestor networks
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234867
Chi-Kai Chien, I.D. Scherson
Fat-trees, KYKLOS, baseline and SW-banyan networks, and the TRAC and CM-5 networks belong to a family of networks called least-common-ancestor networks (LCANs). In this paper, attention is restricted to LCANs with identical switches and a uniform stage interconnect. The least common ancestor of two nodes (PEs), A and B, is the node at greatest depth that counts A and B among its descendants: this node corresponds to an LCA switch. Given a source-destination pair, communication progresses upwards to an LCA switch; the stage it belongs to is called the LCA level. Then, routing returns downwards to the destination. Source-destination pairs are connected using as few stages as their degree of mutual locality permits. Network parameters that facilitate this routing are shown.
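As an illustration of the LCA-level idea, the sketch below assumes the simplest possible LCAN, a complete binary tree with PEs at the leaves and a single switch per internal node. Real LCANs such as fat-trees or the CM-5 data network replicate switches at each level, but the up-then-down shape of a route and the notion of the LCA level are the same; the naming of switches by (level, subtree index) is an assumption made for this sketch.

```python
# Minimal sketch: routing in a complete binary-tree LCAN with PEs at the leaves.
# Real LCANs replicate switches per stage; only the LCA-level idea is shown.

def lca_level(src: int, dst: int) -> int:
    """Number of stages the message must climb: the position of the highest
    bit in which the two PE addresses differ (0 means src == dst)."""
    return (src ^ dst).bit_length()

def route(src: int, dst: int):
    """Climb to the LCA switch, then descend to the destination PE.
    A switch is named (level, subtree index); the PEs sit at level 0."""
    top = lca_level(src, dst)
    up = [(lvl, src >> lvl) for lvl in range(1, top + 1)]
    down = [(lvl, dst >> lvl) for lvl in range(top - 1, 0, -1)]
    return up + down

# PEs 5 (0b0101) and 7 (0b0111) differ only in bit 1, so their LCA sits 2 levels up;
# PEs 5 and 13 (0b1101) differ in bit 3, so traffic must climb all the way to level 4.
print(lca_level(5, 7))    # 2
print(lca_level(5, 13))   # 4
print(route(5, 7))        # [(1, 2), (2, 1), (1, 3)]
```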
{"title":"Self-routing least common ancestor networks","authors":"Chi-Kai Chien, I.D. Scherson","doi":"10.1109/FMPC.1992.234867","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234867","url":null,"abstract":"Fat-trees, KYKLOS, baseline and SW-banyan networks, and the TRAC and CM-5 networks belong to a family of networks called least-common-ancestor networks (LCANs). In this paper, attention is restricted to LCANs with identical switches and a uniform stage interconnect. The least common ancestor of two nodes (PEs), A and B, is the node at greatest depth that counts A and B among its descendants: this node corresponds to an LCA switch. Given a source-destination pair, communication progresses upwards to an LCA switch; the stage that it belongs to is called the LCA level. Then, routing returns downwards to the destination. Source-destination pairs are connected using as few stages as their degree of mutual locality permits. Network parameters that facilitate this routing are shown.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121024626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A CPU utilization limit for massively parallel MIMD computers
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234902
T. Bridges, S. W. Kitchel, R.M. Wehrmeister
Massively parallel computer systems based on off-the-shelf CPU chip-sets have become commercially available. The authors demonstrate a theoretical limit on the silicon (or other circuitry media) utilization of such architectures as the number of processors is scaled up. In addition, case studies of the Thinking Machines Corporation CM-5 and of the Intel Touchstone are presented in order to quantify the maximum utilization of existing machines. Based on this utilization limit, the authors examine whether computer architects' current reliance on the MIMD (multiple-instruction multiple-data) model will be practical in next-generation machines. To facilitate the analysis, they decouple the control-parallel and data-parallel models of computation from MIMD and SIMD (single-instruction multiple-data) target platforms, respectively. The utilization of control-parallel paradigms executing on SIMD platforms is introduced for comparison. The authors also consider how communication overhead relates to machine-size scaling when virtual processing nodes are required.
{"title":"A CPU utilization limit for massively parallel MIMD computers","authors":"T. Bridges, S. W. Kitchel, R.M. Wehrmeister","doi":"10.1109/FMPC.1992.234902","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234902","url":null,"abstract":"Massively parallel computer systems based on off-the-shelf CPU chip-sets have become commercially available. The authors demonstrate a theoretical limit on the silicon (or other circuitry media) utilization of such architectures as the number of processors is scaled up. In addition, case studies of the Thinking Machines Corporation CM-5 and of the Intel Touchstone are presented in order to quantify the maximum utilization on existing machines. Based on this utilization limit, the authors examine whether computer architects' current reliance on the MIMD (multiple-instruction multiple-data) model will be practical in next-generation machines. In order to facilitate the analysis, they decouple the control parallel and data parallel models of computation from MIMD and SIMD (single-instruction multiple-data) target platforms, respectively. Utilization of control parallel paradigms executing on SIMD platforms is introduced for comparison. The authors also consider the relationship of communication overhead to machine size scaling in the presence of the need for virtual processing nodes.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127603650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Routing algorithms on a mesh-connected computer
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234863
Q. Gu, J. Gu
The authors present two algorithms for the 1-1 routing problem on a mesh-connected computer. The first algorithm, with a queue size of 28, solves the 1-1 routing problem on an n*n mesh-connected computer in 2n+O(1) steps; this improves on the previous result, which required a queue size of 75. The second algorithm solves the problem in 2n-2 steps with queue size 12*t_s/s, where t_s is the time for sorting an s*s mesh into row-major order, for all s >= 1; this improves on the previous bound of 18.67*t_s/s. Both algorithms have important applications in reducing the hardware cost of a mesh-connected computer.
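The constant-queue algorithms themselves are not reproduced here. For context, the toy sketch below simulates the classic greedy row-then-column routing that such results improve on: each packet first corrects its column, then its row, with at most one packet crossing each directed link per step. For a random permutation the queues usually stay modest, but adversarial permutations can force queues of size Theta(n) at the turning nodes, which is exactly what bounded-queue algorithms like the paper's avoid. The mesh size, seed, and FIFO-style queueing discipline are arbitrary choices for illustration.

```python
# Toy simulation of greedy row-then-column (XY) routing on an n*n mesh;
# not the paper's algorithms, only the baseline scheme they improve on.
import random
from collections import defaultdict

def sign(x):
    return (x > 0) - (x < 0)

def next_hop(pos, dest):
    (r, c), (dr, dc) = pos, dest
    if c != dc:
        return (r, c + sign(dc - c))   # row phase: fix the column first
    return (r + sign(dr - r), c)       # column phase

def greedy_route(n=16, seed=0):
    rng = random.Random(seed)
    sources = [(r, c) for r in range(n) for c in range(n)]
    dests = sources[:]
    rng.shuffle(dests)                              # a random 1-1 routing instance
    here = {s: [d] for s, d in zip(sources, dests) if s != d}
    steps = max_queue = 0
    while here:
        steps += 1
        max_queue = max(max_queue, max(len(q) for q in here.values()))
        nxt = defaultdict(list)
        for node, packets in here.items():
            taken = set()                           # one packet per outgoing link per step
            for d in packets:
                hop = next_hop(node, d)
                if hop in taken:
                    nxt[node].append(d)             # blocked: wait another step
                else:
                    taken.add(hop)
                    if hop != d:
                        nxt[hop].append(d)          # forwarded (delivered packets vanish)
        here = {k: v for k, v in nxt.items() if v}
    return steps, max_queue

print(greedy_route())   # (step count, largest node queue) for one random permutation
```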
{"title":"Routing algorithms on a mesh-connected computer","authors":"Q. Gu, J. Gu","doi":"10.1109/FMPC.1992.234863","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234863","url":null,"abstract":"The authors present two algorithms for the 1-1 routing problems on a mesh-connected computer. The first algorithm, with a queue size of 28, solves the 1-1 routing problem on an n*n mesh-connected computer in 2n+O(1) steps. This improves the previous result of queue size 75. The second algorithm solves the problem in 2n-2 steps with queue size 12t/sub s//s where t/sub s/ is the time for sorting an s*s mesh into a row major order for all s>or=1. This result improves the previous result of size 18.67 t/sub s//s. Both algorithms have important applications in reducing the hardware cost on a mesh-connected computer.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132000797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}