A compiler for a massively parallel distributed memory MIMD computer
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234910
G. Sabot
The author describes the techniques used by the CM Compiler Engine to map the fine-grained array parallelism of languages such as Fortran 90 and C onto the Connection Machine (CM) architectures. The same compiler is used for node-level programming of the CM-5, for global programming of the CM-5, and for global programming of the SIMD (single-instruction multiple-data) CM-2. A new compiler phase generates two classes of output code: code for a scalar control processor, which executes SPARC assembler, and code aimed at a model of the CM-5's parallel-processing elements. The model is embodied in a new RISC (reduced instruction set computer)-like vector instruction set called PEAC. The control program distributes parallel data at runtime among the processor nodes of the target machine. Each of these nodes is itself superpipelined and superscalar. An innovative scheduler overlaps the execution of multiple PEAC operations, while conventional vector processing techniques keep the pipelines filled.
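As a rough illustration of the split the abstract describes (the PEAC instruction set and the real CM-5 code generation are not shown here), an elementwise array statement such as C = A + B might be lowered into a control loop that strides over the locally held slice of each array and issues node-level vector operations; `vec_add`, the chunking scheme, and all names below are hypothetical.

```c
#include <stddef.h>

/* Hypothetical node-level vector primitive standing in for a PEAC-like
 * vector instruction: c[i] = a[i] + b[i] over one vector-length chunk. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Sketch of scalar control code for the array statement  C = A + B,
 * applied to the slice of the arrays that runtime data distribution
 * placed on this processing node. */
void node_array_add(const float *a, const float *b, float *c,
                    size_t local_len, size_t vec_len) {
    for (size_t off = 0; off < local_len; off += vec_len) {
        size_t chunk = (local_len - off < vec_len) ? local_len - off : vec_len;
        vec_add(a + off, b + off, c + off, chunk);
    }
}
```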
{"title":"A compiler for a massively parallel distributed memory MIMD computer","authors":"G. Sabot","doi":"10.1109/FMPC.1992.234910","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234910","url":null,"abstract":"The author describes the techniques that are used by the CM Compiler Engine to map the fine-grained array parallelism of languages such as Fortan 90 and C onto the Connection Machine (CM) architectures. The same compiler is used for node-level programming of the CM-5, for global programming of the CM-5, and for global programming of the SIMD (single-instruction multiple-data) CM-2. A new compiler phase is used to generate two classes of output code: code for a scalar control processor, which executes SPARC assembler, and code aimed at a model of the CM-5's parallel-processing elements. The model is embodied in a new RISC (reduced instruction set computer)-like vector instruction set called PEAC. The control program distributes parallel data at runtime among the processor nodes of the target machine. Each of these nodes is itself superpipelined and superscalar. An innovative scheduler overlaps the execution of multiple PEAC operations, while conventional vector processing techniques keep the pipelines filled.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132672830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Grimm collection of MIMD fairy tales
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234881
T. Blank, J. Nickolls
The authors present two tales about massively parallel processors: 'Who is Fairest of Us All?' and 'The SPMD Path.' With a twist of humor, the tales discuss single-instruction multiple-data (SIMD) systems, multiple-instruction multiple-data (MIMD) systems, the differences between them, and the single-program multiple-data (SPMD) programming model. The first tale introduces autonomous SIMD (ASIMD) and then looks at the flexibility, programmability, cost, and effectiveness of MIMD and ASIMD systems. It is shown that ASIMD systems have the flexibility to solve real applications cost-effectively. The second tale describes the simple path that SPMD provides for programming, and why an ASIMD machine works well.
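For readers unfamiliar with the SPMD model the second tale discusses, the sketch below shows the general idea: one program runs on every node and behavior differs only through the node's identity and its local data. The runtime hooks `my_node_id`, `num_nodes`, and `global_sum` are hypothetical placeholders, not an API from the paper.

```c
#include <stdio.h>

/* Hypothetical runtime hooks; any SPMD runtime provides equivalents
 * (node rank, node count, a global reduction primitive). */
extern int    my_node_id(void);
extern int    num_nodes(void);
extern double global_sum(double local_value);

/* Every node executes this same program (SPMD). */
int spmd_main(const double *local_data, int local_n) {
    double partial = 0.0;
    for (int i = 0; i < local_n; i++)   /* each node works on its own slice */
        partial += local_data[i];

    double total = global_sum(partial); /* all nodes join the reduction */

    if (my_node_id() == 0)              /* one node handles output */
        printf("sum over %d nodes: %f\n", num_nodes(), total);
    return 0;
}
```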
{"title":"A Grimm collection of MIMD fairy tales","authors":"T. Blank, J. Nickolls","doi":"10.1109/FMPC.1992.234881","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234881","url":null,"abstract":"The authors present two tales about massively parallel processors: 'Who is Fairest of Us All?' and 'The SPMD Path.' With a twist of humor, the tales discuss single-instruction multiple-data systems (SIMD), multiple-instruction multiple-data (MIMD) systems, differences, and the single program multiple data (SPMD) programming model. The first tale introduces autonomous SIMD (ASIMD), and then looks at the flexibility, programmability, cost, and effectiveness of MIMD and ASIMD systems. It is shown that ASIMD systems have the flexibility to solve real applications cost-effectively. The second tale describes the simple path that SPMD provides for programming, and why an ASIMD machine works well.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"57 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120853923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecture independent analysis of sorting and list ranking on the hierarchical PRAM model
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234932
T. Heywood, S. Ranka
The authors consider the performance of sorting and list ranking on the hierarchical parallel random access machine (H-PRAM), a model of computation that represents general degrees of locality (neighborhoods of activity) while accounting for communication and synchronization simultaneously. The sorting result is a significant improvement over that for the LPRAM (local-memory PRAM, i.e., unit-size neighborhoods); it matches the best known hypercube algorithms when the H-PRAM's latency parameter l(P) is set to log P, and matches the best possible mesh algorithm when l(P) = sqrt(P). The list ranking algorithm demonstrates fundamental limitations of the H-PRAM for nonoblivious problems that have linear-time sequential algorithms.
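The two latency-parameter settings quoted above can be written out explicitly; nothing here goes beyond what the abstract states.

```latex
\[
  \ell(P) = \log P \ \text{(hypercube-like locality)},
  \qquad
  \ell(P) = \sqrt{P} \ \text{(mesh-like locality)}.
\]
```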
{"title":"Architecture independent analysis of sorting and list ranking on the hierarchical PRAM model","authors":"T. Heywood, S. Ranka","doi":"10.1109/FMPC.1992.234932","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234932","url":null,"abstract":"The authors consider the performance of sorting and list ranking on the hierarchical parallel random access machine (H-PRAM), a model of computation which represents general degrees of locality (neighborhoods of activity), considering communication and synchronization simultaneously. The sorting result gives a significant improvement over that for the LPRAM (local-memory PRAM, i.e. unit-size neighborhoods), matches the best known hypercube algorithms when the H-PRAM's latency parameter l(P) is set to log P, and matches the best possible mesh algorithm when l(P)= square root P. The list ranking algorithm demonstrates fundamental limitations of the H-PRAM for nonoblivious problems which have linear-time sequential algorithms.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116359738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel pulse correlation and geolocation
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234929
D.K. Krecker, W. Mitchell
Identifying and locating ground-based radars via orbiting receivers requires the correlation of pulses, the determination of time differences of arrival, and geolocation. Data rates in emitter-rich environments would swamp single-CPU processors performing this operation. The authors present an innovative parallel algorithm developed specifically for this application on massively parallel computers. The algorithm is based on the parallel computation and analysis of a matrix containing the differences in the time of arrival of all pulses received in a time window, and on the parallel proof or disproof of hypothesized emitter locations. The output contains the number of emitters together with their locations and PRI (pulse repetition interval) sequences. The algorithm was tested on a 16K-processor Connection Machine.
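The core data structure the abstract mentions, a matrix of pairwise time-of-arrival differences for all pulses in a window, is easy to make concrete. The serial sketch below only builds that matrix (on a massively parallel machine each entry is independent and would be computed by its own processing element); the struct and function names are illustrative, not from the paper.

```c
#include <stdlib.h>

/* Times of arrival (seconds) for the pulses received in the current window. */
typedef struct {
    const double *toa;   /* pulse times of arrival          */
    int           n;     /* number of pulses in the window  */
} pulse_window;

/* Fill diff[i*n + j] = toa[i] - toa[j] for every pulse pair.  Each (i, j)
 * entry is independent, so the whole matrix can be computed in parallel. */
double *toa_difference_matrix(const pulse_window *w) {
    int n = w->n;
    double *diff = malloc((size_t)n * n * sizeof *diff);
    if (!diff) return NULL;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            diff[i * n + j] = w->toa[i] - w->toa[j];
    return diff;
}
```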
{"title":"Parallel pulse correlation and geolocation","authors":"D.K. Krecker, W. Mitchell","doi":"10.1109/FMPC.1992.234929","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234929","url":null,"abstract":"The identification and location of ground-based radars via orbiting receivers require the correlation of pulses, the determination of time differences of arrival, and geolocation. Data rates in emitter-rich environments would swamp single-CPU processors performing this operation. The authors present an innovative parallel algorithm developed specifically for this application on massively parallel computers. The algorithm is based on the parallel computation and analysis of a matrix containing the differences in the time of arrival of all pulses received in a time window, and on the parallel proof/disproof of hypothesized emitter locations. Output contains the number of emitters and their location and PRI (pulse repetition interval) sequence. The algorithm was tested on a 16 K processor Connection Machine.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126104879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Massively parallel computers: why not parallel computers for the masses?
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234946
G. Bell
Developments in high-performance computers toward the goal of a teraflops supercomputer, one operating at a peak speed of 10^12 floating-point operations per second, are reviewed. The net result of the quest for parallelism, as chronicled by the Gordon Bell Prize, is that application performance has grown at 115% per year and will most likely reach 1 teraflop in 1995. The physical characteristics of the supercomputing alternatives available in 1992 are described, and the progress of CMOS microprocessor technology toward teraflop speeds is discussed. It is argued that mainline general-purpose computers will continue to be built from microprocessors in three forms: supercomputers, mainframes, and scalable MPs. The current scalable multicomputers will all evolve into multiprocessors, though with limited coherent memories, in their next generation. It is also argued that the cost and time needed to rewrite major applications for one-of-a-kind machines is large enough to make such machines uneconomical.
Hyperbanyan networks: a new class of networks for distributed-memory multiprocessors
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234951
Clayton Ferner, K. Y. Lee
A new class of connection topologies for distributed-memory multiprocessors, hyperbanyan networks, is introduced. A hyperbanyan combines the topological designs of the banyan and hypertree networks. Since the hypertree combines the advantages of the binary tree and the hypercube, a hyperbanyan has features of a binary tree, a hypercube, and a banyan. Hyperbanyans have a fixed degree of five, and the diameter of an n-stage hyperbanyan with 2^(n-1) nodes per stage is 2(n-1). A routing algorithm that is close to optimal is presented.
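In formula form, the parameters quoted above (and only those) are:

```latex
\[
  N = n \cdot 2^{\,n-1} \ \text{nodes} \ (n \ \text{stages},\ 2^{n-1} \ \text{nodes per stage}),
  \qquad
  \text{degree} = 5,
  \qquad
  \text{diameter} = 2(n-1).
\]
```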
{"title":"Hyperbanyan networks: a new class of networks for distributed-memory multiprocessors","authors":"Clayton Ferner, K. Y. Lee","doi":"10.1109/FMPC.1992.234951","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234951","url":null,"abstract":"A new class of connection topologies for distributed-memory multiprocessors, hyperbanyan networks, is introduced. A hyperbanyan is a combination of the topological designs of a banyan and the hypertree networks. Since the hypertree combines the advantages of the binary tree and the hypercube, a hyperbanyan has the features of a binary tree, a hypercube, and a banyan. The hyperbanyans have a fixed degree of five, and the diameter of an (n stage*2/sup n-1/ nodes/stage) hyperbanyan is 2(n-1). A routing algorithm which is close to optimal is presented.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126990433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic precision iterative algorithms
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234930
D. Kramer, I. Scherson
The authors address the use of DP (dynamic precision) in fixed-point iterative numerical algorithms, which are used in a wide range of numerically intensive scientific applications. One such algorithm, Muller's method, finds complex roots of an arbitrary function. This algorithm was implemented in DP on various architectures, including a MasPar MP-1 massively parallel processor and a Cray Y-MP vector processor. The results show that the use of DP can lead to a significant speedup of iterative algorithms on multiple-range architectures.
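Muller's method itself is standard; the plain fixed-precision C99 sketch below shows the iteration being accelerated. It is not the authors' dynamic-precision implementation, and `f` is an arbitrary user-supplied function, as in the abstract.

```c
#include <complex.h>

/* Standard Muller's method: fit a quadratic through the last three iterates
 * and take the nearer root as the next iterate.  Handles complex roots.
 * Fixed precision here; the paper's contribution is running such iterations
 * with dynamic precision. */
double complex muller(double complex (*f)(double complex),
                      double complex x0, double complex x1, double complex x2,
                      int max_iter, double tol) {
    for (int k = 0; k < max_iter; k++) {
        double complex fx0 = f(x0), fx1 = f(x1), fx2 = f(x2);
        double complex q   = (x2 - x1) / (x1 - x0);
        double complex A   = q * fx2 - q * (1 + q) * fx1 + q * q * fx0;
        double complex B   = (2 * q + 1) * fx2 - (1 + q) * (1 + q) * fx1 + q * q * fx0;
        double complex C   = (1 + q) * fx2;
        double complex d   = csqrt(B * B - 4 * A * C);
        /* choose the sign that gives the larger denominator to avoid cancellation */
        double complex den = (cabs(B + d) > cabs(B - d)) ? B + d : B - d;
        double complex x3  = x2 - (x2 - x1) * 2 * C / den;
        if (cabs(x3 - x2) < tol) return x3;
        x0 = x1; x1 = x2; x2 = x3;
    }
    return x2;
}
```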
{"title":"Dynamic precision iterative algorithms","authors":"D. Kramer, I. Scherson","doi":"10.1109/FMPC.1992.234930","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234930","url":null,"abstract":"The authors address the use of DP (dynamic precision) in fixed point iterative numerical algorithms. These algorithms are used in a wide range of numerically intensive scientific applications. One such algorithm, Muller's method, detects complex roots of an arbitrary function. This algorithm was implemented in DP on various architectures, including a MasPar MP-1 massively parallel processor and a Cray Y-MP vector processor. The results show that the use of DP can lead to a significant speedup of iterative algorithms on multiple-range architectures.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128113770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A fast algorithm for computing histograms on a reconfigurable mesh
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234952
J. Jang, H. Park, V. Prasanna
The authors present fast parallel algorithms for computing the histogram on the PARBUS and RMESH models. Compared with the approach of J. Jeng and S. Sahni (1992), the proposed algorithm improves the time complexity by using a constant amount of memory in each processing element. In the histogram modification algorithm, the entire range of h is considered. The connections used by the proposed algorithm on the PARBUS model are the same as those allowed in the MRN model, so the algorithm runs on that model as well. The results imply that the number of 1's in an N*N 0/1 table can be counted in O(log* N) time on an N*N reconfigurable mesh and in O(log log N) time on an N*N RMESH.
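For reference, the serial baseline below defines what is being computed, assuming the usual image-histogram setting (an N*N table of values in the range [0, h)); it is not the reconfigurable-mesh algorithm, which holds one table entry per processing element and counts via bus reconfiguration.

```c
#include <string.h>

/* Serial histogram baseline: count how often each value in [0, h) occurs
 * in an n x n table stored row-major in image[]. */
void histogram(const int *image, int n, int h, int *hist) {
    memset(hist, 0, (size_t)h * sizeof *hist);
    for (int i = 0; i < n * n; i++)
        hist[image[i]]++;
}
```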
{"title":"A fast algorithm for computing histograms on a reconfigurable mesh","authors":"J. Jang, H. Park, V. Prasanna","doi":"10.1109/FMPC.1992.234952","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234952","url":null,"abstract":"The authors present fast parallel algorithms for computing the histogram on PARBUS and RMESH models. Compared with the approach of J. Jeng and S. Sahni (1992), the proposed algorithm improves the time complexity by using a constant amount of memory in each processing element. In the histogram modification algorithm, the entire range of h is considered. The connections used by the proposed algorithm on the PARBUS model are same as those allowed in the MRN model. Thus, this algorithm runs on this model as well. The results obtained imply that the number of 1's in a N*N 0/1 table can be counted in O(log* N) time on an N*N reconfigurable mesh and in O(log log N) time on an N*N RMESH.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127628799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hardware support for the Seamless programming model
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234939
S. Fineberg, T. Casavant, B. H. Pease
The communication latency problem is presented, with special emphasis on RISC (reduced instruction set computer) based multiprocessors. An interprocessor communication model for parallel programs based on locality is presented. This model enables the programmer to manipulate locality at the language level and to take advantage of currently available system hardware to reduce latency. A hardware node architecture that supports this model is presented for Seamless, a latency-tolerant RISC-based multiprocessor. The Seamless architecture adds a hardware locality manager to each processing element and includes an integral runtime environment and compiler.
{"title":"Hardware support for the Seamless programming model","authors":"S. Fineberg, T. Casavant, B. H. Pease","doi":"10.1109/FMPC.1992.234939","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234939","url":null,"abstract":"The communication latency problem is presented with special emphasis on RISC (reduced instruction set computer) based multiprocessors. An interprocessor communication model for parallel programs based on locality is presented. This model enables the programmer to manipulate locality at the language level and to take advantage of currently available system hardware to reduce latency. A hardware node architecture for a latency-tolerant RISC-based multiprocessor, called Seamless, that supports this model, is presented. The Seamless architecture includes the addition of a hardware locality manager to each processing element, as well as an integral runtime environment and compiler.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126524415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ALFA: a static data flow architecture
Pub Date: 1992-10-19 | DOI: 10.1109/FMPC.1992.234943
L. Verdoscia, R. Vaccaro
The authors present the ALFA architecture, a data flow machine with 16384 functional units (FUs) grouped into 128 clusters. ALFA is based on the Backus FFP computational model and uses the static data flow execution model. The machine's behavior is deterministic and asynchronous; consequently, after compile time, instructions and data are no longer related. Even though its behavior is deterministic, no control tokens are generated during the computation, only data tokens. Furthermore, during the execution phase, no memory is required to hold the partial results exchanged among FUs. A cluster with 128 FUs has been simulated, and some results are presented.
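As a minimal sketch of the static data flow execution model mentioned above (the general model only, not ALFA's FU or cluster design): a node fires when all of its input data tokens are present, consumes them, and emits a result token, with no separate control tokens. All names below are illustrative.

```c
#include <stdbool.h>

/* A two-input static-dataflow node.  In the static model each input arc
 * holds at most one token; the node is enabled when both are present. */
typedef struct {
    double in[2];
    bool   present[2];
    double (*op)(double, double);   /* the node's function, e.g. add or mul */
} df_node;

/* Deliver a data token to one input port of the node. */
void put_token(df_node *n, int port, double value) {
    n->in[port] = value;
    n->present[port] = true;
}

/* Fire the node if enabled: consume the inputs and produce a result token.
 * Returns true and writes *out only when the node actually fired. */
bool try_fire(df_node *n, double *out) {
    if (!(n->present[0] && n->present[1]))
        return false;                       /* not all operands have arrived */
    *out = n->op(n->in[0], n->in[1]);
    n->present[0] = n->present[1] = false;  /* tokens are consumed on firing */
    return true;
}
```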
{"title":"ALFA: a static data flow architecture","authors":"L. Verdoscia, R. Vaccaro","doi":"10.1109/FMPC.1992.234943","DOIUrl":"https://doi.org/10.1109/FMPC.1992.234943","url":null,"abstract":"The authors present the ALFA architecture, a data flow machine with 16384 functional units (FUs) grouped in 128 clusters. ALFA is based on the Backus FFP computational model and uses the static data flow execution model. This machine's behavior is deterministic and asynchronous. Consequently, after compile time, instructions and data are no longer related. In this machine, even though its behavior is deterministic, no control token is generated during the computation, but only data tokens. Furthermore, during the execution phase, no memory is required to contain the partial results exchanged among FUs. A cluster with 128 FUs has been simulated, and some results are presented.<<ETX>>","PeriodicalId":117789,"journal":{"name":"[Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126529044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}