Pub Date: 1990-04-08 | DOI: 10.1109/DMCC.1990.555388
"Load balanced sort on hypercube multiprocessors"
B. Abali, F. Ozguner, A. Bataineh
A parallel algorithm is given for sorting n elements evenly distributed over the 2^d = p nodes of a d-dimensional hypercube. The algorithm ensures that every node receives an equal number of elements (n/p) at the end, regardless of the skew in the data distribution.
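The guarantee stated above — every node finishes with exactly n/p elements, however skewed the key values are across nodes — can be illustrated with a small sequential sketch. This is a hypothetical illustration of the end condition only, not the paper's hypercube algorithm; the function name and simulation of nodes as plain lists are assumptions.

```python
def load_balanced_sort(local_lists):
    """Simulate the end state of a load-balanced sort on p = len(local_lists)
    nodes: globally sort all elements, then hand each node exactly n/p of
    them, regardless of how the key values were skewed across the inputs."""
    p = len(local_lists)
    merged = sorted(x for lst in local_lists for x in lst)
    n = len(merged)
    assert n % p == 0, "n must be divisible by p, as in the paper's setting"
    chunk = n // p
    return [merged[i * chunk:(i + 1) * chunk] for i in range(p)]

# Skewed input: node 0 holds only small keys, node 1 only large ones.
parts = load_balanced_sort([[1, 2, 3, 4], [90, 91, 92, 93],
                            [5, 6, 7, 8], [50, 51, 52, 53]])
```

Each of the four simulated nodes ends with exactly four elements, and concatenating the per-node lists yields the globally sorted sequence.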
Pub Date: 1990-04-08 | DOI: 10.1109/DMCC.1990.556328
"Experiences with Bilingual Parallel Programming"
I. Foster, R. Overbeek
Parallel programming requires tools that simplify the expression of complex algorithms, provide portability across different classes of machine, and allow reuse of existing sequential code. We have previously proposed bilingual programming as a basis for such tools. In particular, we have proposed the use of a high-level concurrent programming language (such as Strand) to construct parallel programs from (possibly pre-existing) sequential components. We report here on an applications study intended to evaluate the effectiveness of this approach. We describe experiences developing both new codes and parallel versions of existing codes in computational biology, weather modeling, and automated reasoning. We find that the bilingual approach encourages the development of parallel programs that perform well, are portable, and are easy to maintain.
Pub Date: 1990-04-08 | DOI: 10.1109/DMCC.1990.556335
"Design of a Communication Modeling Tool for Debugging Parallel Programs"
J. Francioni, M. Gach
This paper describes a system tool designed for debugging the interprocess communication of a message-passing parallel program. The tool includes an interactive environment that helps the user generate a graphical display of the program-in-question's expected communication behavior. This graph is considered to be the program's communication model. The debugging tool then runs the real program and compares the aforementioned model to the program's actual communication behavior determined at run time. The results of the comparison are displayed via a graphical animation based on the model graph. The debugging tool provides the user with a mechanism for directing a debugging session based on the user's mental abstractions of a program's communication structure. Additionally, the communication model can be designed for any level of the program, allowing the user to debug the program in a top-down fashion.
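The comparison step at the heart of such a tool — checking an observed run-time trace against the user's model graph — can be sketched as set operations over directed (sender, receiver) pairs. The function and field names below are hypothetical, not the tool's actual API.

```python
def compare_communication(model_edges, observed_edges):
    """Compare a program's expected communication graph (the model) with the
    (sender, receiver) pairs actually observed at run time, returning the
    discrepancies a debugger could then animate."""
    model, observed = set(model_edges), set(observed_edges)
    return {
        "missing": model - observed,     # modeled but never seen
        "unexpected": observed - model,  # seen but not modeled
    }

# Model: a ring 0 -> 1 -> 2 -> 0.  Observed: node 2 wrongly sent to node 1.
report = compare_communication(
    model_edges=[(0, 1), (1, 2), (2, 0)],
    observed_edges=[(0, 1), (1, 2), (2, 1)],
)
```

For this trace the report flags the ring edge (2, 0) as missing and the stray message (2, 1) as unexpected, which is exactly the kind of mismatch a model-based debugging session would surface.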
Pub Date: 1990-04-08 | DOI: 10.1109/DMCC.1990.556382
"Numerical Simulations of Dynamically Triangulated Random Surfaces on Parallel Computers with 100% Speedup"
C. Baillie, Roy D. Williams
We are currently performing large-scale numerical simulations of dynamically triangulated random surfaces on several parallel computers. Herein we briefly explain the importance of random surface simulations and describe in detail our computer program that simulates such surfaces with extrinsic curvature in arbitrary dimension. As this program is an ideal benchmark of the scalar performance of a computer, we also present performance measurements for it on several sequential and parallel machines.
Pub Date: 1990-04-08 | DOI: 10.1109/DMCC.1990.556279
"Mapping and Compiled Communication on the Connection Machine System"
E. Dahl
Large processing speeds may be achieved by coordinating the work of many processors in a distributed memory architecture. For most applications, this approach mandates the communication of data amongst the distributed memories, and the cost of this communication can offset the advantage brought by massively parallel processing. We describe an optimization strategy that addresses this problem by dramatically reducing communication costs on the Connection Machine system.
Pub Date: 1990-04-08 | DOI: 10.1109/DMCC.1990.556401
"Embedding A Pyramid On The Hypercube With Minimal Routing Load"
R. Sen
The problem of embedding a pyramid on a Boolean hypercube has been addressed. A maximal set of edges of the pyramid having an image edge in the hypercube is found. This is based on a breadth-first search that embeds a maximal bipartite subgraph of the pyramid. It has been shown that for a pyramid 70% of its edges may always have image edges in the hypercube. These edges may be statically mapped. This would reduce run-time routing load in the hypercube computer considerably.
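The figure of merit in the abstract — the fraction of guest-graph edges that have image edges in the hypercube — is easy to state in code: an embedded edge is "statically mappable" exactly when its endpoint images differ in one bit. The checker below evaluates any candidate mapping; the 4-cycle example is a toy, not the paper's BFS embedding of a pyramid.

```python
def dilation_one_fraction(edges, mapping):
    """Fraction of guest-graph edges whose images are hypercube edges,
    i.e. whose mapped node addresses differ in exactly one bit."""
    hits = sum(1 for u, v in edges
               if bin(mapping[u] ^ mapping[v]).count("1") == 1)
    return hits / len(edges)

# A 4-cycle embedded into a 2-cube by a Gray-code labeling: every cycle edge
# maps to a hypercube edge, so the fraction is 1.0.
cycle_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
gray = {"a": 0b00, "b": 0b01, "c": 0b11, "d": 0b10}
frac = dilation_one_fraction(cycle_edges, gray)
```

For a pyramid, the paper's claim is that a breadth-first embedding keeps this fraction at 70% or better; edges outside that set must be routed over multiple hypercube links at run time.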
Pub Date: 1990-04-08 | DOI: 10.1109/DMCC.1990.555400
"Linear Algebra for Dense Matrices on a Hypercube"
M. P. Sears
A set of routines has been written for dense matrix operations optimized for the NCUBE/6400 parallel processor. This work was motivated by a Sandia effort to parallelize certain electronic structure calculations [1]. Routines are included for matrix transpose, multiply, Cholesky decomposition, triangular inversion, and Householder tridiagonalization. The library is written in C and is callable from Fortran. Matrices up to order 1600 can be handled on 128 processors. For each operation, the algorithm used is presented along with typical timings and estimates of performance. Performance for order 1600 on 128 processors varies from 42 MFLOPS (Householder tridiagonalization, triangular inverse) up to 126 MFLOPS (matrix multiply). We also present performance results for communications and basic linear algebra operations (saxpy and dot products).
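Of the routines listed, Cholesky decomposition is the easiest to sketch. The following is the standard sequential kernel in plain Python, shown only to make the operation concrete; the paper's distributed NCUBE version (and its C/Fortran interface) is not reproduced here.

```python
import math

def cholesky(a):
    """Return the lower-triangular factor L with a = L * L^T for a symmetric
    positive-definite matrix given as a list of rows.  The hypercube
    distribution of columns used on the NCUBE is out of scope here."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(a[i][i] - s)   # diagonal entry
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]  # below-diagonal entry
    return L

L = cholesky([[4.0, 2.0], [2.0, 3.0]])
```

For the 2x2 example, L is [[2, 0], [1, sqrt(2)]], and multiplying L by its transpose recovers the input matrix.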
Pub Date: 1990-04-08 | DOI: 10.1109/DMCC.1990.555404
"Solution of Periodic Tridiagonal Linear Systems on a Hypercube"
T. Taha
(No abstract available.)
Pub Date: 1990-04-08 | DOI: 10.1109/DMCC.1990.555418
"A 2D Electrostatic PIC Code for the Mark III Hypercube"
R. Ferraro, P. Liewer, V. Decyk
(No abstract available.)
Pub Date: 1990-04-08 | DOI: 10.1109/DMCC.1990.556381
"Performance Results on the Intel Touchstone Gamma Prototype"
D. Bailey, E. Barszcz, R. Fatoohi, H. Simon, S. Weeratunga

This paper describes the Intel Touchstone Gamma Prototype, a distributed-memory MIMD parallel computer based on the new Intel i860 floating-point processor. With 128 nodes, this system has a theoretical peak performance of over seven GFLOPS. This paper presents some initial performance results on this system, including results for individual node computation, message passing, and complete applications using multiple nodes. The highest rate achieved on a multiprocessor Fortran application program is 844 MFLOPS.

Overview of the Touchstone Gamma System. In spring of 1989, DARPA and Intel Scientific Computers announced the Touchstone project. This project calls for the development of a series of prototype machines by Intel Scientific Computers, based on hardware and software technologies being developed by Intel in collaboration with research teams at Caltech, MIT, UC Berkeley, Princeton, and the University of Illinois. The eventual goal of this project is the Sigma prototype, a 150 GFLOPS peak parallel supercomputer with 2000 processing nodes. One of the milestones toward the Sigma prototype is the Gamma prototype. At the end of December 1989, the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center took delivery of one of the first two Touchstone Gamma systems, and it became available for testing in January 1990.

The Touchstone Gamma system is based on the new 64-bit i860 microprocessor by Intel [4]. The i860 has over 1 million transistors and runs at 40 MHz (the initial Touchstone Gamma systems were delivered with 33 MHz processors, but these have since been upgraded to 40 MHz). The theoretical peak speed is 80 MFLOPS for 32-bit and 60 MFLOPS for 64-bit floating-point operations. The i860 features 32 integer address registers of 32 bits each and 16 floating-point registers of 64 bits each (or 32 floating-point registers of 32 bits each). It also features an 8-kilobyte on-chip data cache and a 4-kilobyte instruction cache. There is a 128-bit data path between cache and registers, and a 64-bit data path between main memory and registers.

The i860 has a number of advanced features to facilitate high execution rates. First, a number of important operations, including floating-point add, multiply, and fetch from main memory, are pipelined: they are segmented into three stages, and in most cases a new operation can be initiated every 25-nanosecond clock period. In addition, multiple instructions can be executed in a single clock period; for example, a memory fetch, a floating add, and a floating multiply can all be initiated in the same clock period.

A single node of the Touchstone Gamma system consists of the i860, 8 megabytes (MB) of dynamic random-access memory, and hardware for communication to other nodes. The Touchstone Gamma system at NASA Ames consists of 128 computational nodes, giving the theoretical peak of over seven GFLOPS noted above.