The Computational Fluid Dynamics (CFD) code FL057, which solves the 3-D Euler equations using an explicit, finite-volume, Runge-Kutta algorithm, was implemented on an Intel IPSC-MX parallel processor. Spatial decomposition was applied to the solution grid about a fighter aircraft configuration, and binary reflected Gray codes were used to map the computational domain onto the IPSC, ensuring nearest-neighbor communication. Results and timings of the implementation are presented, along with a comparison of the IPSC against a uniprocessor machine of similar classification, to assess the performance of the IPSC on FL057. Suggested improvements to the current version of the parallelized code are listed to improve load balancing, vectorization, and memory use.
{"title":"Solution of the 3-D Euler equations for the flow about a fighter aircraft configuration using a hypercube parallel processor","authors":"D. Weissbein, J. F. Mangus, M. W. George","doi":"10.1145/63047.63066","DOIUrl":"https://doi.org/10.1145/63047.63066","url":null,"abstract":"The Computational Fluid Dynamics (CFD) code FL057, which solves the 3-D Euler Equations using an explicit, finite volume, Runge-Kutta algorithm, was implemented on an Intel IPSC-MX parallel processor. Spatial decomposition was effected on the solution grid about a fighter aircraft configuration and Binary Reflected Graycodes were used to map the computational domain onto the IPSC insuring nearest neighbor communication. Results and timings of the implementation are presented with a comparison of the IPSC and a uniprocessor machine of similar classification to assess the performance of the IPSC on FL057. Suggested improvements to the current version of the parallelized code are listed to aid load balancing, vectorization, and more efficient memory use.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123253724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
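The Gray-code mapping mentioned above has a simple core: consecutive integers in a binary reflected Gray code differ in exactly one bit, so adjacent slabs of the decomposed grid land on hypercube nodes that are direct neighbors. A minimal sketch (not the paper's code; function names are illustrative):

```python
def gray(i):
    """Binary-reflected Gray code of integer i: i XOR (i >> 1)."""
    return i ^ (i >> 1)

def ring_to_hypercube(n_nodes):
    """Map ring positions 0..n-1 to hypercube node ids so that
    consecutive positions differ in exactly one bit (nearest neighbors)."""
    return [gray(i) for i in range(n_nodes)]

mapping = ring_to_hypercube(8)
# Every consecutive pair (including the wraparound) differs in one bit,
# so neighboring grid slabs communicate over a single hypercube link.
neighbors_ok = all(
    bin(mapping[i] ^ mapping[(i + 1) % 8]).count("1") == 1 for i in range(8)
)
```

Because the wraparound pair also differs in one bit, the same mapping embeds a periodic (ring) decomposition without any multi-hop messages.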
Astronomical data sets are beginning to live up to their name, in both their sizes and the complexity of the analysis required. Here we discuss two astronomical data analysis problems which we have begun to implement on a hypercube concurrent processor environment: The intensive image processing required in an optical interferometry project, and the large scale power spectral analysis required by a search for millisecond-period radio pulsars. In both cases the analysis proceeds largely in the Fourier domain, and we find that the problems are readily adapted to a concurrent environment. In the following report, we outline briefly the astronomical background for each problem, then discuss the general computational requirements, and finally present possible hypercube algorithms and results achieved to date.
{"title":"Hypercube data analysis in astronomy: optical interferometry and millisecond pulsar searches","authors":"P. Gorham, T. Prince, S. Anderson","doi":"10.1145/63047.63049","DOIUrl":"https://doi.org/10.1145/63047.63049","url":null,"abstract":"Astronomical data sets are beginning to live up to their name, in both their sizes and the complexity of the analysis required. Here we discuss two astronomical data analysis problems which we have begun to implement on a hypercube concurrent processor environment: The intensive image processing required in an optical interferometry project, and the large scale power spectral analysis required by a search for millisecond-period radio pulsars. In both cases the analysis proceeds largely in the Fourier domain, and we find that the problems are readily adapted to a concurrent environment. In the following report, we outline briefly the astronomical background for each problem, then discuss the general computational requirements, and finally present possible hypercube algorithms and results achieved to date.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126260391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
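The pulsar-search side of the work above hinges on one Fourier-domain primitive: the power spectrum of a long time series, in which a periodic signal appears as a peak. A minimal single-node sketch using NumPy (the paper's distributed FFT is not reproduced here):

```python
import numpy as np

def power_spectrum(x):
    """One-sided power spectrum of a real time series via FFT --
    the core operation in a Fourier-domain periodicity search."""
    X = np.fft.rfft(x)
    return (X * np.conj(X)).real

# A 50 Hz sinusoid sampled at 1 kHz for 1 s should peak in bin 50.
t = np.arange(1000) / 1000.0
x = np.sin(2 * np.pi * 50 * t)
peak_bin = int(np.argmax(power_spectrum(x)[1:])) + 1  # skip the DC bin
```

On a hypercube the same computation would be split across nodes with a distributed FFT; the detection step (scanning spectrum bins for peaks) is then embarrassingly parallel.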
The region growing paradigm for image segmentation groups neighboring pixels into regions according to a predetermined homogeneity criterion. A parallel method for region growing on an MIMD multiprocessor system is presented. Since the region growing problem exhibits non-uniform and unpredictable load fluctuations, it requires a dynamic load balancing scheme to achieve a balanced load distribution. The results of implementing a parallel region growing algorithm on the Intel iPSC hypercube are discussed.
{"title":"Region growing on a hypercube multiprocessor","authors":"M. Willebeek-LeMair, A. Reeves","doi":"10.1145/63047.63057","DOIUrl":"https://doi.org/10.1145/63047.63057","url":null,"abstract":"The region growing paradigm for image segmentation groups neighboring pixels into regions depending upon a predetermined homogeneity criteria. A parallel method for region growing on an MIMD multiprocessor system is presented. Since the region growing problem exhibits non-uniform and unpredictable load fluctuations, it requires a dynamic load balancing scheme to achieve a balanced load distribution. The results of implementing a parallel region growing algorithm on the Intel-iPSC hypercube are discussed.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122348759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A parallel algorithm for solving elliptic partial differential equations (PDEs) via the finite difference method (FDM) is described in this paper. The Concurrent Preconditioned Conjugate Gradient method is developed to optimize processor load balancing. The algorithm is evaluated on a hypercube-based concurrent machine, the Intel iPSC.
{"title":"The preconditioned conjugate gradient method on the hypercube","authors":"G. Abe, K. Hane","doi":"10.1145/63047.63126","DOIUrl":"https://doi.org/10.1145/63047.63126","url":null,"abstract":"A parallel algorithm for solving the elliptic partial differential equation (PDE) is described in this paper through the finite difference method (FDM) The Concurrent Preconditioned Conjugate Gradient method is developed to optimize processor load balancing. This algorithm is evaluated on a hypercube-based concurrent machine, the Intel iPSC.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116084696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
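For readers unfamiliar with the method named above, a minimal serial sketch of preconditioned conjugate gradient with a Jacobi (diagonal) preconditioner — chosen here because the M⁻¹r solve is pointwise and therefore trivially parallel; the paper's actual preconditioner may differ:

```python
import numpy as np

def pcg(A, b, tol=1e-10, max_iter=200):
    """Conjugate gradient with a Jacobi (diagonal) preconditioner.
    Each application of M^-1 is elementwise, so it parallelizes
    across grid points with no communication."""
    M_inv = 1.0 / np.diag(A)      # Jacobi preconditioner M = diag(A)
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# 1-D Poisson matrix (tridiagonal [-1, 2, -1]), a standard FDM test case.
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = pcg(A, np.ones(n))
```

In a hypercube implementation, the matrix-vector product A @ p requires only nearest-neighbor halo exchanges for an FDM stencil, while the dot products become global reductions.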
D. Baxter, J. Saltz, M. Schultz, S. Eisenstat, K. Crowley
High performance multiprocessor architectures differ both in the number of processors and in the delay costs for synchronization and communication. In order to obtain good performance on a given architecture for a given problem, adequate parallelization, good load balance, and an appropriate choice of granularity are essential. We discuss the implementation of a parallel version of PCGPAK for both shared memory architectures and hypercubes. Our parallel implementation is sufficiently efficient to allow us to complete the solution of our test problems on 16 processors of the Encore Multimax/320 in an amount of time that is a small multiple of that required by a single head of a Cray X/MP, despite the fact that the peak performance of the Multimax processors is not even close to the supercomputer range. We illustrate the effectiveness of our approach on a number of model problems from reservoir engineering and mathematics.
{"title":"An experimental study of methods for parallel preconditioned Krylov methods","authors":"D. Baxter, J. Saltz, M. Schultz, S. Eisenstat, K. Crowley","doi":"10.1145/63047.63128","DOIUrl":"https://doi.org/10.1145/63047.63128","url":null,"abstract":"High performance multiprocessor architectures differ both in the number of processors, and in the delay costs for synchronization and communication. In order to obtain good performance on a given architecture for a given problem, adequate parallelization, good balance of load and an appropriate choice of granularity are essential.\u0000We discuss the implementation of parallel version of PCGPAK for both shared memory architectures and hypercubes. Our parallel implementation is sufficiently efficient to allow us to complete the solution of our test problems on 16 processors of the Encore Multimax/320 in an amount of time that is a small multiple of that required by a single head of a Cray X/MP, despite the fact that the peak performance of the Multimax processors is not even close to the supercomputer range. We illustrate the effectiveness of our approach on a number of model problems from reservoir engineering and mathematics.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121588162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wave-equation seismic modeling in two space dimensions is computationally intensive, often requiring hours of supercomputer CPU time to run typical geological models with 500 × 500 grids and 100 sources. This paper analyzes the performance of ACOUS2D, an explicit 4th-order finite-difference program, on Intel's 16-processor vector hypercube computer. The conversion of the sequential version of ACOUS2D to run on the hypercube was straightforward but time-consuming. The key consideration for optimal efficiency is load balancing. On a fairly typical geologic model, the 16-processor Intel vector hypercube computer ran ACOUS2D at 1/3 the speed of a Cray-1S.
{"title":"Hypercube performance for 2-D seismic finite-difference modeling","authors":"L. J. Baker","doi":"10.1145/63047.63068","DOIUrl":"https://doi.org/10.1145/63047.63068","url":null,"abstract":"Wave-equation seismic modeling in two space dimensions is computationally intensive, often requiring hours of supercomputer CPU time to run typical geological models with 500 × 500 grids and 100 sources. This paper analyzes the performance of ACOUS2D, an explicit 4th-order finite-difference program, on Intel's 16-processor vector hypercube computer. The conversion of the sequential version of ACOUS2D to run on hypercube was straightforward, but time-consuming. The key consideration for optimal efficiency is load balancing. On a fairly typical geologic model, the 16-processor Intel vector hypercube computer ran ACOUS2D at 1/3 the speed of a Cray-1S.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123030423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
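The "explicit 4th-order finite-difference" kernel referred to above is built from a five-point second-derivative stencil applied along each axis. A one-dimensional sketch of that spatial core (illustrative only; ACOUS2D itself is a 2-D production code):

```python
import numpy as np

# Fourth-order accurate central stencil for the second derivative,
# the spatial core of an explicit acoustic wave-equation modeler.
C = np.array([-1/12, 4/3, -5/2, 4/3, -1/12])

def d2dx2(f, dx):
    """Apply the 4th-order second-derivative stencil to interior points
    (the two points at each edge need boundary treatment)."""
    out = np.zeros(len(f) - 4)
    for k, c in enumerate(C):
        out += c * f[k : k + len(out)]
    return out / dx**2

# The stencil is exact for low-degree polynomials: (x^2)'' = 2 everywhere.
x = np.linspace(0.0, 1.0, 11)
d2 = d2dx2(x**2, x[1] - x[0])
```

The two-point stencil half-width is what drives the halo exchange in a domain-decomposed run: each node must receive two rows/columns from each neighbor per time step, which is why load balancing and communication overlap dominate the tuning.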
The rule-based system has emerged as an important tool for developers of artificial intelligence programs. Because of the computational resources required to realize the MATCH-SELECT-EXECUTE cycle of rule-based systems, researchers have been trying to introduce parallelism into these systems for some time. We describe a new approach to parallel rule-based systems which exploits fine-grained hypercube hardware, with new algorithms for parallel rule matching and for executing several rules simultaneously. Experimental results from a Connection Machine implementation of BLITZ are presented.
{"title":"Blitz: a rule-based system for massively parallel architectures","authors":"K. Morgan","doi":"10.1145/63047.63091","DOIUrl":"https://doi.org/10.1145/63047.63091","url":null,"abstract":"The rule-based system has emerged as an important tool to developers of artificial intelligence programs. Because of the computational resources required to realize the MATCH-SELECT-EXECUTE cycle of rule-based systems, researchers have been trying to introduce parallelism into these systems for some time. We describe a new approach to parallel rule-based systems which exploits fine-grained hypercube hardware. The new algorithms for parallel rule matching and simultaneous execution of several rules at once are presented. Experimental results using a Connection Machine* implementation of BLITZ are presented.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131190153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
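For context, the MATCH-SELECT-EXECUTE cycle named above can be sketched as a tiny forward-chaining loop. This serial toy (not BLITZ's algorithm; rule representation is invented for illustration) shows the three phases the paper parallelizes:

```python
def run_rules(facts, rules, max_cycles=10):
    """Minimal MATCH-SELECT-EXECUTE loop over rules of the form
    (condition_facts, facts_to_assert), run until quiescence."""
    facts = set(facts)
    for _ in range(max_cycles):
        # MATCH: rules whose premises hold and that would add something new
        matches = [(cond, add) for cond, add in rules
                   if cond <= facts and not add <= facts]
        if not matches:
            break                     # quiescence: nothing left to fire
        # SELECT: trivially take the first match (real systems use
        # conflict-resolution strategies); EXECUTE: assert its facts.
        _, add = matches[0]
        facts |= add
    return facts

rules = [({"bird"}, {"has_wings"}),
         ({"has_wings"}, {"can_fly"})]
out = run_rules({"bird"}, rules)
```

The MATCH phase is the expensive one — every rule is tested against all of working memory each cycle — which is why it is the natural target for fine-grained data parallelism.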
Three sorting algorithms are given for hypercubes with d-port communication. All of these algorithms are based on binsort at the global level. The binsort allows the movement of keys among nodes to be performed by a d-port complete exchange rather than a sequence of 1-port exchanges as in other algorithms. This lowers communication costs by at least a factor of d compared to other sorting algorithms. The first algorithm assumes the keys are uniformly distributed and selects bin boundaries based on the global maximum and minimum keys. The other two algorithms make no assumption about the distribution of keys, and so they sample the keys before the binsort in order to estimate their distribution. Splitting keys based on that estimate reduces the variance among the lengths of the subsequences left in the nodes after the complete exchange of bins, which in turn helps to balance the computational load in each node. The performance of two of these algorithms on an FPS T-40 is given for data of various distributions and is compared to the performance of bitonic sort and hyperquicksort.
{"title":"Binsorting on hypercubes with d-port communication","authors":"S. Seidel, W. George","doi":"10.1145/63047.63102","DOIUrl":"https://doi.org/10.1145/63047.63102","url":null,"abstract":"Three sorting algorithms are given for hypercubes with d-port communication. All of these algorithms are based on binsort at the global level. The binsort allows the movement of keys among nodes to be performed by a d-port complete exchange rather than a sequence of l-port exchanges as in other algorithms. This lowers communication costs by at least a factor of d compared to other sorting algorithms. The first algorithm assumes the keys are uniformly distributed and selects bin boundaries based on the global maximum and minimum keys. The other two algorithms make no assumption about the distribution of keys and so they sample the keys before the binsort in order to estimate their distribution. Splitting keys based on that estimate reduce the variance among the lengths of the subsequences left in the nodes after the complete exchange of bins which in turn helps to balance the computational load in each node. The performance of two of these algorithms on an FPS T-40 is given for data of various distributions and is compared to the performance of bitonic sort and hyperquicksort.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127795040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
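The sample-based splitter idea described above can be illustrated in a few lines: sample the keys, pick splitters at sample quantiles, scatter keys into bins by splitter, and sort each bin locally. A serial sketch (sampling rate and quantile choice are illustrative, not the paper's parameters):

```python
from bisect import bisect_right

def binsort_by_splitters(keys, n_bins):
    """Sample-based binsort: estimate the key distribution from a sample,
    choose splitters at sample quantiles, scatter keys into bins, then
    sort each bin locally. Concatenating the bins yields a global sort."""
    sample = sorted(keys[::max(1, len(keys) // (4 * n_bins))])
    step = max(1, len(sample) // n_bins)
    splitters = [sample[i * step] for i in range(1, n_bins)]
    bins = [[] for _ in range(n_bins)]
    for k in keys:
        bins[bisect_right(splitters, k)].append(k)
    return [sorted(b) for b in bins]

data = [27, 3, 99, 41, 8, 56, 7, 73, 15, 62, 34, 88]
bins = binsort_by_splitters(data, 4)
merged = [k for b in bins for k in b]   # concatenation is globally sorted
```

On a hypercube, the scatter step becomes the d-port complete exchange; quantile-based splitters keep the bin sizes — and hence the per-node local sorts — roughly equal even for skewed key distributions.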
{"title":"Molecular dynamics simulation on an iPSC of defects in crystals","authors":"P. Flinn","doi":"10.1145/63047.63084","DOIUrl":"https://doi.org/10.1145/63047.63084","url":null,"abstract":"Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. TO copy otherwise, or to republish, requires a fee and/or specfic permission.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133057620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Physicists believe that the world is described in terms of gauge theories. A popular technique for investigating these theories is to discretize them onto a lattice and simulate them numerically on a computer, yielding so-called lattice gauge theory. Such computations require at least 10¹⁴ floating-point operations, necessitating the use of advanced architecture supercomputers such as the Connection Machine made by Thinking Machines Corporation. Currently the most important gauge theory to be solved is that describing the sub-nuclear world of high energy physics: Quantum Chromodynamics (QCD). The simplest example of a gauge theory is Quantum Electrodynamics (QED), the theory which describes the interaction of electrons and photons. Simulation of QCD requires computer software very similar to that for the simpler QED problem. Our current QED code achieves a computational rate of 1.6 million lattice site updates per second for a Monte Carlo algorithm, and 7.4 million site updates per second for a microcanonical algorithm. The estimated performance for a Monte Carlo QCD code is 200,000 site updates per second (or 5.6 Gflops/sec).
{"title":"QED on the connection machine","authors":"C. Baillie, S. Johnsson, Luis F. Ortiz, G. Pawley","doi":"10.1145/63047.63082","DOIUrl":"https://doi.org/10.1145/63047.63082","url":null,"abstract":"Physicists believe that the world is described in terms of gauge theories. A popular technique for investigating these theories is to discretize them onto a lattice and simulate numerically by a computer, yielding so-called lattice gauge theory. Such computations require at least 1014 floating-point operations, necessitating the use of advanced architecture supercomputers such as the Connection Machine made by Thinking Machines Corporation. Currently the most important gauge theory to be solved is that describing the sub-nuclear world of high energy physics: Quantum Chromo-dynamics (QCD). The simplest example of a gauge theory is Quantum Electro-dynamics (QED), the theory which describes the interaction of electrons and photons. Simulation of QCD requires computer software very similar to that for the simpler QED problem. Our current QED code achieves a computational rate of 1.6 million lattice site updates per second for a Monte Carlo algorithm, and 7.4 million site updates per second for a microcanonical algorithm. The estimated performance for a Monte Carlo QCD code is 200,000 site updates per second (or 5.6 Gflops/sec).","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114321845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
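The "lattice site updates" that the rates above count are Metropolis Monte Carlo steps: propose a change at one site, accept it with probability min(1, e^(-βΔE)). A deliberately simplified sketch using a 1-D Ising chain rather than a U(1) gauge field (the update structure, not the physics, is the point):

```python
import math, random

def metropolis_sweep(lattice, beta, rng):
    """One Metropolis sweep of a 1-D Ising chain with periodic boundary:
    visit each site, propose a spin flip, accept with probability
    min(1, exp(-beta * dE)). Each accepted/rejected visit is one
    'site update' in the sense the performance figures count."""
    n = len(lattice)
    for i in range(n):
        s = lattice[i]
        nbr = lattice[(i - 1) % n] + lattice[(i + 1) % n]
        dE = 2.0 * s * nbr              # energy change of flipping site i
        if dE <= 0 or rng.random() < math.exp(-beta * dE):
            lattice[i] = -s
    return lattice

rng = random.Random(42)
chain = [1] * 16
metropolis_sweep(chain, beta=0.5, rng=rng)
```

On a SIMD machine like the Connection Machine, the sweep is parallelized by checkerboarding: all even sites update simultaneously, then all odd sites, so no site ever updates concurrently with a neighbor it depends on.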