{"title":"Implementing finite element software on hypercube machines","authors":"G. Lyzenga, A. Raefsky, Bahram Nour-Omid","doi":"10.1145/63047.63134","DOIUrl":"https://doi.org/10.1145/63047.63134","url":null,"abstract":"","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126990877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An analytic model for parallel Gaussian elimination on a binary N-Cube architecture","authors":"Virgílio A. F. Almeida, L. Dowdy, M. Leuze","doi":"10.1145/63047.63114","DOIUrl":"https://doi.org/10.1145/63047.63114","url":null,"abstract":"This paper summarizes an analytical technique which predicts the time required to execute a given parallel program, with given data, on a given parallel architecture. For illustration purposes, the particular parallel program chosen is parallel Gaussian elimination and the particular parallel architecture chosen is a binary n-cube. The analytical technique is based upon a product-form queuing network model which is solved using an iterative method. The technique is validated by comparing performance predictions produced by the model against actual hypercube measurements.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127341165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparison of several methods of integrating stiff ordinary differential equations on parallel computing architectures","authors":"A. Bose, I. Nelken, J. Gelfand","doi":"10.1145/63047.63129","DOIUrl":"https://doi.org/10.1145/63047.63129","url":null,"abstract":"Many physical systems lead to initial value problems where the system of stiff ordinary differential equations is loosely coupled. Thus, in some cases the variables may be directly mapped onto sparsely connected parallel architectures such as the hypercube. This paper investigates various methods of implementing Gear's algorithm on parallel computers. Two conventional corrector methods utilize either functional or Newton-Raphson iteration. We consider both alternatives and show that they exhibit similar speedups on an n-node hypercube. In addition, a polynomial corrector is investigated. It has the advantage of not having to solve a linear system as in the Newton-Raphson method, yet it converges faster than functional iteration.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133781095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acoustic wavefield propagation using paraxial extrapolators","authors":"R. Clayton, R. Graves","doi":"10.1145/63047.63069","DOIUrl":"https://doi.org/10.1145/63047.63069","url":null,"abstract":"Modeling by paraxial extrapolators is applicable to wave propagation problems in which most of the energy is traveling within a restricted angular cone about a principal axis of the problem. Frequency domain finite-difference solutions are readily generated by using this technique. Input models can be described either by specifying velocities or appropriate media parameters on a two or three dimensional grid of points. For heterogeneous models, transmission and reflection coefficients are determined at structural boundaries within the media. The direct forward scattered waves are modeled with a single pass of the extrapolator operator in the paraxial direction for each frequency. The first-order back scattered energy can then be modeled by extrapolation (in the opposite direction) of the reflected field determined on the first pass. Higher order scattering can be included by sweeping through the model with more passes.\u0000The chief advantages of the paraxial approach are 1) active storage is reduced by one dimension as compared to solutions which must track both up-going and down-going waves simultaneously, thus even realistic three dimensional problems can fit on today's computers, 2) the decomposition in frequency allows the technique to be implemented on highly parallel machines such as the hypercube, 3) attenuation can be modeled as an arbitrary function of frequency, and 4) only a small number of frequencies are needed to produce movie-like time slices.\u0000By using this method a wide range of seismological problems can be addressed, including strong motion analysis of waves in three-dimensional basins, the modeling of VSP reflection data, and the analysis of whole earth problems such as scattering at the core-mantle boundary or the effect of tectonic boundaries on long-period wave propagation.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131947242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lattice gauge theory on the hypercube","authors":"J. Flower, J. Apostolakis, C. Baillie, H. Ding","doi":"10.1145/63047.63081","DOIUrl":"https://doi.org/10.1145/63047.63081","url":null,"abstract":"Lattice gauge theory, an extremely computationally intensive problem, has been run successfully on hypercubes for a number of years. Herein we give a flavor of this work, discussing both the physics and the computing behind it.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130971770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel branch and bound algorithms on hypercube multiprocessors","authors":"Tarek Saad Abdel-Rahman, T. Mudge","doi":"10.1145/63047.63106","DOIUrl":"https://doi.org/10.1145/63047.63106","url":null,"abstract":"Branch and Bound (BB) algorithms are a generalization of many search algorithms used in Artificial Intelligence and Operations Research. This paper presents our work on implementing BB algorithms on hypercube multiprocessors. The 0-1 integer linear programming (ILP) problem is taken as an example because it can be implemented to capture the essence of BB search algorithms without too many distracting problem specific details. A BB algorithm for the 0-1 ILP problem is discussed. Two parallel implementations of the algorithm on hypercube multiprocessors are presented. The two implementations demonstrate some of the tradeoffs involved in implementing these algorithms on multiprocessors with no shared memory, such as hypercubes. Experimental results from the NCUBE/six show the performance of the two implementations of the algorithm. Future research work is discussed.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131092158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concurrent multiple target tracking","authors":"T. D. Gottschalk","doi":"10.1145/63047.63079","DOIUrl":"https://doi.org/10.1145/63047.63079","url":null,"abstract":"A concurrent algorithm for multiple target tracking is presented. The underlying tracking formalism is first described by way of a sequential program, and the issues in generalizing the tracker for efficient concurrent implementations are discussed in detail. Typical tracking results on the Mark III hypercube are presented.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115436195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expressing Boolean cube matrix algorithms in shared memory primitives","authors":"S. Johnsson, C. T. Ho","doi":"10.1145/63047.63121","DOIUrl":"https://doi.org/10.1145/63047.63121","url":null,"abstract":"The multiplication of (large) matrices allocated evenly on Boolean cube configured multiprocessors poses several interesting trade-offs with respect to communication time, processor utilization, and storage requirement. In [7] we investigated several algorithms for different degrees of parallelization, and showed how the choice of algorithm with respect to performance depends on the matrix shape, and the multiprocessor parameters, and how processors should be allocated optimally to the different loops.\u0000In this paper the focus is on expressing the algorithms in shared memory type primitives. We assume that all processors share the same global address space, and present communication primitives both for nearest-neighbor communication, and global operations such as broadcasting from one processor to a set of processors, the reverse operation of plus-reduction, and matrix transposition (dimension permutation). We consider both the case where communication is restricted to one processor port at a time and the case of concurrent communication on all processor ports. The communication algorithms are provably optimal within a factor of two.\u0000We describe both constant storage algorithms, and algorithms with reduced communication time but a storage need proportional to the number of processors and the matrix sizes (for a one-dimensional partitioning of the matrices).","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124330365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An O(NlogN) hypercube N-body integrator","authors":"M. Warren, J. Salmon","doi":"10.1145/63047.63051","DOIUrl":"https://doi.org/10.1145/63047.63051","url":null,"abstract":"The gravitational N-body algorithm of Barnes and Hut [1] has been successfully implemented on a hypercube concurrent processor. The novel approach of their sequential algorithm has demonstrated itself to be well suited to hypercube architectures. The sequential code achieves O(NlogN) speed by recursively dividing space into subcells, thereby creating a hierarchical grouping of particles. Computing interactions between these groups dramatically reduces the amount of communication between processors, as well as the number of force calculations. Parallelism is achieved through an irregular spatial grid decomposition. Since the decomposition topology is not simple, a general loosely synchronous communication routine has been developed. Operations are simplified if the conventional grey code decomposition is modified so that the bits are taken alternately from each Cartesian dimension. A speedup of 180 has been achieved for a 500,000 particle two-dimensional calculation on 256 processors. A speedup of 65 has been obtained for a 64,000 particle three-dimensional calculation on 256 processors.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"82 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116370211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel implementation of domain decomposition techniques on Intel's hypercube","authors":"M. Haghoo, W. Proskurowski","doi":"10.1145/63047.63132","DOIUrl":"https://doi.org/10.1145/63047.63132","url":null,"abstract":"Parallel implementation of domain decomposition techniques for elliptic PDEs in rectangular regions is considered. This technique is well suited for parallel processing, since in the solution process the subproblems either are independent or can be easily converted into decoupled problems. More than 80% of execution time is spent on solving these independent and decoupled problems.\u0000The hypercube architecture is used for concurrent execution. The performance of the parallel algorithm is compared against the sequential version. The speed-up, efficiency, and communication factors are studied as functions of the number of processors. Extensive tests are performed to find, for a given mesh size, the number of subregions and nodes that minimize the overall execution time.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123342669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}