Pub Date : 1990-04-08DOI: 10.1109/DMCC.1990.555419
C. Grosch, M. Ghose, S.N. Gupta, T. L. Jackson, M. Zubair
We present a systematic study of the applicability of massively parallel computers, the AMT DAP-510/610 and the TMC CM-2, to the solution of the two-dimensional unsteady Euler equa tions using a compact high-order scheme. The performance of these machines is compared to that of the Cray-2 and the Cray-YMP/832 using the same algorithm and for the same test problem.
{"title":"Massively Parallel Computation of the Euler Equations","authors":"C. Grosch, M. Ghose, S.N. Gupta, T. L. Jackson, M. Zubair","doi":"10.1109/DMCC.1990.555419","DOIUrl":"https://doi.org/10.1109/DMCC.1990.555419","url":null,"abstract":"We present a systematic study of the applicability of massively parallel computers, the AMT DAP-510/610 and the TMC CM-2, to the solution of the two-dimensional unsteady Euler equa tions using a compact high-order scheme. The performance of these machines is compared to that of the Cray-2 and the Cray-YMP/832 using the same algorithm and for the same test problem.","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115264077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1990-04-08DOI: 10.1109/DMCC.1990.555358
J. Blanc, D. Trystram, J. Ryckbosch
LMC-IIVLAG EDF-DER Abstract. For the past 20 years, an increasing interest has been devoted to the sequential Conjugate Gradient Method for solving large linear systems arising from the modeling of physical problems (especially for very large systems with sparse matrices). This paper deals with the implementation on parallel supercomputers of a preconditioned conjugate gradient method for solving the corrective switching problem obtained while modeling the behavior of power systems in electrical networks. This problem consists in finding the successive solutions of many close linear systems (not too large) with very ill-conditioned matrices (sometimes even singular). We present a new method based on the Preconditioned Conjugate Gradient algorithm with an original preconditioning and study its parallelization on both shared and distributed memory computers. 1. Setting of the problem During the control of electrical networks, the operator must ensure the system to bc in a safc state (i.e. to be able to protect the system against incidents liable to occur in real time). The demand and the possibility of the plants are such that nuclear energy between two plants flows from various nodes of the network. The loss of one element could jeopardize the security of the whole system by a chain tripping: in such case, an overload line occurs and without any operation the protective devices will act and the line will trip out. In actual operations conditions, the switching actions that the operator applies to the electrical network ensure that overloads will disappear before the delayed protective devices go into action. Such actions are shown on the picture at the end of the paper. The computation of switching actions is a combinatorial problem, very hard to solve. The connections of the switching elements are described as discrete variables. The corrective switching problem corresponds to determine the various possible solutions of the load flow calculation. Each such situation requires to solve a linear system where the matrices have only a few elements which differ from each other. Let us consider the N consecutive linear systems below: (Si) Ajx; = b;, lG
LMC-IIVLAG EDF-DER摘要。在过去的20年里,序列共轭梯度法在求解大型线性系统(特别是具有稀疏矩阵的非常大的系统)的物理问题建模中引起了越来越多的兴趣。本文研究了一种预条件共轭梯度法在并行超级计算机上的实现,该方法用于求解电网中电力系统行为建模时得到的校正开关问题。这个问题包括寻找许多具有非常病态矩阵(有时甚至是奇异矩阵)的紧密线性系统(不太大)的连续解。提出了一种基于预条件共轭梯度算法的新方法,并对其在共享和分布式存储计算机上的并行化进行了研究。1. 在电网控制过程中,操作员必须确保系统处于安全状态(即能够保护系统免受实时可能发生的事故的影响)。电厂的需求和可能性是这样的,两个电厂之间的核能从网络的不同节点流动。一个元件的丢失可能会因链式跳闸而危及整个系统的安全:在这种情况下,线路发生过载,不需要任何操作,保护装置就会起作用,线路就会跳闸。在实际运行条件下,操作人员对电网的切换动作确保在延迟保护装置动作之前过载消失。这些动作在文章末尾的图片中都有显示。开关动作的计算是一个很难解决的组合问题。开关元件的连接被描述为离散变量。纠偏切换问题对应于确定潮流计算的各种可能解。每个这样的情况都需要求解一个线性系统,其中矩阵只有几个元素彼此不同。让我们考虑下面的N个连续线性系统:(Si) Ajx;= b;, lG
{"title":"Parallel Distributed-Memory Implementation of the Corrective Switching Problem","authors":"J. Blanc, D. Trystram, J. Ryckbosch","doi":"10.1109/DMCC.1990.555358","DOIUrl":"https://doi.org/10.1109/DMCC.1990.555358","url":null,"abstract":"LMC-IIVLAG EDF-DER Abstract. For the past 20 years, an increasing interest has been devoted to the sequential Conjugate Gradient Method for solving large linear systems arising from the modeling of physical problems (especially for very large systems with sparse matrices). This paper deals with the implementation on parallel supercomputers of a preconditioned conjugate gradient method for solving the corrective switching problem obtained while modeling the behavior of power systems in electrical networks. This problem consists in finding the successive solutions of many close linear systems (not too large) with very ill-conditioned matrices (sometimes even singular). We present a new method based on the Preconditioned Conjugate Gradient algorithm with an original preconditioning and study its parallelization on both shared and distributed memory computers. 1. Setting of the problem During the control of electrical networks, the operator must ensure the system to bc in a safc state (i.e. to be able to protect the system against incidents liable to occur in real time). The demand and the possibility of the plants are such that nuclear energy between two plants flows from various nodes of the network. The loss of one element could jeopardize the security of the whole system by a chain tripping: in such case, an overload line occurs and without any operation the protective devices will act and the line will trip out. In actual operations conditions, the switching actions that the operator applies to the electrical network ensure that overloads will disappear before the delayed protective devices go into action. Such actions are shown on the picture at the end of the paper. The computation of switching actions is a combinatorial problem, very hard to solve. The connections of the switching elements are described as discrete variables. The corrective switching problem corresponds to determine the various possible solutions of the load flow calculation. Each such situation requires to solve a linear system where the matrices have only a few elements which differ from each other. Let us consider the N consecutive linear systems below: (Si) Ajx; = b;, lG<N where the matrices Ai (of size n by n) are \"close\" to each other, viz, A;+1 = Ai+Ai, with Ai of small norm. The solutions xi will be close to each other in this sense, and we want to take full advantage of this. Note that this problem also occurs in Adaptive Filtering or Finite Element modeling.","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"158 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116306272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1990-04-08DOI: 10.1109/DMCC.1990.556322
C. Koelbel, P. Mehrotra, J. Saltz, H. Berryman
Any programming environment for distributed memory machines that allows the user to specify pdwallel do loops over globally defined data structures requires optimizations that go beyond the specification of Lrppropriate data and workload partitionings. In this paper, we consider optimizations that are required for efficient execution of a code segment that consists of pmallel loops over distributed data Structures. On distributed memory machines it is typically very expensive tci fetch individual data elements. Instead, before a parallirl loop executes, it is desirable to prefetch all off-processor data required in the loop. We specify a scheme for s boring copies of fetched data along with a scheme for accessing copies of off-processor data during the computafJ ion of the loop. The performance of such optimizations rm the iPSC/2 and the NCUBE is also presented.
{"title":"Parallel Loops on Distributed Machines","authors":"C. Koelbel, P. Mehrotra, J. Saltz, H. Berryman","doi":"10.1109/DMCC.1990.556322","DOIUrl":"https://doi.org/10.1109/DMCC.1990.556322","url":null,"abstract":"Any programming environment for distributed memory machines that allows the user to specify pdwallel do loops over globally defined data structures requires optimizations that go beyond the specification of Lrppropriate data and workload partitionings. In this paper, we consider optimizations that are required for efficient execution of a code segment that consists of pmallel loops over distributed data Structures. On distributed memory machines it is typically very expensive tci fetch individual data elements. Instead, before a parallirl loop executes, it is desirable to prefetch all off-processor data required in the loop. We specify a scheme for s boring copies of fetched data along with a scheme for accessing copies of off-processor data during the computafJ ion of the loop. The performance of such optimizations rm the iPSC/2 and the NCUBE is also presented.","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116018461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1990-04-08DOI: 10.1109/DMCC.1990.555382
R. Battiti
Recently multigrid techniques have been proposed for solving low-level vision problems in optimal time (i.e. time proportional to the number of pixels). In the present work this method is extended to incorporate a discontinuity detection process cooperating with the smoothing phase on all scales. Activation of line element detectors that signal the presence of relevant discontinuities is based on information gathered from neighboring points at the same and different scales. Because the required computation is local, parallelism can be profitably used. A mapping of the required data structure onto a two dimensional mesh of processors is suggested. Domain decomposition is shown to be efficient on MIMD computers capable of containing many individual cells in each processor. Some examples of the proposed multiscale solution techniques are shown for two different applications. In the first case a surface is reconstructed from first derivative information (extracted from the intensity data), in the second case from noisy depth constraints.
{"title":"Surface Reconstruction and Discontinuity Detection: A Fast Hierarchical Approach on a Two-Dimensional Mesh","authors":"R. Battiti","doi":"10.1109/DMCC.1990.555382","DOIUrl":"https://doi.org/10.1109/DMCC.1990.555382","url":null,"abstract":"Recently multigrid techniques have been proposed for solving low-level vision problems in optimal time (i.e. time proportional to the number of pixels). In the present work this method is extended to incorporate a discontinuity detection process cooperating with the smoothing phase on all scales. Activation of line element detectors that signal the presence of relevant discontinuities is based on information gathered from neighboring points at the same and different scales. Because the required computation is local, parallelism can be profitably used. A mapping of the required data structure onto a two dimensional mesh of processors is suggested. Domain decomposition is shown to be efficient on MIMD computers capable of containing many individual cells in each processor. Some examples of the proposed multiscale solution techniques are shown for two different applications. In the first case a surface is reconstructed from first derivative information (extracted from the intensity data), in the second case from noisy depth constraints.","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121840684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1990-04-08DOI: 10.1109/DMCC.1990.556397
B. Mak, O. Egecioglu
The communication complexity on Intel’s second generation iPSC/2 hypercube and its effect on parallelization of Back Propagation type training algorithms for neural networks are explored. On iPSC/2 , different broadcasting methods are tested and three inter-node communication schemes are evaluated based on their performance on vector addition. These communication schemes are then utilized on parallel versions of the Back Propagation training algorithm. The performance of the resulting parallel variants of Back Propagation are analyzed using two medium size problems: vowel classification and English text-to-speech conversion (NETtalk data).
{"title":"Communication Parameter Tests and Parallel Back Propagation Algorithms on iPSC/2 Hypercube Multiprocessor","authors":"B. Mak, O. Egecioglu","doi":"10.1109/DMCC.1990.556397","DOIUrl":"https://doi.org/10.1109/DMCC.1990.556397","url":null,"abstract":"The communication complexity on Intel’s second generation iPSC/2 hypercube and its effect on parallelization of Back Propagation type training algorithms for neural networks are explored. On iPSC/2 , different broadcasting methods are tested and three inter-node communication schemes are evaluated based on their performance on vector addition. These communication schemes are then utilized on parallel versions of the Back Propagation training algorithm. The performance of the resulting parallel variants of Back Propagation are analyzed using two medium size problems: vowel classification and English text-to-speech conversion (NETtalk data).","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123338190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1990-04-08DOI: 10.1109/DMCC.1990.556337
M. Heath
In this talk we show how graphical animation of the behavior of parallel algorithms can facilitate the design and performance enhancement of algorithms for matrix computations on parallel computer architectures. Using a portable instrumented communication library and a graphical animation package developed at Oak Ridge National Laboratory, we illustrate the effects of various strategies in parallel algorithm design, including interconnection topologies, global communication patterns, data mapping schemes, load balancing, and pipelining techniques for overlapping communication with computation. In this talk we focus on distributed-memory parallel architectures in which the processors communicate by passing messages. The linear algebra problems we consider include matrix factorization and the solution of triangular systems.
{"title":"Visual Animation of Parallel Algorithms for Matrix Computations","authors":"M. Heath","doi":"10.1109/DMCC.1990.556337","DOIUrl":"https://doi.org/10.1109/DMCC.1990.556337","url":null,"abstract":"In this talk we show how graphical animation of the behavior of parallel algorithms can facilitate the design and performance enhancement of algorithms for matrix computations on parallel computer architectures. Using a portable instrumented communication library and a graphical animation package developed at Oak Ridge National Laboratory, we illustrate the effects of various strategies in parallel algorithm design, including interconnection topologies, global communication patterns, data mapping schemes, load balancing, and pipelining techniques for overlapping communication with computation. In this talk we focus on distributed-memory parallel architectures in which the processors communicate by passing messages. The linear algebra problems we consider include matrix factorization and the solution of triangular systems.","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114588854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1990-04-08DOI: 10.1109/DMCC.1990.556399
Woei-kae Chen, Matthias F. Stallmann
The hypercube embedding problem, a restricted ver- sion of the general mapping problem, is the problem of mapping a set of communicating processes to a hy- percube multiprocessor. The goal is to find a map- ping that minimizes the average length of the paths between communicating processes. Iterative improve- ment heuristics for hypercube embedding, including a local search, a Kernighan-Lin, and a simulated an- nealing, are evaluated under different options includ- ing neighborhoods (all-swaps versus cube-neighbors), initial solutions (random versus greedy), and enhance- ments on terminating conditions (flat moves and up- hill moves). By varying these options we obtain a wide range of tradeoffs between execution time and solution quality.
{"title":"Local Search Variants for Hypercube Embedding","authors":"Woei-kae Chen, Matthias F. Stallmann","doi":"10.1109/DMCC.1990.556399","DOIUrl":"https://doi.org/10.1109/DMCC.1990.556399","url":null,"abstract":"The hypercube embedding problem, a restricted ver- sion of the general mapping problem, is the problem of mapping a set of communicating processes to a hy- percube multiprocessor. The goal is to find a map- ping that minimizes the average length of the paths between communicating processes. Iterative improve- ment heuristics for hypercube embedding, including a local search, a Kernighan-Lin, and a simulated an- nealing, are evaluated under different options includ- ing neighborhoods (all-swaps versus cube-neighbors), initial solutions (random versus greedy), and enhance- ments on terminating conditions (flat moves and up- hill moves). By varying these options we obtain a wide range of tradeoffs between execution time and solution quality.","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128421556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1990-04-08DOI: 10.1109/DMCC.1990.556302
P. Liewer, E. W. Leaver, V. Decyk, J. Dawson
Dynamic load balancing has been implemented in a concurrent one-dimensional electromagnetic plasma particle-in-cell (PIC) simulation code using a method which adds very little overhead to the parallel code. In PIC codes, the orbits of many interacting plasma electrons and ions are followed as an initial value problem as the particles move in electromagnetic fields calculated self-consistently from the particle motions. The code was implemented using the GCPIC algorithm in which the particles are divided among processors by partitioning the spatial domain of the simulation. The problem is load-balanced by partitioning the spatial domain so that each partition has approximately the same number of particles. During the simulation, the partitions are dynamically recreated as the spatial distribution of the particles changes in order to maintain processor load balance.
{"title":"Dynamic Load Balancing in a Concurrent Plasma PIC Code on the JPL/Caltech Mark III Hypercube","authors":"P. Liewer, E. W. Leaver, V. Decyk, J. Dawson","doi":"10.1109/DMCC.1990.556302","DOIUrl":"https://doi.org/10.1109/DMCC.1990.556302","url":null,"abstract":"Dynamic load balancing has been implemented in a concurrent one-dimensional electromagnetic plasma particle-in-cell (PIC) simulation code using a method which adds very little overhead to the parallel code. In PIC codes, the orbits of many interacting plasma electrons and ions are followed as an initial value problem as the particles move in electromagnetic fields calculated self-consistently from the particle motions. The code was implemented using the GCPIC algorithm in which the particles are divided among processors by partitioning the spatial domain of the simulation. The problem is load-balanced by partitioning the spatial domain so that each partition has approximately the same number of particles. During the simulation, the partitions are dynamically recreated as the spatial distribution of the particles changes in order to maintain processor load balance.","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130498587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1990-04-08DOI: 10.1109/DMCC.1990.555423
S. Plimpton
Two parallel algorithms for classical molecular dynamics are presented. The first assigns each processor to a subset of particles; the second assigns each to a fixed region of 3d space. The algorithms are implemented on 1024-node hypercubes for problems characterized by short-range forces, diffusion (so that each particle’s neighbors change in time), and problem size ranging from 250 to 10000 particles. Timings for the algorithms on the 1024-node NCUBE/ten and the newer NCUBE 2 hypercubes are given. The latter is found to be competitive with a CRAY-XMP, running an optimized serial algorithm. For smaller problems the NCUBE 2 and CRAY-XMP are roughly the same; for larger ones the NCUBE 2 (with 1024 nodes) is up to twice as fast. Parallel efficiencies of the algorithms and communication parameters for the two hypercubes are also examined.
{"title":"Molecular Dynamics Simulations of Short-Range Force Systems on 1024-Node Hypercubes","authors":"S. Plimpton","doi":"10.1109/DMCC.1990.555423","DOIUrl":"https://doi.org/10.1109/DMCC.1990.555423","url":null,"abstract":"Two parallel algorithms for classical molecular dynamics are presented. The first assigns each processor to a subset of particles; the second assigns each to a fixed region of 3d space. The algorithms are implemented on 1024-node hypercubes for problems characterized by short-range forces, diffusion (so that each particle’s neighbors change in time), and problem size ranging from 250 to 10000 particles. Timings for the algorithms on the 1024-node NCUBE/ten and the newer NCUBE 2 hypercubes are given. The latter is found to be competitive with a CRAY-XMP, running an optimized serial algorithm. For smaller problems the NCUBE 2 and CRAY-XMP are roughly the same; for larger ones the NCUBE 2 (with 1024 nodes) is up to twice as fast. Parallel efficiencies of the algorithms and communication parameters for the two hypercubes are also examined.","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130773478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1990-04-08DOI: 10.1109/DMCC.1990.556305
D. King, E. Wegman
This paper reports on the results of a preliminary study in dynamic load balancing on an Intel Hypercube. The purpose of this research is to provide experimental data in how parallel algorithms should be constructed to obtain maximal utilization of a parallel architecture. This study is one aspect of an ongoing research project into the construction of an automated parallelization tool. This tool will take FORTRAN source as input, and construct a parallel algorithm that will produce the same results as the original serial input. The focus of this paper is on the load balancing aspect of that project. The basic idea is to reserve a certain percentage of the computation task, subdivide that percentage into arbitrarily fine tasks, and dole those small tasks out to nodes on request. Ij” the percentage is chosen correctly, then a minority of nodes should be involved in consuming the filler tasks, and the overall throughput of the job should increase as a result of the individual node efJciencies having increased. This paper will outline our approach to performing dynamic load balancing on an Intel iPSC/2. We take the view that the problem of load balancing is really a problem of dividing a “computational task” into smaller components, each of roughly equal complexity, and each an independent event. After this is done, the components of the task can be sent to a node for execution. The key to an optimally balanced load across all computational nodes is the ability to form a statistical profile of the individual components of each computational task. This statistical profile will determine an initial sequence of execution. Our experience indicates that a speedup on the order of 80% is achievable with the judicious use of profiled load balancing. During the process of execution, the initial profile will be altered according to the actual behavior exhibited by the nodes. The difference between the actual and expected performance will be used to determine how much additional time should be devoted to altering the current execution schedule. Currently, our work involves statically setting the load balancing parameters. Our load balancing system determines the execution schedule
{"title":"Hypercube Dynamic Load Balancing","authors":"D. King, E. Wegman","doi":"10.1109/DMCC.1990.556305","DOIUrl":"https://doi.org/10.1109/DMCC.1990.556305","url":null,"abstract":"This paper reports on the results of a preliminary study in dynamic load balancing on an Intel Hypercube. The purpose of this research is to provide experimental data in how parallel algorithms should be constructed to obtain maximal utilization of a parallel architecture. This study is one aspect of an ongoing research project into the construction of an automated parallelization tool. This tool will take FORTRAN source as input, and construct a parallel algorithm that will produce the same results as the original serial input. The focus of this paper is on the load balancing aspect of that project. The basic idea is to reserve a certain percentage of the computation task, subdivide that percentage into arbitrarily fine tasks, and dole those small tasks out to nodes on request. Ij” the percentage is chosen correctly, then a minority of nodes should be involved in consuming the filler tasks, and the overall throughput of the job should increase as a result of the individual node efJciencies having increased. This paper will outline our approach to performing dynamic load balancing on an Intel iPSC/2. We take the view that the problem of load balancing is really a problem of dividing a “computational task” into smaller components, each of roughly equal complexity, and each an independent event. After this is done, the components of the task can be sent to a node for execution. The key to an optimally balanced load across all computational nodes is the ability to form a statistical profile of the individual components of each computational task. This statistical profile will determine an initial sequence of execution. Our experience indicates that a speedup on the order of 80% is achievable with the judicious use of profiled load balancing. During the process of execution, the initial profile will be altered according to the actual behavior exhibited by the nodes. The difference between the actual and expected performance will be used to determine how much additional time should be devoted to altering the current execution schedule. Currently, our work involves statically setting the load balancing parameters. Our load balancing system determines the execution schedule","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128696924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}