Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633164
Distributed particle based fluid flow simulation
T. Gilman, T. Huntsberger, P. Sharma
Many attempts have been made to simulate the motion of non-rigid objects. While there have been many successes in this area, the animation of fluids is still a relatively unconquered frontier. This paper describes a distributed model for fluid flow study based on behavioral simulation of atom-like particles. These particles define the size and shape of the fluid. In addition, these particles have inertia and respond to attraction, repulsion and gravitation. Unlike previous fluid flow systems, inter-particle forces are explicitly included in the model. A distributed mapping of the particle database similar to recent load-balanced PIC studies [5, 6] allows large numbers of particles to be included in the model. We also present the results of some experimental studies performed on the NCUBE/10 system at the University of South Carolina.
{"title":"Distributed particle based fluid flow simulation","authors":"T. Gilman, T. Huntsberger, P. Sharma","doi":"10.1109/DMCC.1991.633164","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633164","url":null,"abstract":"Many attempts have been made t o simulate the motion of non-rigid objects. While there have been many successes in this area, the animation of fluids is still a relatively unconquered frontier. This paper describes a distributed model for fluid flow study based on behavioral simulation of atom-like particles. These particles define the size and shape of the fluid. In addiiion, these particles have inertia and respond to attraction, repulsion and gravitation. Unlike previous fluid flow systems, inter-particle forces are explicitly included an the model. A distributed mapping of the particle database similar to recent load-balanced PIC studies [5, 61 allows large numbers of particles to be included in the model. We also present the results of some experimental studies performed on the NCUBE/lD system at the University of South Carolina.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129349447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633361
Fault Tolerant Communication in the C.NET High Level Programming Environment
J. Adamo, J. Benneville, C. Bonello, L. Trejo
This work is part of a high-level environment we are developing for a reconfigurable transputer-based machine. It deals with the design of a virtual channel monitor. A protocol is described which, among other things, allows pre-emption of communications and possible failure of the links to be handled consistently.
{"title":"Fault Tolerant Communication in the C.NET High Levell Programming Environment","authors":"J. Adamo, J. Benneville, C. Bonello, L. Trejo","doi":"10.1109/DMCC.1991.633361","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633361","url":null,"abstract":"This work is pad of a high-level environment we are developing for a reconfigurable transputer-based machine. It deals with the design of a virtual channel monitor. A protocol is described which, among other things, allows pre-emption of communications and possible failure of the links to be handled consistently.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133065115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633218
On Implementing Agenda Parallelism in Production Systems
G. A. Sawyer, G. Lamont
Parallel rule execution (agenda parallelism) represents a relatively unexplored method for increasing the execution speed of production systems on parallel computer architectures. Agenda parallelism has the potential to increase the execution speed of parallel production systems by an order of magnitude. However, it also introduces a number of significant overhead factors that must be contended with. This paper presents an overview of AFIT's initial research on agenda parallelism; it includes a discussion of the advantages and liabilities associated with this decomposition approach, based on formal proofs, problem analysis and actual implementation.
{"title":"On Implementing Agenda Parallelism in Production Systems","authors":"G. A. Sawyer, G. Lamont","doi":"10.1109/DMCC.1991.633218","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633218","url":null,"abstract":"Parallel rule execution (agenda parallelism) represents a relatively unexplored method for increasing the execution speed ojr production systems on parallel computer architectures. Agenda parallelism possesses the potential .for increasing the execution speed o f parallel production systems b y an orde,r of magnitude. However, agenda parallelism also introduces a number of significant overhead factors that must be contended with. This paper presents an overview of AFIT’s initial research on agenda parallelism; it includes a discussion ojf the advaniiages and liabilities associated with this decomposition approach based on formal proofs, problem analysis and actual implementation.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124147767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633203
Linear Speedup of Winograd's Matrix Multiplication Algorithm Using an Array Processor
De-Lei Lee, M. A. Aboelaze
Winograd's matrix multiplication algorithm halves the number of multiplication operations required by the conventional O(N^3) matrix multiplication algorithm by slightly increasing the number of addition operations. Such a technique can be computationally advantageous when the machine performing the matrix computation takes much more time for multiplication than for addition operations. This is overwhelmingly the case in the massively parallel computing paradigm, where each processor is extremely simple by itself and the computing power is obtained by the use of a large number of such processors. In this paper, we describe a parallel version of Winograd's matrix multiplication algorithm using an array processor and show how to achieve nearly linear speedup over its sequential counterpart.
{"title":"Linear Speedup of Winograd's Matrix Multiplication Algorithm Using an Array Processor","authors":"De-Lei Lee, M. A. Aboelaze","doi":"10.1109/DMCC.1991.633203","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633203","url":null,"abstract":"Winogradi’s matrix multiplication algorithm halves the number of multiplication operations required of the conventional 0 ( N 3 ) matrix multiplication algoirithm by slightly increasing the number of addition operations. Such it technique can be computatiorially advantageous when the machine performing the matrix computation takes much more time for multiplication over addition operations. This is overwhelmingly the case in the massively parallel computing paradigm, where each processor is extremely simple by itself and the computing power is obtained by the use of a large number of such processors. In this paper, we describe a parallel version of Winograd’s imatrix multiplication algorithm using an array processor and show how to achieve nearly linear speedup over its sequential counterpart.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"604 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116373773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633127
Using Parallel Programming Paradigms for Structuring Programs on Distributed Memory Computers
A. W. Kwan, L. Bic
Programming paradigms have been advocated as a method of abstraction for viewing parallel algorithms. By viewing such paradigms as a method of algorithm classification, we have used paradigms as a technique for structuring certain types of algorithms on distributed memory computers, allowing for separation of computation and synchronization. The structuring technique assists the parallel programmer with synchronization, allowing the programmer to concentrate more on developing code for computation. Experiments with the compute-aggregate-broadcast paradigm indicate that such a structuring technique can be utilized for different programs, and can be efficient.
{"title":"Using Parallel Programming Paradigms for Structuring Programs on Distributed Memory Computers","authors":"A. W. Kwan, L. Bic","doi":"10.1109/DMCC.1991.633127","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633127","url":null,"abstract":"Programming paradigms have been advocated as a method of abstraction for viewing parallel algorithms. By viewing such paradigms as a method of algorithm chwijication, we have used paradigms as a technque f i r structuring certain types of algorithm on distributed memory computers, allowing f i r separation of computation and synchronization. The structuring technique assists the parallel programmer with synchronization, allowing the programmer to concentrate more on developing code f i r computatwn. Experiments with the compute-aggregate-broa&ast paradigm indicate that such a structuring technique can be utilized for diflerentprograms, andcan be efficient.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123713165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633307
A Parallel-Vector Algorithm for Solving Periodic Tridiagonal Linear Systems of Equations
T. Taha
Periodic tridiagonal linear systems of equations typically arise from discretizing second order differential equations with periodic boundary conditions. In this paper a parallel-vector algorithm is introduced to solve such systems. Implementation of the new algorithm is carried out on an Intel iPSC/2 hypercube with vector processor boards attached to each node processor. It is to be noted that this algorithm can be extended to solve other periodic banded linear systems.
{"title":"A Parallel-Vector Algorithm for Solving Periodic Tridiagonal Linear Systems of Equations","authors":"T. Taha","doi":"10.1109/DMCC.1991.633307","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633307","url":null,"abstract":"Periodic tridiagonal linear systems of equations typi- cally arise from discretizing second order differential equations with periodic boundary conditions. In this paper a parallel-vector algorithm is introduced to solve such systems. Implementation of the new algorithm is carried out on an Intel iPSC/2 hypercube with vector processor boards attached to each node processor. It is to be noted that t his algorithm can be extended to solve other periodic banded linear systems.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124865387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633158
Helmholtz Finite Elements Performance On Mark III and Intel iPSC/860 Hypercubes
J. Parker, T. Cwik, R. Ferraro, P. Liewer, P. Lyster, J. Patterson
The large distributed memory capacities of hypercube computers are exploited by a finite element application which computes the scattered electromagnetic field from heterogeneous objects with size large compared to a wavelength. Such problems scale well with hypercube dimension for large objects: by using the Recursive Inertial Partitioning algorithm and an iterative solver, the work done by each processor is nearly equal and communication overhead for the system set-up and solution is low. The application has been integrated into a user-friendly environment on a graphics workstation in a local area network including hypercube host machines. Users need never know their solutions are obtained via a parallel computer. Scaling is shown by computing solutions for a series of models which double the number of variables for each increment of hypercube dimension. Timings are compared for the JPL/Caltech Mark IIIfp hypercube and the Intel iPSC/860 hypercube. Acceptable quality of solutions is obtained for object domains of hundreds of square wavelengths and resulting sparse matrix systems with on the order of 100,000 complex unknowns.
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633174
Efficient All-to-All Communication Patterns in Hypercube and Mesh Topologies
D. Scott
Some application programs on distributed memory parallel computers occasionally require an "all-to-all" communication pattern, where each compute node must send a distinct message to each other compute node. Assuming that each node can send and receive only one message at a time, the all-to-all pattern must be implemented as a sequence of phases in which certain nodes send and receive messages. If there are p compute nodes, then at least p-1 phases are needed to complete the operation. A proof of a schedule achieving this lower bound on a circuit switched hypercube with fixed routing is given. This lower bound cannot be achieved on a 2 dimensional mesh. On an a x a mesh, a^3/4 is shown to be a lower bound and a schedule with this number of phases is given. Whether hypercubes or meshes are better for this algorithm depends on the relative bandwidths of the communication channels.
{"title":"Efficient All-to-All Communication Patterns in Hypercube and Mesh Topologies","authors":"D. Scott","doi":"10.1109/DMCC.1991.633174","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633174","url":null,"abstract":"Some application programs on distributed memory parallel computers occasionally require an \"all-to-all\" communication pattern, where each compute node must send a distinct message to each other compute node. Assuming that each node can send and receive only one message at a t ime, the all-to-all pattern must be implemented as a sequence of phases in which certain nodes send and receive messages. r f there are p compute nodes, then at least p-1 phases are needed to complete the operation. A proof of a schedule achieving this lower bound on a circuit switched hypercube with fuced routing is given. This lower bound cannot be achieved on a 2 dimensional mesh. On an axa mesh, dl4 is shown to be a lower bound and a schedule with this number of phases is given. Whether hypercubes or meshes are better for this algorithm depends on the relative bandwidths of the communication channels.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132329335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633200
Implementing the Perfect ARC2D Benchmark on the BBN TC2000 Parallel Supercomputer
S. Breit
The TC2000 is a MIMD parallel processor with memory that is physically distributed but logically shared. Interprocessor communication, and therefore access to shared memory, is sufficiently fast that most applications can be ported to the TC2000 without rewriting the code from scratch. This paper shows how this was done for the Perfect ARC2D benchmark. The code was first restructured by changing the order of subroutine calls so that interprocessor communication would be reduced to the equivalent of three full transposes of the data per iteration. The parallel implementation was then completed by inserting shared data declarations and parallel extensions provided by the TC2000 Fortran language. This approach was easier to implement than a domain decomposition technique, but requires more interprocessor communication. It is feasible only because of the TC2000's high-speed interprocessor communication network. References to shared memory take about 25% of the total execution time for the parallel version of ARC2D, an acceptable amount considering the code did not have to be completely rewritten. High parallel efficiency was obtained using up
{"title":"Implementing the Perfect ARC2D Benchmark on the BBN TC2000 Parallel Supercomputer","authors":"S. Breit","doi":"10.1109/DMCC.1991.633200","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633200","url":null,"abstract":"The TC.2000 is a MIMD parallel processor wi,th memory that is physically distributed memory, but logically shared. Interprocessor covnmunication, and therefore access to shared memory, is sufficiently fast that most applications can be ported to the TC.2000 without rewriting the code from scratch. This paper shows how this was done for the Perfect ARC'2D benchmark. The code was first restructured by changing the order of subroutine calls so that interprocessor communication would be reduced to the equivalent of three full transposes ofthe data per iteration. The parallel implementation was then completed by inserting shared data declarations and parallel extensions provided by the TC.2000 Fortran language. Thi:F approach was easier to implement than a domain decomposition technique, but requires more interprocessor communication. It is feasible only (because of the TC.2000'~ highspeed interprocessor communications network. References to shared memory take about 25% of the totai execution time for the parallel version of ARC2D. an acceptable amount considering the code did not have to be completely rewritten. High parallel efficiency was obtained using up","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130116534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-04-28 DOI: 10.1109/DMCC.1991.633170
Software Engineering Aspects of the ProSolver-SES Skyline Solver
E. Castro-Leon, M. L. Barton, E. Kushner
The ProSolver-SES software is one of the direct equation solvers available for the iPSC/860. It uses skyline storage of matrix elements, and is applicable to linear systems that do not require pivoting. The product is available as a library that includes additional operations to support Finite Element Method applications. This paper discusses the software architecture and some of the high performance algorithms.
{"title":"Software En ineering Aspects of the ProSolver -SES Skyline Solver","authors":"E. Castro-Leon, M. L. Barton, E. Kushner","doi":"10.1109/DMCC.1991.633170","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633170","url":null,"abstract":"The Prosolver-SE:? software i s one of the direct equation solvers available for the iPSC@16160. It uses skyline storage of matrix elements, and is applicable to linear systems that do not require pivoting. The product is available as a library thzt includes additional' operations to support Finite Element Method applications. This paper discusses the software architecture and some of the high performance algorithms.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130494114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}