Data partitioning schemes for the parallel implementation of the revised simplex algorithm for LP problems
Usha Sridhar, A. Basu
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262910
The parallel implementation of the revised simplex algorithm (RSA) using eta-factorization holds the promise of significant improvement in execution time, owing to the high degree of parallelism available within an iteration of the algorithm. However, the scheme employed to partition key data structures on a distributed memory parallel processor has a great impact on the achievable performance. The paper explores the trade-offs between block-row and block-column partitioning schemes for the matrix of constraint coefficients vis-a-vis the communication overheads and the granularity of the parallel computations. The results of an approximate analysis of the compute-communication balance are compared with measurements from a practical implementation of both partitioning schemes on C-DAC's PARAM 8000 distributed memory parallel processor.
A load balancing strategy for prioritized execution of tasks
Amitabh Sinha, L. Kalé
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262887
Load balancing is a critical factor in achieving optimal performance in parallel applications where tasks are created dynamically. In many computations, such as state-space search problems, tasks have priorities, and solutions may be found more efficiently if these priorities are adhered to during parallel execution. For such tasks, a load balancing scheme that only seeks to equalize load, without also spreading high-priority tasks over the entire system, can concentrate the high-priority tasks on a few processors (even when the load is balanced), leaving the remaining processors to execute low-priority work. In such situations a scheme is needed that balances both load and high-priority tasks over the system. The authors describe the development of a more efficient prioritized load balancing strategy.
Scheduling a computational dag on a parallel system with communication delays and replication of node execution
P. Markenscoff, Yong Yuan Li
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262865
The authors consider the problem of optimally scheduling the subtasks of a computational task modeled by a dag (directed acyclic graph) on parallel systems with identical processors. Execution of the subtasks (nodes) must satisfy precedence constraints, which are met via data exchanges among processors that introduce communication delays. The optimization criterion is minimization of the processing time; the authors assume that there is no restriction on the number of processors and that a node may be replicated. They prove that the optimal scheduling problem can be solved in polynomial time when the computational graph is a two-level dag. For a general dag (the problem is NP-complete) they develop an algorithm that significantly reduces the search space relative to exhaustive search and runs quickly in many cases.
A trip-based multicasting model for wormhole-routed networks with virtual channels
Y. Tseng, D. Panda, T. Lai
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262893
This paper considers the single-source and multi-source multicasting problem in wormhole-routed networks. A general trip-based model is proposed for any network having at least two virtual channels per physical channel. The underlying concept of this model is a node sequence called a skirt, which always exists in graphs of any topology. The strength of this model is demonstrated by its capabilities: (a) the resulting routing scheme is simple, adaptive, distributed, and deadlock-free; (b) the model is independent of the network topology, regular or irregular; (c) the minimum number of virtual channels required remains constant as the network grows in size; and (d) it tolerates faults easily. Using two virtual channels per physical channel, it is shown how to construct a single trip in faulty hypercubes and multiple trips in fault-free meshes. Simulation results indicate the potential of the model to tolerate faults with very little performance degradation and to reduce multicast latency with multiple trips.
Optimal broadcasting in binary de Bruijn networks and hyper-deBruijn networks
E. Ganesan, D. Pradhan
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262803
The order-(m, n) hyper-deBruijn graph HD(m, n) is the direct product of an order-m hypercube and an order-n deBruijn graph. The hyper-deBruijn graph offers flexibility in the number of connections per node and in the level of fault tolerance. These networks also possess logarithmic diameter, have simple routing algorithms, support many computationally important subgraphs, and admit efficient implementation. The authors present an asymptotically optimal one-to-all (OTA) broadcasting scheme for these networks, assuming packet-switched routing and concurrent communication on all ports. The product structure of the hyper-deBruijn graphs is exploited to construct an optimal number of edge-disjoint spanning trees to achieve this. As an intermediate result, they also present a technique to construct an optimal number of spanning trees, with heights bounded by the diameter, in binary deBruijn graphs. This result is used to obtain the fastest OTA broadcasting scheme for binary deBruijn networks. The recent renewed interest in binary deBruijn networks makes this result valuable.
Simulating interconnection networks in RAW
W. Ligon, U. Ramachandran
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262883
The authors investigate the relationship between application program characteristics and interconnection network (ICN) performance using an execution-driven simulation testbed: the reconfigurable architecture workbench (RAW). RAW simulates a wide variety of parallel architectures, including both fine- and coarse-grain machines; SIMD, MIMD, and hybrid machines; and a wide variety of ICNs. They present RAW's network model, the structure of RAW's network simulator, a model for k-ary n-cube networks (currently popular in the literature), and the results of experiments using the simulator. Their results show that application program characteristics can have a profound effect on network performance, an observation that points out the benefits of studying interconnection networks in the context of overall application performance.
Computing convolutions on mesh-like structures
O. Schwarzkopf
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262796
Although the computation of two-dimensional convolutions is one of the basic computational tools for the processing of digitized images, and although it is well known that convolutions can be computed efficiently in the sequential setting with the aid of Fourier transforms, previous work on the parallel computation of convolutions has not been based on Fourier transforms. This is probably because the fast Fourier transform cannot be implemented efficiently on simple structures such as the mesh, the mesh with broadcasting, the mesh of trees, or the pyramid computer. It is shown that it nevertheless makes sense to use the Fourier transform on such simpler structures, obtaining nearly optimal algorithms for the computation of convolutions on the parallel structures listed above. As an application, an algorithm is given that computes the digitized configuration space of a robot restricted to translation in the plane.
Sorting n^2 numbers on n*n meshes
M. Nigam, S. Sahni
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262858
The authors show that by folding data from an n*n mesh onto an n*(n/k) submesh, sorting on the submesh, and finally unfolding back onto the entire n*n mesh, it is possible to sort on bidirectional and strict unidirectional meshes using a number of routing steps that is very close to the distance lower bound for these architectures. The technique may also be applied to reconfigurable bus architectures to obtain faster sorting algorithms.
On simulations of linear arrays, rings and 2D meshes on Fibonacci cube networks
B. Cong, S. Zheng, S. Sharma
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262788
The Fibonacci cube was proposed recently as an interconnection network. It has been shown that this network topology possesses many interesting properties that are important in network design and applications. This paper addresses the following network simulation problem: given a linear array, a ring, or a two-dimensional mesh, how can we assign its nodes to the nodes of the Fibonacci cube so as to keep adjacent nodes near each other in the Fibonacci cube? The authors first show the simple fact that every Fibonacci cube contains a Hamiltonian path. They prove that any ring can be embedded into its corresponding optimum Fibonacci cube (the smallest Fibonacci cube with at least as many nodes as the ring) with dilation 2, which is optimal in most cases. They then describe dilation-1 embeddings of a class of meshes into their corresponding optimum Fibonacci cubes. Finally, it is shown that an arbitrary mesh can be embedded into its corresponding optimum or near-optimum Fibonacci cube with dilation 2.
A parallel Prolog execution model: theoretical approach and experimental results
J. Bodeveix, Érick Bizouarn
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262849
This paper presents a parallel all-solution extension of Prolog integrating AND parallelism and a restricted form of OR parallelism, both explicitly declared by the user. Parallel sub-goals may share variables and incrementally communicate partially instantiated terms via their common variables, thus allowing stream AND parallelism. Furthermore, the communication direction does not need to be declared by the user or deduced by a static analysis. The resolution model is detailed and its completeness proven. The authors describe a transputer network implementation.