Pub Date: 1993-04-13; DOI: 10.1109/IPPS.1993.262807
K. Chandy
This paper explores the questions: Is writing correct parallel programs harder than writing correct sequential programs? If so, why? What can be done to help in developing reliable parallel programs?
Title: Writing correct parallel programs. Venue: [1993] Proceedings Seventh International Parallel Processing Symposium.
Pub Date: 1993-04-13; DOI: 10.1109/IPPS.1993.262856
E. Sha, K. Steiglitz
The authors present an on-line distributed reconfiguration algorithm for finding a new maximum matching incrementally after some nodes have failed. Their algorithm is deadlock-free, and with $k$ failures maintains at least $M-k$ matching pairs during the reconfiguration process, where $M$ is the size of the original maximum matching. The algorithm tolerates failures that occur during reconfiguration. The worst-case reconfiguration time is $O(k \min(|A|, |B|))$ after $k$ failures, where $A$ and $B$ are the node sets, but simulations show that the average-case reconfiguration time is much better. The algorithm is also simple enough to be implemented in hardware.
Title: Maintaining bipartite matchings in the presence of failures. Venue: [1993] Proceedings Seventh International Parallel Processing Symposium.
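The incremental idea in the Sha and Steiglitz abstract can be illustrated with a small sequential sketch. It is not the authors' distributed algorithm: it only shows how a matching can be repaired with a single augmenting-path search from the node that lost its partner, assuming the failed node is on the B side. The function names `try_augment` and `repair_after_b_failure` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Set


def try_augment(u: str, adj: Dict[str, List[str]], match_b: Dict[str, str],
                alive_b: Set[str], seen: Set[str]) -> bool:
    """Search for an augmenting path that starts at the free A-node u."""
    for v in adj[u]:
        if v not in alive_b or v in seen:
            continue
        seen.add(v)
        # v is free, or its current partner can be moved to another B-node.
        if v not in match_b or try_augment(match_b[v], adj, match_b, alive_b, seen):
            match_b[v] = u
            return True
    return False


def repair_after_b_failure(adj: Dict[str, List[str]], match_b: Dict[str, str],
                           alive_b: Set[str], failed_b: str) -> Dict[str, str]:
    """Drop a failed B-node and try to re-match its former partner (sequential sketch)."""
    alive_b.discard(failed_b)
    orphan = match_b.pop(failed_b, None)  # the matching shrinks by at most one pair
    if orphan is not None:
        try_augment(orphan, adj, match_b, alive_b, set())
    return match_b


# Hypothetical example: a1-b1 and a2-b2 are matched, then b1 fails.
adj = {"a1": ["b1", "b2"], "a2": ["b2", "b3"]}
match_b = {"b1": "a1", "b2": "a2"}
repaired = repair_after_b_failure(adj, match_b, {"b1", "b2", "b3"}, "b1")
```

In this example the orphaned node a1 takes over b2 and a2 moves to b3, so the matching returns to its original size after one failure.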
Pub Date: 1993-04-13; DOI: 10.1109/IPPS.1993.262822
Q. Malluhi, M. Bayoumi, T. Rao
The paper explores the hierarchical hypercube (HHC) interconnection network, suitable for building massively parallel systems with thousands of processors. HHC is self-embedded, that is, an HHC can embed HHCs of lower dimensions. In addition, HHC is a communication-efficient architecture. Two algorithms for data communication in the HHC are presented. The first algorithm is for one-to-one transfer and the second is for one-to-all broadcasting. Both algorithms take $O(\log k)$ time, where $k$ is the total number of processors in the system. Moreover, the paper shows that the HHC VLSI layout has a relatively small area of $O((\log \log k) \cdot k^2 / \log k)$.
Title: On the hierarchical hypercube interconnection network. Venue: [1993] Proceedings Seventh International Parallel Processing Symposium.
Pub Date: 1993-04-13; DOI: 10.1109/IPPS.1993.262816
H. Alnuweiri
This paper presents constant-time algorithms for labeling the connected components of images on a network of processors with a wide reconfigurable bus. The algorithms are based on a processor indexing scheme which employs constant-weight codes. The use of such codes enables identifying a single representative processor for each component in a constant number of steps. The proposed algorithms can label an $N \times N$ image or an $N$-vertex graph in $O(1)$ time using $\Theta(N^2)$ processors, which is optimal. Furthermore, the proposed techniques lead to $O(\log N / \log \log N)$-time labeling algorithms on a network of $N^2$ processors with a reconfigurable bus of width $O(\log N)$ bits.
Title: Fast algorithms for image labeling on a reconfigurable network of processors. Venue: [1993] Proceedings Seventh International Parallel Processing Symposium.
Pub Date: 1993-04-13; DOI: 10.1109/IPPS.1993.262800
I. Yen, R. Dubash, F. Bastani
Lee's (1961) maze-routing algorithm has been a popular method for routing wires in VLSI circuits. It can also be applied to a variety of other problems, such as robot path planning. Although the algorithm is simple and easy to implement, its computation time can be quite high. Therefore, it is a very attractive candidate for implementation on parallel systems. The major issue in parallelizing this algorithm is mapping the grid space of the problem to the processor space. The communication cost and processor utilization can be greatly affected by the mapping strategy used. Won and Sahni (1987) have studied a class of mapping strategies for Lee's algorithm and analyzed their performance. The authors propose two new mapping strategies. First, they modify Won and Sahni's mapping algorithm by using the concept of mirror images to allow higher processor utilization while reducing the number of boundary cells. The new algorithm is shown to be better than the original one in an obstacle-free grid space. Then, they propose a dynamic mapping algorithm. This new mapping algorithm is shown to give an optimal mapping in an obstacle-free grid space. They also performed simulations to study the relative performance of these mapping algorithms for grid spaces with obstacles. The results show that the new algorithms are substantially faster than the earlier ones.
Title: Strategies for mapping Lee's maze routing algorithm onto parallel architectures. Venue: [1993] Proceedings Seventh International Parallel Processing Symposium.
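For readers unfamiliar with Lee's algorithm, the following is a minimal sequential sketch of its two phases: breadth-first wave expansion from the source, then backtracing from the target along decreasing distance labels. It does not model the parallel mapping strategies studied in the paper; the function name `lee_route` and the grid encoding (1 for an obstacle, 0 for a free cell) are assumptions made for illustration.

```python
from collections import deque
from typing import List, Optional, Tuple

Cell = Tuple[int, int]
STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def lee_route(grid: List[List[int]], src: Cell, dst: Cell) -> Optional[List[Cell]]:
    """Return a shortest 4-connected path from src to dst, or None if blocked."""
    rows, cols = len(grid), len(grid[0])
    dist = [[-1] * cols for _ in range(rows)]  # -1 means "not reached yet"
    dist[src[0]][src[1]] = 0
    frontier = deque([src])

    # Phase 1: wave expansion. Each cell is labeled with its distance from src.
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == dst:
            break
        for dr, dc in STEPS:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and dist[nr][nc] == -1):
                dist[nr][nc] = dist[r][c] + 1
                frontier.append((nr, nc))

    if dist[dst[0]][dst[1]] == -1:
        return None

    # Phase 2: backtrace. Step to any neighbor whose label is one smaller.
    path = [dst]
    r, c = dst
    while (r, c) != src:
        for dr, dc in STEPS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] == dist[r][c] - 1:
                r, c = nr, nc
                break
        path.append((r, c))
    path.reverse()
    return path


# Hypothetical 3x3 grid with a wall that forces a detour around the right side.
demo_grid = [[0, 0, 0],
             [1, 1, 0],
             [0, 0, 0]]
demo_path = lee_route(demo_grid, (0, 0), (2, 0))
```

The wave-expansion phase touches every reachable cell, which is why the abstract describes the computation time as high and why the mapping of grid cells to processors matters.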
Pub Date: 1993-04-13; DOI: 10.1109/IPPS.1993.262795
R. Sastry, N. Ranganathan, R. Jain
Depth recovery from grey-scale images is an important topic in the field of computer and robot vision. Intensity gradient analysis (IGA) is a robust technique for inferring depth information from a sequence of images acquired by a sensor undergoing translational motion. IGA obviates the need for explicitly solving the correspondence problem and hence is an efficient technique for range estimation. IGA is nevertheless computationally intensive, and special-purpose hardware could significantly speed up its computations. The authors propose two VLSI architectures for high-speed range estimation based on IGA. The architectures fully utilize the principles of pipelining and parallelism in order to obtain high speed and throughput. The designs are conceptually simple and suitable for implementation in VLSI.
Title: VLSI architectures for depth estimation using intensity gradient analysis. Venue: [1993] Proceedings Seventh International Parallel Processing Symposium.
Pub Date: 1993-04-13; DOI: 10.1109/IPPS.1993.262863
G. Megson
The production of regular computations using algorithmic engineering techniques is beginning to play an important role in the synthesis of massively parallel and VLSI processor arrays. The author widens the class of algorithms that can be formally synthesized by introducing a mapping theorem for a class of algorithms with run-time dependencies. The technique is illustrated by deriving uniform recurrences for the so-called knapsack problem; the resulting systolic array is known to be optimal.
Title: Mapping a class of run-time dependencies onto regular arrays. Venue: [1993] Proceedings Seventh International Parallel Processing Symposium.
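To make the phrase "run-time dependencies" concrete, here is a sketch of the textbook 0/1 knapsack recurrence. The abstract does not say which knapsack variant the paper uses, so this is an assumption, and the code does not reproduce the paper's uniform recurrences or its systolic array. The point to notice is that cell `(i, c)` reads cell `(i-1, c - w)`, where the offset `w` is an input value known only at run time. The function name `knapsack` is hypothetical.

```python
from typing import List


def knapsack(weights: List[int], values: List[int], capacity: int) -> int:
    """Best total value for the 0/1 knapsack problem (textbook recurrence)."""
    n = len(weights)
    # best[i][c] = best value using the first i items within capacity c.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (w, v) in enumerate(zip(weights, values), start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip item i
            if c >= w:
                # Data-dependent access: the column offset w comes from the input.
                best[i][c] = max(best[i][c], best[i - 1][c - w] + v)
    return best[n][capacity]


# Hypothetical instance: the first two items together beat the third alone.
demo_value = knapsack([1, 3, 4], [15, 20, 30], 4)
```

Because the dependence distance varies with the item weights, the recurrence is not uniform as written, which is the kind of obstacle a mapping theorem for run-time dependencies has to address.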
Pub Date: 1993-04-13; DOI: 10.1109/IPPS.1993.262899
M. Goodrich, Yossi Matias, U. Vishkin
The authors address two fundamental problems in parallel algorithm design, parallel prefix sums and integer sorting, and show that both of them can be approximately solved very quickly on a randomized CRCW PRAM. In the case of prefix sums the approximation is in terms of the accuracy of the sums, and in the case of integer sorting it is in terms of allowing some gaps between consecutive elements in the ordered list. By introducing approximation in these ways the authors are able to solve these problems in $o(\lg \lg n)$ time, and thus avoid the near-logarithmic lower bounds by Beame and Hastad that hold for the exact versions of these problems. Nevertheless, they demonstrate that these approximations are strong enough to be used as subroutines in fast randomized algorithms for some well-known problems in parallel computational geometry. Perhaps the most succinct way to describe the power of the new tools is to observe that prior to this work it was known how to solve the interval allocation problem fast. The authors show how to solve the ordered version of the problem.
Title: Approximate parallel prefix computation and its applications. Venue: [1993] Proceedings Seventh International Parallel Processing Symposium.
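As a baseline for the problem the abstract refers to, the sketch below simulates an exact prefix-sum scan in synchronous rounds, in the style commonly attributed to Hillis and Steele. It is not the authors' approximate randomized CRCW PRAM algorithm; it only shows what the exact version computes and that a straightforward scan needs about $\log_2 n$ rounds. The function name `prefix_sums_in_rounds` is hypothetical.

```python
from typing import List, Tuple


def prefix_sums_in_rounds(xs: List[int]) -> Tuple[List[int], int]:
    """Inclusive prefix sums, simulated as synchronous doubling rounds."""
    out = list(xs)
    step = 1
    rounds = 0
    while step < len(out):
        # In one round every position i >= step adds the value `step` places
        # to its left. All reads use the values from the previous round.
        out = [out[i] + out[i - step] if i >= step else out[i]
               for i in range(len(out))]
        step *= 2
        rounds += 1
    return out, rounds


demo_sums, demo_rounds = prefix_sums_in_rounds([3, 1, 4, 1, 5])
```

For the five-element input this takes three rounds. The abstract's contribution is that relaxing exactness allows the round count to drop below the lower bounds that apply to exact computation.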
Pub Date: 1993-04-13; DOI: 10.1109/IPPS.1993.262817
G. Karypis, Vipin Kumar
The authors are concerned with dynamic programming (DP) algorithms whose solution is given by a recurrence relation similar to that for the matrix parenthesization problem. Guibas, Kung and Thompson (1979) presented a systolic array algorithm for this problem that uses $O(n^2)$ processing cells and solves the problem in $O(n)$ time. The authors present three different mappings of this systolic algorithm on a mesh-connected parallel computer. The first two mappings use commonly known techniques for mapping systolic arrays to mesh computers. Both of them are able to obtain only a fraction of the maximum possible performance. The primary reason for the poor performance of these formulations is that different nodes at different levels in the multistage graph in the DP formulation require different amounts of computation. Any adaptation has to take this into consideration and evenly distribute the work among the processors. The third mapping balances the work load among processors and thus is capable of providing efficiency approximately equal to 1 (i.e., speedup approximately equal to the number of processors) for any number of processors and a sufficiently large problem. They experimentally evaluate these mappings on a mesh embedded onto a 256-processor nCUBE/2.
Title: Efficient parallel mappings of a dynamic programming algorithm: a summary of results. Venue: [1993] Proceedings Seventh International Parallel Processing Symposium.
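The load imbalance mentioned in the abstract is visible in the standard sequential recurrence for matrix parenthesization. The sketch below is that textbook recurrence, not the systolic algorithm or any of the three mesh mappings. Alongside the optimal cost it records, for each diagonal of the DP table, how many cells there are and how many candidate split points each cell evaluates. The function name `matrix_chain` and the profile tuple layout are hypothetical.

```python
from typing import List, Tuple


def matrix_chain(dims: List[int]) -> Tuple[int, List[Tuple[int, int, int]]]:
    """Minimum scalar multiplications for a chain; matrix i is dims[i] x dims[i+1].

    Assumes at least one matrix, i.e. len(dims) >= 2.
    Also returns (chain_length, cells, candidates_per_cell) for each diagonal.
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    profile = []
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # Try every split point k between matrix i and matrix j.
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
        # Later diagonals have fewer cells, but each cell does more work.
        profile.append((length, n - length + 1, length - 1))
    return cost[0][n - 1], profile


# Hypothetical chain: 10x30, 30x5, 5x60.
demo_cost, demo_profile = matrix_chain([10, 30, 5, 60])
```

Cells on the diagonal for chain length $L$ each evaluate $L-1$ candidates, so assigning one cell per processor leaves processors on early diagonals with far less work than those on late ones.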
Pub Date: 1993-04-13; DOI: 10.1109/IPPS.1993.262852
M. Thapar, B. Delagi, M. Flynn
This paper presents a singly-linked distributed directory (SDD) cache coherence protocol and compares the performance of the SDD protocol with the fully mapped centralized directory protocol and the IEEE SCI Standard protocol. To maintain coherence, the SDD protocol uses a linked list of cache lines that contain shared copies of the data. The protocol has scalable cost. Coherency-related messages are not required to be delivered in order, which allows adaptive routing and makes performance more robust in congested networks. The authors' analysis shows that the SDD protocol has generally better performance in the presence of memory and interconnect contention. They discuss the various factors, such as memory reference behavior and interconnect traffic, that affect the performance of these protocols.
Title: Linked list cache coherence for scalable shared memory multiprocessors. Venue: [1993] Proceedings Seventh International Parallel Processing Symposium.
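The data structure named in the abstract, a singly linked list of caches that hold shared copies of a line, can be sketched generically as follows. This is only an illustration of the idea under simplifying assumptions: it is not the SDD protocol's actual message sequence, it handles one memory line, and it ignores message reordering and races. The class name `SharerList` and its methods are hypothetical.

```python
from typing import Dict, List, Optional


class SharerList:
    """Sharers of one memory line, kept as a singly linked list (generic sketch)."""

    def __init__(self) -> None:
        self.head: Optional[int] = None               # stored at the home node
        self.next_ptr: Dict[int, Optional[int]] = {}  # stored with each cached copy

    def read_miss(self, cache_id: int) -> None:
        """A new sharer links itself in front of the current head."""
        if cache_id in self.next_ptr:
            return  # already holds a shared copy
        self.next_ptr[cache_id] = self.head
        self.head = cache_id

    def write(self, writer_id: int) -> List[int]:
        """Walk the list, invalidating every copy except the writer's."""
        invalidated = []
        node = self.head
        while node is not None:
            following = self.next_ptr.pop(node)
            if node != writer_id:
                invalidated.append(node)
            node = following
        # The writer becomes the only holder of the line.
        self.next_ptr = {writer_id: None}
        self.head = writer_id
        return invalidated


# Hypothetical trace: caches 0, 1, 2 read the line, then cache 1 writes it.
line = SharerList()
for cache in (0, 1, 2):
    line.read_miss(cache)
demo_invalidated = line.write(1)
```

The home node stores only one pointer per line regardless of the number of sharers, which is the sense in which a linked-list directory has cost that scales with system size; the price is that invalidation walks the list serially.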