The Parallel Asynchronous Recursion model
L. Higham, Eric Schenk
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242729

The authors introduce and evaluate a new model of parallel computation, the parallel asynchronous recursion (PAR) model. The model offers distinct advantages to both the program designer and the parallel machine architect while avoiding some of the parallel random-access machine's (PRAM's) shortcomings. The PAR model can be thought of as a procedural programming language augmented with a process control structure that can, in parallel, recursively fork independent processes and merge their results. Its distinguishing feature is its memory semantics, which differ substantially from both global and distributed memory models; it provides a level of abstraction that removes the burden of explicit processor scheduling and synchronization. Efficient simulations of the PAR model on well-established models confirm that its advantages can be obtained at reasonable cost.
{"title":"The Parallel Asynchronous Recursion model","authors":"L. Higham, Eric Schenk","doi":"10.1109/SPDP.1992.242729","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242729","url":null,"abstract":"The authors introduce and evaluate a novel model of parallel computation, called the parallel asynchronous recursion (PAR) model. This model offers distinct advantages to the program designer and the parallel machine architect, while avoiding some of the parallel random-access machine; (PRAM's) shortcomings. The PAR model can be thought of as a procedural programming language augmented with a process control structure that can, in parallel, recursively fork independent processes and merge their results. The unique aspect of the PAR model lies in its memory semantics, which differ substantially from both global and distributed memory models. It provides a high level of abstraction that removes the tasks of explicit processor scheduling and synchronization. Efficient simulations of the PAR model on well-established models confirm that the PAR model's advantages can be obtained at a reasonable cost.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121839959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A new framework for designing parallel algorithms on series parallel graphs
Yuval Caspi, E. Dekel
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242748

The authors propose a new framework for designing efficient parallel algorithms on series parallel graphs. Recently, D. Eppstein presented an approach for recognizing series parallel graphs, exploring characterizations of the ear decomposition of such graphs that can be found efficiently in parallel. The authors extend Eppstein's results and show, in a unified manner, how to solve problems on series parallel graphs efficiently in parallel by finding a special ear decomposition of the graph. They demonstrate the utility of the framework by presenting O(log n) concurrent-read exclusive-write (CREW) parallel random access machine (PRAM) algorithms for constructing a depth-first spanning tree, an st-numbering, and a breadth-first spanning tree on series parallel graphs.
{"title":"A new framework for designing parallel algorithms on series parallel graphs","authors":"Yuval Caspi, E. Dekel","doi":"10.1109/SPDP.1992.242748","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242748","url":null,"abstract":"The authors propose a novel framework for designing efficient parallel algorithms on series parallel graphs. Recently, a novel approach for recognizing series parallel graphs was presented by D. Eppstein. Eppstein explored characterizations of the ear decomposition of series parallel graphs, which can be identified efficiently, in parallel. The authors extend Eppstein's results and show in a unified manner how to solve problems on series parallel graphs efficiently, in parallel, by finding a special ear decomposition of the graph. They demonstrate the utility of their novel framework by presenting O(log n) concurrent read exclusive write (CREW) parallel random access machine (PRAM) algorithms for the construction of a depth-first spanning tree, st-numbering, and a breadth-first spanning tree on series parallel graphs.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127854094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

SMALL: a scalable multithreaded architecture to exploit large locality
R. Govindarajan, S. Nemawarkar
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242766

The authors propose a multithreaded architecture that performs synchronization efficiently through a layered approach, exploits greater locality by using large, resident activations, and reduces the number of load stalls with a novel high-speed buffer organization. The performance of the proposed architecture is evaluated using deterministic discrete-event simulation. Initial simulation results indicate that the architecture can achieve high performance in terms of both speedup and processor utilization.
{"title":"SMALL: a scalable multithreaded architecture to exploit large locality","authors":"R. Govindarajan, S. Nemawarkar","doi":"10.1109/SPDP.1992.242766","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242766","url":null,"abstract":"The authors propose a multithreaded architecture that performs synchronization efficiently by following a layered approach, exploits larger locality by using large, resident activations, and reduces the number of load stalls with the help of a novel high-speed buffer organization. The performance of the proposed architecture is evaluated using deterministic discrete-event simulation. Initial simulation results indicate that the architecture can achieve high performance in terms of both speedup and processor utilization.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115549440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Cache coherent shared memory hypercube multiprocessors
J. Ding, L. Bhuyan
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242701

The authors examine the feasibility of building cache coherent shared memory multiprocessor systems on hypercubes. Various shared memory schemes are investigated and compared. The schemes considered are based on memory coherence algorithms for distributed shared memory and on cache coherence protocols for other shared memory architectures. It is concluded that efficient cache coherent architectures can be built using hypercubes.
{"title":"Cache coherent shared memory hypercube multiprocessors","authors":"J. Ding, L. Bhuyan","doi":"10.1109/SPDP.1992.242701","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242701","url":null,"abstract":"The authors examine the feasibility of building cache coherent shared memory multiprocessor systems on hypercube. Various shared memory schemes are investigated and compared with each other. The schemes considered are based on memory coherence algorithms for distributed shared memory and cache coherence protocols for other shared memory architectures. It is concluded that efficient cache coherent architectures can be built using hypercubes.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115471267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Memory architecture support for the SIMD construction of a Gaussian pyramid
Jong Won Park, D. Harper
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242711

A memory system is introduced for the efficient construction of a Gaussian pyramid. The memory system consists of an address calculating circuit, an address routing circuit, a data routing circuit, a memory module selection circuit, and 2^n + 1 memory modules. It provides parallel access to 2^n image points arranged as a block, a row, or a column, where the interval of the block or column is one and the interval of the row is one or two. The performance of a generic SIMD (single-instruction multiple-data) processor using the proposed memory system is compared with that of one using an interleaved memory system for the recursive construction of a Gaussian pyramid.
{"title":"Memory architecture support for the SIMD construction of a Gaussian pyramid","authors":"Jong Won Park, D. Harper","doi":"10.1109/SPDP.1992.242711","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242711","url":null,"abstract":"A memory system is introduced for the efficient construction of a Gaussian pyramid. The memory system consists of an address calculating circuit, an address routing circuit, a data routing circuit, a memory module selection circuit, and 2/sup n/+1 memory modules. The memory system provides parallel access to 2/sup n/ image points whose patterns are a block, a row, or a column, where the interval of the block or column is one and the interval of the row is one or two. The performance of a generic SIMD (single-instruction multiple-data) processor using the proposed memory system is compared with that of one using an interleaved memory system for the recursive construction of a Gaussian pyramid.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115886129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Parallel image sequence coding on multiprocessor systems
S. Rampal, D. Agrawal
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242759

The authors introduce dictionary-based image sequence coding (DISC) as a new approach to compressing image sequence data. The DISC algorithm adapts textual data compression techniques to image sequences. The algorithm is well suited to parallel implementation on standard configurations such as the rectangular mesh and the hypercube. For N×N images, the authors present SIMD (single-instruction multiple-data) algorithms with time complexities of approximately Θ(DN) on the mesh and Θ(D log N + log² N) on the hypercube, where D is proportional to the dictionary size. The DISC approach has the additional advantage of involving essentially only simple data movement and lookup operations. Simulation results indicate that moderate to high compression ratios can be achieved along with good visual fidelity and reconstruction quality.
{"title":"Parallel image sequence coding on multiprocessor systems","authors":"S. Rampal, D. Agrawal","doi":"10.1109/SPDP.1992.242759","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242759","url":null,"abstract":"The authors introduce dictionary-based image sequence coding (DISC) as a new approach to the problem of compression of image sequence data. The DISC algorithm is an adaptation of textual data compression techniques for image sequence data. The algorithm is extremely well suited for parallel implementation on standard configurations such as the rectangular mesh and the hypercube. For N*N images, the authors present SIMD (single-instruction multiple-data) algorithms of time complexities approximately theta (DN) for the mesh and theta (D log N+log/sup 2/ N) for the hypercube (D is proportional to dictionary size). The DISC approach has the additional advantage of involving essentially only simple data movement and lookup operations. Simulation results indicate that moderate to high compression ratios can be achieved along with good visual fidelity and quality of reconstruction.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123693186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

On uniformization of affine dependence algorithms
Weijia Shang, E. Hodzic, Zhigang Chen
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242753

The authors consider the problem of transforming irregular data dependence structures of algorithms with nested loops into more regular ones. The algorithms under consideration are n-dimensional (algorithms with n nested loops) with affine dependences, i.e., dependences that are linear functions of the loop index variables. Methods are proposed to transform these algorithms into uniform dependence algorithms, in which the dependences are constant vectors independent of the index variables. Some parallelism may be lost in making the dependences uniform. The parallelism preserved by uniformization is measured by (1) the total execution time under the optimal linear schedule, which assigns each computation an execution time given by a linear function of the computation's index, and (2) the size of the cone spanned by the dependence vectors after uniformization. The objective is to maximize the parallelism preserved by uniformization or to minimize the number of dependences after uniformization.
{"title":"On uniformization of affine dependence algorithms","authors":"Weijia Shang, E. Hodzic, Zhigang Chen","doi":"10.1109/SPDP.1992.242753","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242753","url":null,"abstract":"The authors consider the problem of transforming irregular data dependence structures of algorithms with nested loops into more regular ones. Algorithms under consideration are n-dimensional algorithms (algorithms with n nested loops) with affine dependences where dependences are linear functions of index variables of the loop. Methods are proposed to transform these algorithms into uniform dependence algorithms where dependences are independent of the index variables (constant). Some parallelism might be lost due to making them uniform. The parallelism preserved by the uniformity is measured by (1) the total execution time by the optimal linear schedule which assigns each computation in the algorithm an execution time according to a linear function of the index of the computation and (2) the size of the cone spanned by the dependence vectors after achieving uniformity. The objective of making the dependence uniform is to maximize parallelism preserved by the uniformity or to minimize the number of dependences after uniformity.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"40 23","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113936944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

The Dharma scheduler-definitive scheduling in Aurora on multiprocessors architecture
Raéd Yousef Sindaha
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242731

In any Or-parallel system that implements the full Prolog language, such as Aurora, processing time is wasted in regions of the search tree that are later pruned away. The author proposes the Dharma scheduler, which introduces a new scheduling concept for Aurora: rather than scheduling on the nodes of the search tree, it schedules on the branches of the tree. The author argues that scheduling at this higher level of abstraction has a number of advantages and makes it possible to tackle the problem of wasted speculative work. Early performance results suggest that the Dharma scheduler is faster than any other existing scheduler for Aurora in applications where only the first solution is required. The author presents the design of the Dharma scheduler along with performance results.
{"title":"The Dharma scheduler-definitive scheduling in Aurora on multiprocessors architecture","authors":"Raéd Yousef Sindaha","doi":"10.1109/SPDP.1992.242731","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242731","url":null,"abstract":"In any Or-parallel system which implements the full Prolog language, such as Aurora, there is the problem of processing time being wasted in regions of the search tree which are later pruned away. The author proposes the Dharma scheduler, which introduces a new concept in scheduling for Aurora. Rather than performing scheduling based on the nodes in the search tree, the Dharma scheduler uses the branches of the tree. The author believes that scheduling at this higher level of abstraction has a number of advantages and will make it possible to tackle the problem of wasted speculative work. Early performance results suggest that the Dharma scheduler is faster than any other existing scheduler for Aurora in applications where only the first solution is required. The author presents the design of the Dharma scheduler and performance results.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124301558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Two system state calculation algorithms for optimal load balancing
A. Winckler
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242735

The author introduces and explains two algorithms, OFCup and OFCdown, that calculate the global state of a decentralized distributed system from easily obtained measurements, in order to facilitate cooperative optimal load balancing without a central job dispatcher. The required information is exchanged over the communication protocol of a receiver-initiated load balancing policy and induces no additional message transmission overhead. The author presents and interprets simulation measurements. These studies show that systems applying either of the OFC algorithms perform significantly better than a no-information policy called 'random routing' and incur only a little additional waiting time compared to the M/D/n model. This holds even when transmission times are high relative to the mean time between system state changes. Both algorithms perform equally well under normal conditions, with OFCdown showing better variance, but OFCdown degrades significantly more than OFCup if the not-accept counter is not incremented at the expected time.
{"title":"Two system state calculation algorithms for optimal load balancing","authors":"A. Winckler","doi":"10.1109/SPDP.1992.242735","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242735","url":null,"abstract":"The author introduces and explains two algorithms, OFCup and OFCdown, allowing one to calculate the global state of a decentralized distributed system by interpreting measurements that are easy to obtain to facilitate cooperative optimal load balancing without a central job dispatcher. The information required is exchanged using the communication protocol of a receiver-initiated load balancing policy and does not induce any additional message transmission overhead. The author presents and interprets measurements from simulation. These studies show that the performance of systems applying any of the OFCx algorithms is significantly better than a no-information policy called 'random routing' and induces only little additional waiting time compared to the M/D/n model. This is true even for high transmission times relative to the mean time between system state changes. Both algorithms are shown to perform equally well under normal conditions with better variance values of OFC-down, but the degradation of OFCdown is significantly worse than that of OFCup, if the not-accept-counter is not incremented at the time expected.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122541096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Mapping tree-structured computations onto mesh-connected arrays of processors
Jyh-Jong Tsay
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242760

The author shows how to parallelize tree-structured computations on d-dimensional (d ≥ 1) mesh-connected arrays of processors. A tree-structured computation consists of n computational tasks whose dependencies form a task tree T of n constant-degree nodes; each task executes in unit time and sends one value to its parent task once executed. The author presents linear time algorithms for partitioning and mapping the task tree T onto a p^{1/d} × ... × p^{1/d} mesh-connected array of processors so that the processors can be scheduled to perform the computation in O(n/p) time, for p ≤ min(n/h, n^{d/(d+1)}), where h is the height of the task tree. It is further shown that a p^{1/d} × ... × p^{1/d} mesh can be scheduled to evaluate an n-node expression tree of associative operators in optimal O(n/p) time, for p ≤ n^{d/(d+1)}.
{"title":"Mapping tree-structured computations onto mesh-connected arrays of processors","authors":"Jyh-Jong Tsay","doi":"10.1109/SPDP.1992.242760","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242760","url":null,"abstract":"The author shows how to parallelize tree-structured computations for d-dimensional (d>or=1) mesh-connected arrays of processors. A tree-structured computation T consists of n computational tasks whose dependencies form a task tree T of n constant degree nodes. Each task can be executed in unit time and sends one value to its parent task after it has been executed. The author presents linear time algorithms for partitioning and mapping the task tree T onto a p/sup 1/d/*. . .*p/sup 1/d/ mesh-connected array of processors so that one can schedule the processors to perform computation T in O(n/p) time, for p<or= min(n/h, n/sup d/(d+1)/). It is shown that one can schedule a p/sup 1/d/ * . .* p/sup 1/d/ mesh to evaluate an n-node expression tree of associative operators in O(n/p) optimal time, for p<or= n/sup d/(d+1)/.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124814635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}