A high-level, object-oriented approach to divide-and-conquer
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242730
A. Piper, R. Prager
An object-oriented framework for the divide-and-conquer (D&C) paradigm is presented. The framework enables a D&C representation of a problem to be built up for subsequent evaluation, and this evaluation can be delayed until the maximum amount of computation that can be performed in one D&C pass has been integrated into the representation. The framework does not require a parallelizing compiler and therefore provides an environment that is flexible and easily extensible. D&C thus provides a structure suitable for parallel implementation, while object-oriented programming techniques encapsulate the D&C semantics and present a uniform interface to the end user. Results are presented for an implementation of the back-propagation algorithm.
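The build-first, evaluate-later idea can be illustrated with a minimal sketch. The class and function names below are illustrative, not the paper's actual API: a problem is first composed into a D&C representation tree, and only then evaluated in a single pass (whose sub-trees are the natural units for parallel execution).

```python
# Minimal sketch of a divide-and-conquer (D&C) representation that is built
# first and evaluated later, in the spirit of the framework described above.
# Class and function names are illustrative assumptions, not the paper's API.

class DCNode:
    """A node in the D&C representation: either a leaf problem or a split."""
    def __init__(self, problem, children=None):
        self.problem = problem      # the (sub)problem data
        self.children = children    # None for leaves, list of DCNode otherwise

def build(problem, divide, is_base):
    """Recursively build the representation without doing any real work yet."""
    if is_base(problem):
        return DCNode(problem)
    return DCNode(problem, [build(p, divide, is_base) for p in divide(problem)])

def evaluate(node, solve_base, combine):
    """One D&C pass over the representation; sub-trees could run in parallel."""
    if node.children is None:
        return solve_base(node.problem)
    return combine([evaluate(c, solve_base, combine) for c in node.children])

# Example: merge sort expressed through the generic interface.
def divide(xs):      return [xs[:len(xs) // 2], xs[len(xs) // 2:]]
def is_base(xs):     return len(xs) <= 1
def solve_base(xs):  return xs
def combine(parts):
    left, right, out = parts[0], parts[1], []
    while left and right:
        out.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    return out + left + right

tree = build([5, 2, 9, 1, 7], divide, is_base)    # representation only, no work yet
print(evaluate(tree, solve_base, combine))        # [1, 2, 5, 7, 9]
```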
{"title":"A high-level, object-oriented approach to divider-and-conquer","authors":"A. Piper, R. Prager","doi":"10.1109/SPDP.1992.242730","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242730","url":null,"abstract":"An object-oriented framework for the divide-and-conquer (D&C) paradigm is presented. The framework enables a D&C representation of a problem to be built up for subsequent evaluation. This evaluation can be delayed until the maximum amount of computation that can be performed in one D&C pass has been integrated into the representation. This framework does not require a parallelizing compiler and therefore provides an environment that is flexible and easily extensible. D&C thus provides a structure suitable for parallel implementation and object-oriented programming techniques provide a means to encapsulate the D&C semantics and provide a uniform interface to the end-user. Results are presented for an implementation of the back-propagation algorithm.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129591971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical interconnection networks: routing performance in the presence of faults
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242749
Beomsu Kim, H. Youn, K. Kavi
The authors develop a Markov model that can effectively estimate the successful routing probability and internode distance of hierarchical interconnection networks in the presence of faults. A BH/BH (binary hypercube/binary hypercube) network is analyzed with the model, and comparisons with computer simulation show that the proposed model is very accurate. The model is also expected to evaluate network performance effectively when all nodes generate messages.
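The quantity the model predicts can be estimated directly by simulation. The sketch below is a Monte Carlo baseline for the successful routing probability in a faulty binary hypercube, not the authors' Markov model; it assumes plain dimension-order (e-cube) routing with no rerouting around faults, and the parameters are made up.

```python
# Monte Carlo baseline for the quantity the Markov model estimates: the
# probability that a message is routed successfully in a binary hypercube
# when some nodes are faulty. Simulation sketch only, not the paper's
# analytical model; assumes e-cube routing with no fault avoidance.
import random

def route_success_prob(n_dims, n_faults, trials=20000, seed=0):
    rng = random.Random(seed)
    n_nodes = 1 << n_dims
    successes = 0
    for _ in range(trials):
        faulty = set(rng.sample(range(n_nodes), n_faults))
        src, dst = rng.sample([v for v in range(n_nodes) if v not in faulty], 2)
        cur, ok = src, True
        for d in range(n_dims):                 # correct one dimension at a time
            if (cur ^ dst) & (1 << d):
                cur ^= 1 << d
                if cur in faulty:               # next hop is faulty: routing fails
                    ok = False
                    break
        successes += ok
    return successes / trials

print(route_success_prob(n_dims=6, n_faults=4))
```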
{"title":"Hierarchical interconnection networks: routing performance in the presence of faults","authors":"Beomsu Kim, H. Youn, K. Kavi","doi":"10.1109/SPDP.1992.242749","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242749","url":null,"abstract":"The authors develop a Markov model which can effectively estimate the successful routing probability and internode distance of hierarchical interconnection networks in the presence of faults. A BH/BH (binary hypercube/binary hypercube) network is tested using the model and verified by computer simulations. Comparisons with computer simulation reveal that the proposed model is very accurate. The network performance, when all nodes generate messages, is also expected to be effectively evaluated with the model.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129812369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Programming environment for phase-reconfigurable parallel programming on SuperNode
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242710
J. Adamo, L. Trejo
The authors present a programming environment called C-NET, developed for the reconfigurable SuperNode multiprocessor. It allows the implementation of variable-topology programs, referred to as phase-reconfigurable programs. The design decisions concerning dynamic-reconfiguration handling are discussed with regard to the architectural constraints of the machine. C-NET provides three specialized languages: PPL (phase programming language), for the development of phase-reconfigurable programs; GCL (graph-construction language), for the construction of the graphs on which the phases are to be executed; and CPL (components programming language), for coding the software components. The first example on which the programming environment was tested was the conjugate-gradient (CG) algorithm, and the results are encouraging: a phase-reconfigurable implementation of CG was developed and compared with a fixed-topology implementation (an 8*4 torus).
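To make the phase-reconfigurable notion concrete, the following conceptual sketch treats a program as a sequence of phases, each pairing a process graph (the topology to configure) with the code to run on it. In C-NET these roles are played by GCL, CPL and PPL respectively; the Python names and graphs below are purely illustrative.

```python
# Conceptual sketch of a phase-reconfigurable program: a list of phases, each
# a (process graph, component code) pair. The graph builders play GCL's role,
# the per-node callables play CPL's role, and the driver loop plays PPL's role.

def ring(n):                          # graph construction (GCL's role)
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def torus(rows, cols):
    def nbrs(r, c):
        return [((r - 1) % rows) * cols + c, ((r + 1) % rows) * cols + c,
                r * cols + (c - 1) % cols, r * cols + (c + 1) % cols]
    return {r * cols + c: nbrs(r, c) for r in range(rows) for c in range(cols)}

def run_phase(graph, component):      # phase execution (PPL's role)
    for node, neighbours in graph.items():
        component(node, neighbours)   # component code (CPL's role)

phases = [
    (ring(8),      lambda node, nbrs: print("setup  ", node, nbrs)),
    (torus(2, 4),  lambda node, nbrs: print("compute", node, nbrs)),
]
for graph, component in phases:       # reconfigure between phases, then execute
    run_phase(graph, component)
```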
{"title":"Programming environment for phase-reconfigurable parallel programming on SuperNode","authors":"J. Adamo, L. Trejo","doi":"10.1109/SPDP.1992.242710","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242710","url":null,"abstract":"The authors present a programming environment called C-NET developed for the reconfigurable SuperNode multiprocessor. It allows the implementation of variable-topology programs that are referred to as phase-reconfigurable programs. The design decisions concerning dynamic-reconfiguration handling are discussed with regard to the architectural constraints of the machine. It provides three specialized languages: PPL (phase programming language), for the development of phase-reconfigurable programs: GCL (graph-construction language), for the construction of graphs on which the phases are to be executed; and CPL components programming language), for coding the software components. The first example on which the programming environment was tested was the conjugate-gradient (CG) algorithm. The results are encouraging. Phase-reconfigurable implementation of CG was developed and compared with a fixed topology implementation (8*4 torus).<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"C-20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126771849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed termination detection of loosely synchronized computations
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242744
Chengzhong Xu, F. Lau
An efficient algorithm for termination detection of loosely synchronized computations is proposed. The algorithm is fully symmetric in that all processes are syntactically identical and can detect global termination simultaneously. It achieves a shorter termination-detection delay than other related algorithms and is optimal in a number of regular structures. For a hypercube of any dimension, the algorithm takes two iteration steps to detect termination after global termination has occurred. In the chain, ring, mesh, and torus structures, the improvement is about 50% over its principal competitor. The algorithm requires that the graph be edge-colored and that the color-diameter be known to the processes in advance.
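The role of a diameter known in advance can be illustrated with a small symmetric scheme. The sketch below is not the paper's edge-colored algorithm; it is a synchronous-rounds simulation in which every process floods a table of "last round each process was known active" and declares termination once the newest entry it knows of is at least the diameter rounds old, at which point nothing can still be in flight. The ring topology, work model and probabilities are made up.

```python
# Symmetric termination-detection sketch in a synchronous-rounds model.
# Every process runs identical code; detection delay is bounded by the
# (known) network diameter. NOT the edge-colored algorithm of the paper.
import random

def simulate(n=8, seed=1):
    rng = random.Random(seed)
    diameter = n // 2                              # ring diameter
    work = [rng.randint(0, 3) for _ in range(n)]   # initial work units
    inbox = [0] * n                                # work arriving next round
    table = [[0] * n for _ in range(n)]            # last-active round, per local view
    t = 0
    while True:
        t += 1
        outbox = [0] * n
        for i in range(n):                         # application step
            work[i] += inbox[i]
            if work[i] > 0:
                work[i] -= 1
                if rng.random() < 0.3:             # hand a unit to a neighbour
                    outbox[(i + 1) % n] += 1
                table[i][i] = t                    # I was active this round
        inbox = outbox
        snapshot = [row[:] for row in table]       # synchronous exchange with
        for i in range(n):                         # both ring neighbours
            for j in ((i - 1) % n, (i + 1) % n):
                table[i] = [max(a, b) for a, b in zip(table[i], snapshot[j])]
        if all(max(table[i]) <= t - diameter for i in range(n)):
            return t                               # every process detects termination

print("termination detected at round", simulate())
```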
{"title":"Distributed termination detection of loosely synchronized computations","authors":"Chengzhong Xu, F. Lau","doi":"10.1109/SPDP.1992.242744","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242744","url":null,"abstract":"An efficient algorithm for termination detection of loosely synchronized computations is proposed. The proposed algorithm is fully symmetric in that all processes are syntactically identical and can detect global termination simultaneously. It is better in terms of the delay for termination detection than other related algorithms, and is optimal in a number of regular structures. For the hypercube structure of any dimension, the proposed algorithm takes two iteration steps to detect termination after global termination has occurred. In the chain, ring, mesh and torus structures, the improvement is about 50% over its principal competitor. The proposed algorithm requires that the graph be edge-colored and that the color-diameter be known to the processes in advance.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133566983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On embedding ternary trees into Boolean hypercubes
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242739
Ajay K. Gupta, Hong Wang
It is pointed out that the problem of efficiently embedding a k-ary tree into a hypercube with k >= 3 has largely remained unsolved, even though optimal embeddings (i.e., embeddings achieving minimum dilation, congestion, and expansion) of complete and incomplete binary trees into hypercubes have been known for some time. In their quest to design efficient embeddings of k-ary trees into hypercubes for arbitrary k, the authors present preliminary results that give efficient embeddings for the cases k = 3, 2^p, 3^p, and 2^p * 3^q, with p, q > 0. The embedding of complete ternary trees and the embedding of complete k-ary trees are considered.
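The cost measure being minimized can be made concrete with a tiny checker: given any assignment of tree nodes to hypercube vertices, the dilation is the largest Hamming distance spanned by a tree edge. The naive level-order assignment below is only a stand-in for illustration; the paper's constructions are designed to keep this number small.

```python
# Sketch of how embedding quality is measured. The naive embedding here is an
# illustrative placeholder, not the paper's construction.

def ternary_tree_edges(height):
    """Edges of a complete ternary tree with nodes 0..n-1 in level order."""
    n = (3 ** (height + 1) - 1) // 2
    return [(parent, 3 * parent + c) for parent in range(n)
            for c in (1, 2, 3) if 3 * parent + c < n], n

def dilation(edges, embed):
    """Max Hamming distance between the images of adjacent tree nodes."""
    return max(bin(embed[u] ^ embed[v]).count("1") for u, v in edges)

edges, n = ternary_tree_edges(height=3)            # 40-node complete ternary tree
dims = (n - 1).bit_length()                        # smallest hypercube that fits
naive = {v: v for v in range(n)}                   # naive: tree node i -> vertex i
print(f"{n} tree nodes in a {dims}-cube, dilation {dilation(edges, naive)}")
```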
{"title":"On embedding ternary trees into Boolean hypercubes","authors":"Ajay K. Gupta, Hong Wang","doi":"10.1109/SPDP.1992.242739","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242739","url":null,"abstract":"It is pointed out that the problem of efficiently embedding a k-ary tree into hypercube with k>or=3 has largely remained unsolved, even though optimal embeddings (i.e. embeddings achieving minimum delta , lambda , and in ) of complete and incomplete binary trees into hypercubes have been known for some time. Thus, in their quest for designing efficient embeddings of k-ary trees into hypercube for arbitrary k, the authors present some preliminary results that give efficient embeddings for the situations when k=3, 2/sup p/, 3/sup p/, 2/sup p/*3/sup q/ and p, q>0. The embedding of complete ternary trees and the embedding of complete k-ary trees are considered.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"274 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114105862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An exact hardware implementation of the Boltzmann machine
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242756
Marcin Skubiszewski
The author presents a faithful hardware implementation of the Boltzmann machine, built on top of DECPeRLe-1, a reconfigurable coprocessor closely coupled with its host machine, a DECstation 500. The prototype performs 505 megasynapses (millions of additions and multiplications) per second, using 16-bit fixed-point weights, and can emulate fully connected instances of the Boltzmann machine containing up to 1438 variables. The specialized hardware executes only the simplest part of the Boltzmann machine algorithm, namely multiplying matrices of numbers by vectors of bits; the other operations, which are complicated but require only a modest amount of computation, are performed by the host processor. The key point of this work resides in establishing the right design choices. Among these, the most important are the rejection of 'neural parallelism', which makes the implementation exact, and the algorithm used to generate random numbers in software, which allows the hardware to be simple. The fact that DECPeRLe-1 makes hardware development cheap and fast was essential to this work.
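The division of labour described above can be sketched in software: the weighted input of each unit comes from multiplying fixed-point weights by the current bit vector (the multiply-accumulate work the coprocessor batches into matrix-by-bit-vector products), while the host applies the stochastic Boltzmann update one unit at a time. The sizes, scaling and annealing schedule below are illustrative assumptions, not the paper's.

```python
# Software sketch of the hardware/host split for a Boltzmann machine with
# 16-bit fixed-point weights. Parameters are illustrative only.
import math, random

SCALE = 1 << 8                                   # fixed point: 8 fractional bits

def make_weights(n, rng):
    """Symmetric integer weight matrix with a zero diagonal."""
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            w[i][j] = w[j][i] = int(rng.uniform(-1.0, 1.0) * SCALE)
    return w

def unit_input(w_row, state):
    """The coprocessor's job: one weight row times the current bit vector."""
    return sum(wij for wij, bit in zip(w_row, state) if bit)

def sweep(w, state, temperature, rng):
    """The host's job: visit each unit and accept its new value stochastically."""
    for i in range(len(state)):
        s = unit_input(w[i], state) / SCALE
        p_on = 1.0 / (1.0 + math.exp(-s / temperature))
        state[i] = 1 if rng.random() < p_on else 0
    return state

rng = random.Random(0)
n = 16
w = make_weights(n, rng)
state = [rng.randint(0, 1) for _ in range(n)]
for temperature in (4.0, 2.0, 1.0, 0.5):         # simple annealing schedule
    for _ in range(50):
        state = sweep(w, state, temperature, rng)
print(state)
```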
{"title":"An exact hardware implementation of the Boltzmann machine","authors":"Marcin Skubiszewski","doi":"10.1109/SPDP.1992.242756","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242756","url":null,"abstract":"The author presents a faithful hardware implementation (built on the top of DECPeRLe-1, a reconfigurable coprocessor closely coupled with its host machine, a DECstation 500) of the Boltzmann machine. The prototype performs 505 megasynapses (million of additions and multiplications) per second, using 16-b fixed-point weights. It can emulate fully connected instances of the Boltzmann machine containing up to 1438 variables. This specialized hardware only executes the simplest part of the Boltzmann machine algorithm, namely, multiplying matrices of numbers by vectors of bits. The other operations (which are complicated, but only require a modest amount of computation) are performed by the host processor. It is noted that the key point of this work resides in establishing the right design choices. Among these, the most important ones are the rejection of 'neural parallelism', which makes the implementation exact, and the algorithm used to generate random numbers in software, which allows the hardware to be simple. The fact that DECPeRLe-1 makes hardware development cheap and fast was essential in this work.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116580273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Debugging dynamic distributed programs using global predicates
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242718
Yoshifumi Manabe, S. Aoyagi
The authors describe a debugger for distributed programs based on a replay technique. Distributed programs may dynamically fork child processes and open and close communication channels between processes. The debugger features breakpoint setting and selective trace commands with global predicate conditions, called conjunctive and disjunctive predicates, that span multiple processes. It can halt or test the processes at the first global state satisfying a given conjunctive-predicate breakpoint condition. The authors have developed a prototype distributed debugger, ddbx-p, on UNIX 4.2 BSD.
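What a conjunctive or disjunctive breakpoint means can be sketched over a replayed trace. The sketch assumes, as a strong simplification, that replay yields a totally ordered sequence of global states (a real debugger like ddbx-p must reason about partially ordered, consistent global states); the trace and predicate names are made up.

```python
# Sketch: find the first global state in a replayed trace that satisfies a
# conjunctive (all local predicates hold) or disjunctive (any holds) condition.

def first_breakpoint(global_states, local_predicates, mode="conjunctive"):
    """Index of the first global state satisfying the global predicate, or None."""
    combine = all if mode == "conjunctive" else any
    for index, state in enumerate(global_states):
        if combine(pred(state[pid]) for pid, pred in local_predicates.items()):
            return index
    return None

# Two processes; halt when p0's counter exceeds 3 AND p1 is in state "waiting".
trace = [
    {"p0": {"x": 1}, "p1": {"mode": "running"}},
    {"p0": {"x": 4}, "p1": {"mode": "running"}},
    {"p0": {"x": 5}, "p1": {"mode": "waiting"}},
]
preds = {"p0": lambda s: s["x"] > 3, "p1": lambda s: s["mode"] == "waiting"}
print(first_breakpoint(trace, preds, mode="conjunctive"))   # -> 2
print(first_breakpoint(trace, preds, mode="disjunctive"))   # -> 1
```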
{"title":"Debugging dynamic distributed programs using global predicates","authors":"Yoshifumi Manabe, S. Aoyagi","doi":"10.1109/SPDP.1992.242718","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242718","url":null,"abstract":"The authors describe a debugger for distributed programs based on a replay technique. Distributed programs may dynamically fork child processes and open and close communication channels between processes. This debugger features breakpoint setting and selective trace commands with global predicate conditions called conjunctive predicate and disjunctive predicate, which are related to multiple processes. It can halt or test the processes at the first global state for a given conjunctive predicate breakpoint condition. The authors have developed a prototype distributed debugger ddbx-p on UNIX 4.2 BSD.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121420255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An evaluation of planar-adaptive routing (PAR)
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242708
Jae H. Kim, A. Chien
Network performance can be improved by allowing adaptive routing, but doing so introduces new possibilities of deadlock, which can overwhelm the flexibility advantages. Planar-adaptive routing resolves this tension by limiting adaptive routing to a series of two-dimensional planes, reducing the hardware required for deadlock prevention. The authors explore the performance of planar-adaptive routers for two-, three-, and four-dimensional networks. Under nonuniform traffic loads, the planar-adaptive router significantly outperforms the dimension-order router, while giving comparable performance under uniform loads. With equal resources, the planar-adaptive router outperforms fully adaptive routers because it requires fewer resources for deadlock prevention, freeing them to increase the number of virtual lanes.
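The plane-by-plane restriction can be illustrated as a path-construction sketch: adaptivity is confined to two dimensions at a time, and the router moves to the next plane once the first dimension of the current plane is resolved. The sketch below only generates a hop sequence on a mesh; the virtual channels and deadlock-avoidance machinery of a real PAR router are omitted.

```python
# Path-construction sketch of planar-adaptive routing on an n-dimensional mesh.
# Adaptive choice is limited to plane A_i = dimensions (i, i+1); deadlock
# handling and virtual channels are not modelled.
import random

def planar_adaptive_path(src, dst, seed=0):
    rng = random.Random(seed)
    cur, path = list(src), [tuple(src)]
    n = len(src)
    for i in range(n - 1):                       # plane A_i = dimensions (i, i+1)
        while cur[i] != dst[i]:
            choices = [d for d in (i, i + 1) if cur[d] != dst[d]]
            d = rng.choice(choices)              # adaptive choice inside the plane
            cur[d] += 1 if dst[d] > cur[d] else -1
            path.append(tuple(cur))
    while cur[n - 1] != dst[n - 1]:              # finish the last dimension
        cur[n - 1] += 1 if dst[n - 1] > cur[n - 1] else -1
        path.append(tuple(cur))
    return path

print(planar_adaptive_path((0, 0, 0), (3, 2, 2)))   # 3D mesh example
```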
{"title":"An evaluation of planar-adaptive routing (PAR)","authors":"Jae H. Kim, A. Chien","doi":"10.1109/SPDP.1992.242708","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242708","url":null,"abstract":"Network performance can be improved by allowing adaptive routing, but doing so introduces new possibilities of deadlock which can overwhelm the flexibility advantages. Planar-adaptive routing resolves this tension by limiting adaptive routing to a series of two-dimensional planes, reducing hardware requirements for deadlock prevention. The authors explore the performance of planar-adaptive routers for two, three, and four-dimensional networks. Under nonuniform traffic loads, the planar-adaptive router significantly outperforms the dimension-order router, while giving comparable performance under uniform loads. With equal resources, the planar-adaptive router provides performance superior to fully adaptive routers because it requires less resources for deadlock prevention, freeing resources to increase the number of virtual lanes.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122620073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shared memory vs. message passing in shared-memory multiprocessors
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242736
T. LeBlanc, E. Markatos
It is argued that the choice between the shared-memory and message-passing models depends on two factors: the relative cost of communication and computation as implemented by the hardware, and the degree of load imbalance inherent in the application. Two representative applications are used to illustrate the performance advantages of each programming model on several different shared-memory machines, including the BBN Butterfly, Sequent Symmetry, Encore Multimax and Silicon Graphics Iris multiprocessors. It is shown that applications implemented in the shared-memory model perform better on the previous generation of multiprocessors, while applications implemented in the message-passing model perform better on modern multiprocessors. It is argued that both models have performance advantages, and that the factors that influence the choice of model may not be known at compile-time. As a compromise solution, the authors propose an alternative programming model, which has the load-balancing properties of the shared-memory model and the locality properties of the message-passing model, and show that this new model performs better than the other two alternatives.
{"title":"Shared memory vs. message passing in shared-memory multiprocessors","authors":"T. LeBlanc, E. Markatos","doi":"10.1109/SPDP.1992.242736","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242736","url":null,"abstract":"It is argued that the choice between the shared-memory and message-passing models depends on two factors: the relative cost of communication and computation as implemented by the hardware, and the degree of load imbalance inherent in the application. Two representative applications are used to illustrate the performance advantages of each programming model on several different shared-memory machines, including the BBN Butterfly, Sequent Symmetry, Encore Multimax and Silicon Graphics Iris multiprocessors. It is shown that applications implemented in the shared-memory model perform better on the previous generation of multiprocessors, while applications implemented in the message-passing model perform better on modern multiprocessors. It is argued that both models have performance advantages, and that the factors that influence the choice of model may not be known at compile-time. As a compromise solution, the authors propose an alternative programming model, which has the load balancing properties of the shared-memory model and the locality properties of the message-passing model, and show that this new model performs better than the other two alternatives.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126169770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Page replacement in distributed virtual memory systems
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242719
M. Malkawi, D. Knox, M. Abaza
The authors introduce three page-replacement policies and three page-out policies for distributed virtual memory systems. Two of the replacement policies, the least recently brought and the global recently used or brought, are adapted versions of the least recently used policy, which is well known in conventional virtual memory systems. Trace-driven simulation was used to evaluate the performance of the replacement policies and of the RR (round robin), LAN (least active neighbor), and LLN (least loaded neighbor) page-out policies. The results suggest that when the cost of internode faults is considerably higher than that of local memory access, the global and remote policies are superior to the local one. When the cost of bringing a page from an immediate neighbor is low compared to the cost of accessing local memory, the local policy performs as well as the global and remote policies. Among the page-out policies, round robin is the least efficient. LLN incurs lower cost than LAN when the local memory is relatively large, while under high memory contention LAN performs better.
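The contrast between plain LRU and a "least recently brought" style policy can be shown with a small single-node simulator: LRU evicts the page referenced longest ago, while LRB evicts the page that has been resident the longest regardless of recent use. Only the local case is modelled here; the global and remote variants and the RR/LAN/LLN page-out policies would need a multi-node memory model, and the trace below is made up.

```python
# Sketch: fault counts for LRU vs. an LRB-style policy on a toy reference trace.

def simulate(trace, frames, policy):
    resident, brought, used, faults = {}, {}, {}, 0
    for t, page in enumerate(trace):
        if page not in resident:
            faults += 1
            if len(resident) >= frames:                  # need to evict
                key = used if policy == "LRU" else brought
                victim = min(resident, key=lambda p: key[p])
                for d in (resident, brought, used):
                    d.pop(victim)
            resident[page] = True
            brought[page] = t                            # time the page was brought in
        used[page] = t                                   # time of last reference
    return faults

trace = [1, 2, 3, 1, 1, 1, 4, 5, 1, 2]
for policy in ("LRU", "LRB"):
    print(policy, simulate(trace, frames=3, policy=policy), "faults")
```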
{"title":"Page replacement in distributed virtual memory systems","authors":"M. Malkawi, D. Knox, M. Abaza","doi":"10.1109/SPDP.1992.242719","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242719","url":null,"abstract":"The authors introduce three page replacement, and page out policies, in distributed virtual memory systems. Two of the replacement policies, the least recently brought and the global recently used or brought, are adapted versions of the least recently used policy, which is well known in conventional virtual memory systems. Trace driven simulation was used to evaluate the performance of the replacement policies and the RR (round robin), LAN (least active neighbor), and LLN (least loaded neighbor) page out policies. The results suggest that when the cost of internode faults is considerably higher than local memory access, global and remote policies are superior to the local one. When the cost of bringing a page from the immediate neighbor is considerably low compared to the cost of accessing the local memory, the local policy performs as well as the global and the remote. Among the page out policies, round robin is the least efficient. LLN generates lower cost than LAN when the size of the local memory is relatively large. Under high memory contention, LAN shows better performance.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117182832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}