A fast sort using parallelism within memory
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242727
C. Leopold
The author models the internal structure of memory as a tree in which nodes represent memory modules (such as caches and disks) and edges represent the buses between them. The nearer a module is to the root, the smaller its access time, capacity, and block size. All buses may transmit blocks of data in parallel. The author gives a deterministic sorting algorithm based on greed-sort and shows its running time to be optimal up to a constant factor. The bound implies how many parallel modules are necessary at each hierarchy level to overcome the I/O bottleneck of sorting. The proposed algorithm also applies to the less general models UMH (uniform memory hierarchies) and P-UMH.
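To make the model concrete, here is a minimal Python sketch of such a tree of memory modules. The module names and the specific access times, capacities, and block sizes are invented for illustration; the model only requires that they shrink toward the root and that sibling modules can use their buses in parallel.

```python
# Illustrative sketch of a tree-structured memory hierarchy (all figures hypothetical).
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    access_time: float      # time to move one block across the bus to the parent
    capacity: int           # number of data items the module can hold
    block_size: int         # items transferred per bus operation
    children: list = field(default_factory=list)

# Root = fastest/smallest module; leaves = slow, large disks.
root  = Module("registers", access_time=1,     capacity=2**6,  block_size=1)
cache = Module("cache",     access_time=4,     capacity=2**12, block_size=8)
ram   = Module("RAM",       access_time=32,    capacity=2**22, block_size=64)
disks = [Module(f"disk{i}", access_time=10**4, capacity=2**32, block_size=2**12)
         for i in range(4)]          # parallel modules at the same level

root.children  = [cache]
cache.children = [ram]
ram.children   = disks               # all buses may transfer blocks in parallel
```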
{"title":"A fast sort using parallelism within memory","authors":"C. Leopold","doi":"10.1109/SPDP.1992.242727","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242727","url":null,"abstract":"The author models the internal structure of memory by a tree, where nodes represent memory modules (like cache, disks), and edges represent buses between them. The modules have smaller access time, capacity, and block size the nearer they are to the root. All buses may transmit blocks of data in parallel. The author gives a deterministic sorting algorithm based on greed-sort. Its running time is shown to be optimal up to a constant factor. The bound implies the number of parallel modules necessary at each hierarchy level to overcome the I/O bottlenecks of sorting. The proposed algorithm also applies to the less general models UMH (uniform memory hierarchies) and P-UMH.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114925674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A general purpose distributed implementation of simulated annealing
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242758
Ralf Diekmann, Reinhard Lüling, J. Simon
The authors present a problem-independent, general-purpose parallel implementation of simulated annealing (SA) on distributed message-passing multiprocessor systems. The sequential algorithm is studied, and a classification of combinatorial optimization problems together with their neighborhood structures is given. Several parallelization approaches are examined with respect to their suitability for problems of the various classes, and for typical representatives of the different classes good parallel SA implementations are presented. A novel parallel SA algorithm is introduced that works on several Markov chains simultaneously and decreases the number of chains dynamically; combined with a parallel self-adapting cooling schedule, this method yields good results. All algorithms are implemented in OCCAM-2 on a freely configurable transputer system. Measurements on up to 128 transputers are presented.
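The multiple-Markov-chain idea can be illustrated with a short sketch (Python here, not the authors' OCCAM-2 code). The cooling factor, initial chain count, and the keep-the-better-half shrinking rule are illustrative assumptions, not the paper's schedule.

```python
# Minimal multi-chain simulated annealing with a dynamically shrinking chain pool.
import math, random

def anneal(cost, neighbor, start, n_chains=8, t0=10.0, alpha=0.95, steps=2000):
    chains = [start() for _ in range(n_chains)]   # one Markov chain per (virtual) processor
    temp = t0
    for step in range(steps):
        for i, s in enumerate(chains):
            cand = neighbor(s)
            delta = cost(cand) - cost(s)
            # Metropolis acceptance rule.
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                chains[i] = cand
        temp *= alpha                              # geometric cooling (assumed schedule)
        # Dynamically reduce the number of chains: keep the better half periodically.
        if step % 500 == 499 and len(chains) > 1:
            chains = sorted(chains, key=cost)[:max(1, len(chains) // 2)]
    return min(chains, key=cost)

# Toy usage: minimize a one-dimensional quadratic.
best = anneal(cost=lambda x: (x - 3) ** 2,
              neighbor=lambda x: x + random.uniform(-1, 1),
              start=lambda: random.uniform(-10, 10))
```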
{"title":"A general purpose distributed implementation of simulated annealing","authors":"Ralf Diekmann, Reinhard Lüling, J. Simon","doi":"10.1109/SPDP.1992.242758","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242758","url":null,"abstract":"The authors present a problem-independent general-purpose parallel implementation of simulated annealing (SA) on distributed message-passing multiprocessor systems. The sequential algorithm is studied, and a classification of combinatorial optimization problems together with their neighborhood structures is given. Several parallelization approaches are examined, considering their suitability for problems of the various classes. For typical representatives of the different classes, good parallel SA implementations are presented. A novel parallel SA algorithm that works simultaneously on several Markov chains and decreases the number of chains dynamically is presented. This method yields good results with a parallel self-adapting cooling schedule. All algorithms are implemented in OCCAM-2 on a free configurable transputer system. Measurements on various numbers of processors up to 128 transputers are presented.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130433651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Residue number systems: a key to parallelism in public key cryptography
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242713
K. C. Posch, R. Posch
Public-key cryptography and parallel algorithms are considered, with special attention to algorithms using long-integer modular arithmetic. A modification of the well-known RSA algorithm is taken as a candidate. Previous implementations have been more or less sequential in the sense that a long integer is not partitioned among several processing elements. The proposed approach allows a dedicated processor to be used for each group of about 30 to 50 bits of a long integer. Efficiency is gained primarily when special-purpose processors are used. In this regard, the work is the basis of a VLSI approach to a multiprocessor-based cryptographic design involving 15 to 100 processors.
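The parallelism comes from residue number systems: a long integer is split into small residues modulo pairwise coprime moduli, and each residue is handled independently by its own processor, since addition and multiplication never carry across channels. A minimal sketch, with deliberately tiny (hypothetical) moduli:

```python
# Residue-number-system arithmetic in miniature; a real design would use
# word-sized pairwise-coprime moduli, one per processing element.
from math import prod

MODULI = [101, 103, 107, 109]        # small pairwise-coprime primes, illustration only
M = prod(MODULI)                     # dynamic range of the representation

def to_rns(x):
    """Split an integer into independent residues, one per (parallel) channel."""
    return [x % m for m in MODULI]

def rns_mul(a, b):
    """Channel-wise multiplication: no carries cross channels."""
    return [(ai * bi) % m for ai, bi, m in zip(a, b, MODULI)]

def from_rns(r):
    """Recombine residues with the Chinese Remainder Theorem."""
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        x += ri * Mi * pow(Mi, -1, m)
    return x % M

a, b = 123456, 7890
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == (a * b) % M
```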
{"title":"Residue number systems: a key to parallelism in public key cryptography","authors":"K. C. Posch, R. Posch","doi":"10.1109/SPDP.1992.242713","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242713","url":null,"abstract":"Public key cryptography and parallel algorithms are considered. Special attention is paid to algorithms using long integer modulo arithmetic. A modification of the commonly known RSA algorithm is taken as a candidate. So far all implementations have been more or less sequential in the sense that no partitions of a long integer among various processing elements have been performed. The proposed approach allows the use of a dedicated processor for each group of about 30 to 50 bits of a long integer. Efficiency is primarily gained when special-purpose processors are used. In this regard this work is the basis of a VLSI approach to a multiprocessor-based cryptographic design with 15 to 100 processors involved.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131656323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Scalable Tree Protocol: a cache coherence approach for large-scale multiprocessors
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242703
H. Nilsson, P. Stenström
The problem of cache coherence in large-scale shared-memory multiprocessors has been addressed using directory schemes. Two problems arise as the number of processors increases: network latency grows, and the implementation cost must be kept acceptable. The authors present a tree-based cache coherence protocol called the Scalable Tree Protocol (STP). They show that it can be implemented at a reasonable cost and that the write latency is logarithmic in the size of the sharing set. How to maintain an optimal tree structure and how to handle replacements efficiently are critical issues the authors address for this type of protocol. They compare the performance of the STP with that of the Scalable Coherent Interface (SCI, IEEE standard P1596) on a classical matrix-oriented algorithm targeted at large-scale parallel processing, and show that the STP reduces the execution time considerably by reducing the write latency.
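The qualitative difference between a tree-organized sharing set and a linear sharing list (as in SCI) can be seen from a back-of-the-envelope hop count; the perfectly balanced tree and unit hop cost below are simplifying assumptions, not figures from the paper.

```python
# Rough comparison of invalidation latency: balanced sharer tree vs. linear sharer list.
import math

def invalidation_hops_tree(sharers: int) -> int:
    # Invalidations fan out level by level, so latency grows with tree height.
    return math.ceil(math.log2(sharers)) if sharers > 1 else 1

def invalidation_hops_list(sharers: int) -> int:
    # A linear sharing list must be walked end to end.
    return sharers

for s in (2, 16, 128, 1024):
    print(f"{s:5d} sharers: tree {invalidation_hops_tree(s):3d} hops, "
          f"list {invalidation_hops_list(s):4d} hops")
```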
{"title":"The Scalable Tree Protocol-a cache coherence approach for large-scale multiprocessors","authors":"H. Nilsson, P. Stenström","doi":"10.1109/SPDP.1992.242703","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242703","url":null,"abstract":"The problem of cache coherence in large-scale shared-memory multiprocessors has been addressed using directory-schemes. Two problems arise when the number of processors increases: the network latency increases and the implementation cost must be kept acceptable. The authors present a tree-based cache coherence protocol called the scalable tree protocol (STP). They show that it can be implemented at a reasonable implementation cost and that the write latency is logarithmic to the size of the sharing set. How to maintain an optimal tree structure and how to handle replacements efficiently are critical issues the authors address for this type of protocol. They compare the performance of the STP with that of the scalable coherent interface (SCI) (IEEE standard P1596) by considering a classical matrix-oriented algorithm targeted for large-scale parallel processing. They show that the STP manages to reduce the execution time considerably by reducing the write latency.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134623294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software caching on cache-coherent multiprocessors
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242700
R. Bianchini, T. LeBlanc
The authors explore the utility of software caching on a machine with coherent caches. In particular, they show that caching at the application level avoids the problem of false sharing on cache-coherent machines. They compare the performance of software caching with that of other techniques for alleviating false sharing and show that software caching outperforms the alternatives when the reference behavior of an application changes dynamically. They conclude that software caching, as well as other techniques developed for noncoherent shared-memory multiprocessors, can be profitably used on machines with hardware-coherent caches, and that programs based on these techniques are efficient across a variety of shared-memory machines.
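The restructuring that application-level caching implies can be sketched as follows: each thread updates a private copy (its software cache) and writes back once, so concurrently written items never sit in the same cache line. Python threads are used only to show the structure; the coherence traffic being avoided matters on real hardware.

```python
# Sketch of avoiding false sharing by caching a shared datum in thread-private state.
from threading import Thread

N_THREADS = 4
shared = [0] * N_THREADS          # adjacent elements: updating these in place on real
                                  # hardware would cause false sharing of a cache line

def worker(tid, iterations=100_000):
    local = 0                     # private "software-cached" copy, no coherence traffic
    for _ in range(iterations):
        local += 1
    shared[tid] = local           # single write-back at the end

threads = [Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for t in threads: t.start()
for t in threads: t.join()
```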
{"title":"Software caching on cache-coherent multiprocessors","authors":"R. Bianchini, T. LeBlanc","doi":"10.1109/SPDP.1992.242700","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242700","url":null,"abstract":"The authors explore the utility of software caching on a machine with coherent caches. In particular, they show that by caching at the application level one can avoid the problem of false sharing on cache-coherent machines. They compare the performance of software caching with that of other techniques for alleviating false sharing and show that software caching performs better than the alternatives when the reference behavior of an application changes dynamically. It is concluded that software caching, as well as other techniques developed for noncoherent shared-memory multiprocessors, can be profitably used on machines with hardware coherent caches and that programs based on these techniques are efficient across a variety of shared-memory machines.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123908765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A tight bound on the diameter of one dimensional PEC networks
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242722
Cho-Chin Lin, V. Prasanna
The diameter of a packed exponential connections (PEC) network on N nodes is shown to be Θ(√(log N) · 2^√(2 log N)), where log N denotes the logarithm to the base 2. The results can be extended to the case of two-dimensional PEC networks.
{"title":"A tight bound on the diameter of one dimensional PEC networks","authors":"Cho-Chin Lin, V. Prasanna","doi":"10.1109/SPDP.1992.242722","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242722","url":null,"abstract":"The diameter of a packed exponential connections (PEC) network on N nodes is shown to be theta ( square root log N*2 square root /sup (2log/ /sup N)/, where log N denotes log to the base 2. The present results can be extended to the case of two-dimensional PEC networks.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129198511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A methodology for generating data distributions to optimize communication
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242712
S. Gupta, S. Kaushik, Chua-Huang Huang, John R. Johnson, Rodney W. Johnson, P. Sadayappan
The authors present an algebraic theory, based on the tensor product, for describing the semantics of regular data distributions such as block, cyclic, and block-cyclic distributions. These distributions have been proposed in High Performance Fortran, an ongoing effort to develop a Fortran extension for massively parallel computing. This algebraic theory has been used for designing and implementing block recursive algorithms on shared-memory and vector multiprocessors. In the present work, the authors extend the theory to generate programs with explicit data distribution commands from tensor product formulas. A methodology for generating data distributions that optimize communication is described and is demonstrated by generating efficient programs, with data distribution, for the fast Fourier transform.
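For reference, the block-cyclic index mapping that these distributions denote can be written in a few lines; block (block size = array size / p) and cyclic (block size = 1) fall out as special cases. The function name and signature below are ours, not notation from the paper.

```python
# Block-cyclic ownership mapping in the HPF style.
def owner(i: int, block: int, nprocs: int) -> tuple[int, int, int]:
    """Map global index i to (processor, local block number, offset within block)."""
    blk = i // block
    return blk % nprocs, blk // nprocs, i % block

# 12 elements, block size 2, 3 processors:
# indices 0,1 -> P0; 2,3 -> P1; 4,5 -> P2; 6,7 -> P0; ...
assert [owner(i, 2, 3)[0] for i in range(12)] == [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]
```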
{"title":"A methodology for generating data distributions to optimize communication","authors":"S. Gupta, S. Kaushik, Chua-Huang Huang, John R. Johnson, Rodney W. Johnson, P. Sadayappan","doi":"10.1109/SPDP.1992.242712","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242712","url":null,"abstract":"The authors present an algebraic theory, based on the tensor product for describing the semantics of regular data distributions such as block, cyclic, and block-cyclic distributions. These distributions have been proposed in high performance Fortran, an ongoing effort for developing a Fortran extension for massively parallel computing. This algebraic theory has been used for designing and implementing block recursive algorithms on shared-memory and vector multiprocessors. In the present work, the authors extend this theory to generate programs with explicit data distribution commands from tensor product formulas. A methodology to generate data distributions that optimize communication is described. This methodology is demonstrated by generating efficient programs with data distribution for the fast Fourier transform.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125425544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using communication-to-computation ratio in parallel program design and performance prediction
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242738
M. Crovella, R. Bianchini, T. LeBlanc
The authors' goal is to predict the performance of a parallel program early in the development process; to that end, they require prediction methods that can be applied to incomplete programs. They describe how a single method based on the communication-to-computation (C/C) ratio can predict performance accurately yet fairly simply in some commonly encountered cases. They show how C/C-ratio-based methods are applied on both distributed-memory and coherent-memory multiprocessors, and that focusing on the C/C ratio simplifies the combination of theory, machine benchmarking, and application measurement needed for good parallel performance prediction. In addition, the methods are useful because they can be applied to program fragments or serially executed code.
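A deliberately simple way to turn a C/C ratio into a prediction, assuming communication is not overlapped with computation, is shown below. This is a generic illustration of the idea, not the authors' calibrated model.

```python
# If each unit of computation incurs r units of communication (r = C/C ratio)
# and the two do not overlap, per-processor time scales by (1 + r).
def predicted_efficiency(cc_ratio: float) -> float:
    return 1.0 / (1.0 + cc_ratio)

def predicted_speedup(nprocs: int, cc_ratio: float) -> float:
    return nprocs * predicted_efficiency(cc_ratio)

# A program fragment that communicates for 0.25 time units per unit of computation:
print(predicted_speedup(64, 0.25))   # ~51.2 out of an ideal 64
```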
{"title":"Using communication-to-computation ratio in parallel program design and performance prediction","authors":"M. Crovella, R. Bianchini, T. LeBlanc","doi":"10.1109/SPDP.1992.242738","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242738","url":null,"abstract":"The authors goal is to be able to predict the performance of a parallel program early in the program development process; to that end they require prediction methods that can be based on incomplete programs. They describe how a single method based on communication-to-computation (C/C) ratio can be used to predict performance accurately and yet fairly simply in some commonly encountered cases. They show how C/C-ratio-based methods are accomplished for both distributed-memory and coherent-memory multiprocessors. They show that focusing on C/C ratio simplifies the use of theory, machine benchmarking and application measurement necessary to provide good parallel performance prediction. In addition, the methods demonstrated are useful because they can be applied to program fragments, or serially executed code.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126392749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating reliability improvements of fault tolerant VLSI processor arrays
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242752
D. Tao
An important and meaningful criterion for evaluating a VLSI processor array that incorporates an ABFT (algorithm-based fault tolerance) technique is identified. A reliability model that can be used to accurately compute the reliability improvement of a fault-tolerant processor array is established. Examples show that, when an ABFT technique is incorporated, the reliability improvement depends on the size of the processor array, the nature of the failures, and the failure rate. Using the reliability model and methods discussed here, a system designer can therefore determine a priori whether it is beneficial to incorporate an ABFT technique. Moreover, if the reliability of an ABFT processor array cannot meet the specified requirement, the proposed method can also serve as a guide for partitioning the array into smaller ones so that the ABFT technique remains effective while a minimal amount of overhead is introduced.
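As an illustration of how such a reliability comparison works, assume independent exponential processor failures and an ABFT scheme that masks a single faulty processor. The k-of-N formulas below are textbook expressions used only to show how the improvement depends on array size and failure rate, not the paper's exact model.

```python
# Reliability with and without single-fault-masking ABFT, under independent
# exponential failures with rate lam per processor over mission time t.
import math

def rel_plain(n: int, lam: float, t: float) -> float:
    # All n processors must survive.
    return math.exp(-n * lam * t)

def rel_abft_1fault(n: int, lam: float, t: float) -> float:
    p = math.exp(-lam * t)                       # survival probability of one processor
    return p**n + n * p**(n - 1) * (1 - p)       # zero or exactly one failure tolerated

for n in (16, 64, 256):
    improvement = rel_abft_1fault(n, 1e-5, 1e4) / rel_plain(n, 1e-5, 1e4)
    print(f"{n:4d} processors: reliability improvement factor {improvement:.2f}")
```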
{"title":"Evaluating reliability improvements of fault tolerant VLSI processor arrays","authors":"D. Tao","doi":"10.1109/SPDP.1992.242752","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242752","url":null,"abstract":"An important and meaningful criterion for evaluating a VLSI processor array incorporating an ABFT (algorithm-based fault tolerance) technique is identified. A reliability model which can be used to accurately compute the reliability improvement of a fault-tolerant processor array is established. Examples showing that, when an ABFT technique is incorporated, the reliability improvement depends on the size of the processor array, the nature of the failure, and the failure rate are presented. Therefore, by using the reliability model and methods discussed here, a system designer will be able to determine whether it is beneficial to incorporate an ABFT technique a priori. Moreover, if the reliability of an ABFT processor array cannot meet the specified requirement, the proposed method can also be used as a guide to partition it into smaller ones so that this ABFT technique is still effective and a minimal amount of overhead is introduced.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116477876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deterministic routing on circular arrays
Pub Date: 1992-12-01 | DOI: 10.1109/SPDP.1992.242721
Michael Kaufmann, J. F. Sibeyn
The authors analyze the routing of k-permutations on circular processor arrays connected by bidirectional links. In contrast to linear processor arrays, it is hard to prove lower bounds on the routing time or to construct efficient algorithms for routing k-permutations on circular arrays (except for the case k = 1). The authors prove nontrivial lower bounds both for routing with global knowledge and for routing with local knowledge, and present deterministic algorithms that use local information only. The best of these algorithms requires only k·n/4 + εn routing steps for all k ≥ 4, which almost matches the k·n/4 lower bound. Special attention is given to the cases k = 2 and k = 3.
{"title":"Deterministic routing on circular arrays","authors":"Michael Kaufmann, J. F. Sibeyn","doi":"10.1109/SPDP.1992.242721","DOIUrl":"https://doi.org/10.1109/SPDP.1992.242721","url":null,"abstract":"The authors analyze the routing of k-permutations on circular processor arrays connected by bidirectional links. In contrast to linear processor arrays, it is hard to prove lower bounds for the routing time or to construct efficient algorithms for routing k-permutations on circular arrays (except for the case k=1). The authors prove nontrivial lower bounds for routing with global knowledge and for routing with local knowledge. They present deterministic algorithms that use local information only. The best of these algorithms requires only k*n/4+emsn routing steps for all k>or=4. This almost matches the k*n/4 lower bound. Special attention is given to the cases k=2 and 3.<<ETX>>","PeriodicalId":265469,"journal":{"name":"[1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133808990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}