Global combine on mesh architectures with wormhole routing
M. Barnett, R. Littlefield, D. G. Payne, R. V. D. Geijn
[1993] Proceedings Seventh International Parallel Processing Symposium
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262873
Abstract: Several algorithms are discussed for implementing global combine (summation) on distributed memory computers using a two-dimensional mesh interconnect with wormhole routing. These include algorithms that are asymptotically optimal for short vectors (O(log p) for p processing nodes) and for long vectors (O(n) for n data elements per node), as well as hybrid algorithms that are superior for intermediate n. Performance models are developed that include the effects of link conflicts and other characteristics of the underlying communication system. The models are validated using experimental data from the Intel Touchstone DELTA computer. Each of the combine algorithms is shown to be superior under some circumstances.
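The short-vector O(log p) bound mentioned in the abstract is the classic recursive-doubling pattern. As an illustrative simulation sketch (not the paper's algorithms or code; all names invented here), the following shows how p nodes can each obtain the elementwise global sum in log2(p) exchange-and-add rounds:

```python
# Sketch: recursive-doubling global combine, the O(log p) pattern that
# short-vector algorithms of this family follow. Each "node" holds a local
# vector; after log2(p) rounds of pairwise exchange-and-add, every node
# holds the elementwise global sum.

def global_combine(vectors):
    """Simulate recursive doubling; assumes p is a power of two."""
    p = len(vectors)
    assert p & (p - 1) == 0, "sketch assumes a power-of-two node count"
    data = [list(v) for v in vectors]          # working copy per node
    step = 1
    while step < p:
        # Node i exchanges with partner i XOR step, then both sides
        # add the received vector into their local partial sum.
        data = [[a + b for a, b in zip(data[i], data[i ^ step])]
                for i in range(p)]
        step <<= 1
    return data

result = global_combine([[1, 2], [3, 4], [5, 6], [7, 8]])
# every node now holds the global sum [16, 20]
```

Roughly speaking, long-vector schemes in this family instead pipeline partial sums across the mesh to reach the O(n) per-node bound, and the hybrids the paper studies switch between the two regimes at intermediate n.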
A framework for predicting delay due to job interactions in a 2-D mesh multicomputer
Dugki Min, M. Mutka
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262905
Abstract: The authors develop expressions for predicting contention delay in wormhole-routed 2-D mesh multicomputers. The detrimental effect of contention caused by interference within jobs has led them to analyze two kinds of communication contention. Starting contention occurs when a processor attempts to access the network at the first hop on its route from source to destination. Intermediate contention has different characteristics: it is the contention facing a communication path as the message arrives at intermediate nodes along that route. They describe how their expressions are developed and relate them to the problem of evaluating interference within a job assigned to a multicomputer.
2D and 3D optimal parallel image warping
C. Wittenbrink, A. Somani
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262901
Abstract: Spatial image warping is useful for image processing and graphics. The authors present optimal concurrent-read-exclusive-write (CREW) and exclusive-read-exclusive-write (EREW) parallel-random-access-machine (PRAM) algorithms that achieve O(1) asymptotic run time. The significant result is the creative processor assignment that yields an EREW PRAM forward direct warp algorithm. The forward algorithm computes any nonscaling affine transform. The EREW algorithm is the most efficient in practice: a 16K-processor MasPar MP-1 can rotate a 4-million-element image in under a second and a 2-million-element volume in half a second. This high performance allows interactive viewing of volumes from arbitrary viewpoints and demonstrates linear speedup.
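A forward (source-to-destination) warp of the kind the abstract describes can be sketched in NumPy. This is a hedged illustration under simplifying assumptions — nearest-neighbor rounding, and no treatment of the write collisions that the paper's processor assignment is designed to avoid; all names are invented here:

```python
import numpy as np

def forward_warp(image, matrix):
    """Push every source pixel through a 2x2 affine matrix and round to the
    nearest destination cell. One write per source pixel is the property an
    exclusive-write (EREW) processor assignment exploits; resolving rounding
    collisions and holes is exactly what the paper's assignment addresses."""
    h, w = image.shape
    out = np.zeros_like(image)
    ys, xs = np.mgrid[0:h, 0:w]
    src = np.stack([xs.ravel(), ys.ravel()])        # 2 x (h*w) coordinates
    dx, dy = np.rint(matrix @ src).astype(int)      # forward-mapped targets
    keep = (dx >= 0) & (dx < w) & (dy >= 0) & (dy < h)
    out[dy[keep], dx[keep]] = image.ravel()[keep]
    return out

img = np.arange(16).reshape(4, 4)
same = forward_warp(img, np.eye(2))                 # identity transform
```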
A separation between reconfigurable mesh models
P. MacKenzie
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262860
Abstract: The author proves separations between two models of the reconfigurable mesh (rmesh): the cross-over model and the non-cross-over model. Specifically, he shows that in the non-cross-over model, a k×n rmesh requires Ω((log n)/k) time to compute the parity of n bits stored one per column, and a √n×√n rmesh requires Ω(log* n) time to compute the parity of n bits stored one per processor. In the cross-over model, in either case, the parity can be computed in constant time. The lower bounds given in this paper are the first separations demonstrated between the cross-over and non-cross-over models. These lower bounds do not rely on the bandwidth constraints of the mesh and do not restrict the instruction sets of the processors. Moreover, they are the first lower bounds for the rmesh that require only binary inputs.
Critical performance path analysis, and efficient code generation issues, for the Seamless architecture
D. L. Bright, S. Fineberg, B. H. Pease, M. L. Roderick, S. Sundaram, T. Casavant
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262813
Abstract: An analytical study of potential pathological performance areas of the Seamless architecture is presented. Seamless is a latency-tolerant, distributed memory, multiprocessor architecture. A key component of the philosophy of Seamless, however, is the use of standard, commodity components for a large part of the system. A discussion of the unavoidable implementation compromises imposed by this decision is presented, followed by a summary of some optimistic performance studies. Then an analytical study that parameterizes and predicts the worst-case impact of using standard components is provided. Finally, it is shown that these bottlenecks are manageable via careful generation of target machine code, so that the optimistic performance studies become realistic expectations for a range of program behaviors and granularities.
'Unstable threads' kernel interface for minimizing the overhead of thread switching
S. Inohara, Kazuhiko Kato, T. Masuda
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262872
Abstract: The performance of threads is limited primarily by the overhead of two kinds of switching: vertical switching (user/kernel domain switching) and horizontal switching (context switching between threads). Although these switches are indispensable in some situations, existing thread mechanisms incur unnecessary switches on multiprogrammed systems because of inappropriate interfaces between the operating system kernel and user-level programs. This paper presents a set of interfaces between the kernel and user-level programs that minimizes the overhead of both kinds of switching. The kernel provides 'unstable threads,' which are controlled solely by the kernel, while each user-level program monitors them and gives suggestions on their activities to the kernel through a memory area shared between the kernel and user address spaces. This new way of separating thread management minimizes the overhead of vertical and horizontal switching.
Cache coherence for shared memory multiprocessors based on virtual memory support
K. Petersen, Kai Li
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262854
Abstract: This paper presents a software cache coherence scheme that uses virtual memory (VM) support to maintain cache coherence on shared memory multiprocessors. Traditional VM translation hardware in each processor is used to detect memory access attempts that would violate cache coherence, and system software is used to enforce coherence. The implementation of this class of coherence schemes is very economical: it requires neither special multiprocessor hardware nor compiler support, and it easily incorporates different consistency models. The authors evaluated two consistency models for the VM-based approach: sequential consistency and lazy release consistency. The VM-based schemes are compared with a bus-based snoopy caching architecture, and the authors' trace-driven simulation results show that the VM-based cache coherence schemes are practical for small-scale shared memory multiprocessors.
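The mechanism can be pictured with a toy simulation — a hedged sketch with invented names, not the authors' implementation. Page "protections" here are plain state flags, and the "fault handlers" do what the real scheme does with VM hardware: fetch a copy on a read miss, and invalidate all other copies before a write (the sequential-consistency variant):

```python
# Sketch of page-protection-driven software coherence (simulation only).
INVALID, READ_ONLY, WRITABLE = "invalid", "read-only", "writable"

class Node:
    """One processor's page table; `directory` maps page -> set of holders."""
    def __init__(self, directory):
        self.directory = directory
        self.state = {}                        # page -> protection state

    def read(self, page):
        if self.state.get(page, INVALID) == INVALID:
            # read "fault": fetch a shared copy and map it read-only
            self.directory.setdefault(page, set()).add(self)
            self.state[page] = READ_ONLY
        return self.state[page]

    def write(self, page):
        if self.state.get(page, INVALID) != WRITABLE:
            # write "fault": invalidate every other copy, then map writable
            for holder in self.directory.get(page, set()) - {self}:
                holder.state[page] = INVALID
            self.directory[page] = {self}
            self.state[page] = WRITABLE
        return self.state[page]

directory = {}
a, b = Node(directory), Node(directory)
a.read(0)        # a holds a read-only copy of page 0
b.write(0)       # b's write fault invalidates a's copy
```

Under lazy release consistency, the second model the paper evaluates, this eager invalidation at each write fault would roughly be deferred to synchronization points instead.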
Task scheduling on a hypercube with link contentions
S. Kon'ya, T. Satoh
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262907
Abstract: The authors propose a new task scheduling algorithm that takes communication delays and link contentions into account to meet the requirements of a hypercube's communication model. It assigns each task a priority that includes communication delays, and selects the processor where the task will be allocated so as to minimize link contentions. Evaluation has been carried out using randomly generated graphs. The results show that almost linear speed-up is obtained when the number of tasks is 1024 and the number of processors ranges between 2 and 32. A ratio of communication time to processing time (C/P), which indicates the difficulty of scheduling task graphs with communication, is introduced and verifies the effectiveness of the proposed algorithm.
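The two ingredients the abstract names — priorities that include communication delays, and communication-aware processor selection — can be shown with a generic list-scheduling sketch. This is a hedged illustration, not the authors' algorithm (which additionally models link contention on the hypercube); all names are invented:

```python
# Sketch: list scheduling with communication delays. Priority is a "bottom
# level" that counts edge communication costs; each task is placed on the
# processor where it can start earliest, paying the communication delay
# only when a predecessor ran on a different processor.

def schedule(tasks, succ, cost, comm, n_procs):
    blevel = {}
    def bl(t):                      # longest remaining path, comm included
        if t not in blevel:
            blevel[t] = cost[t] + max(
                (comm[t, s] + bl(s) for s in succ.get(t, [])), default=0)
        return blevel[t]
    order = sorted(tasks, key=bl, reverse=True)   # topological for positive costs

    preds = {t: [p for p in tasks if t in succ.get(p, [])] for t in tasks}
    proc_free = [0] * n_procs
    placed = {}                                   # task -> (proc, finish time)
    for t in order:
        best = None
        for q in range(n_procs):
            ready = max((placed[p][1] + (comm[p, t] if placed[p][0] != q else 0)
                         for p in preds[t]), default=0)
            start = max(ready, proc_free[q])
            if best is None or start < best[0]:
                best = (start, q)
        start, q = best
        placed[t] = (q, start + cost[t])
        proc_free[q] = start + cost[t]
    return placed

succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
cost = {"a": 2, "b": 3, "c": 2, "d": 1}
comm = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}
plan = schedule(["a", "b", "c", "d"], succ, cost, comm, n_procs=2)
```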
Towards understanding block partitioning for sparse Cholesky factorization
Sesh Venugopal, V. Naik
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262780
Abstract: The authors examine the effect of two partitioning parameters on the performance of block-based distributed sparse Cholesky factorization. They present results showing the trends in the effect of these parameters on computation speeds, communication costs, the extent of processor idling caused by load imbalances, and bookkeeping overheads. These results provide a better understanding of how to select the partitioning parameters so as to reduce the computation and communication costs without increasing the overhead costs or the load imbalance among the processors. Experimental results from a 32-processor iPSC/860 are presented.
New wormhole routing algorithms for multicomputers
R. Boppana, S. Chalasani
Pub Date: 1993-04-13 | DOI: 10.1109/IPPS.1993.262919
Abstract: Development of wormhole routing techniques has so far been largely independent of the results available for store-and-forward routing in the literature. The authors provide a general result that enables them to design deadlock-free wormhole routing algorithms from store-and-forward routing algorithms satisfying certain criteria. They illustrate this result by developing fully adaptive, deadlock-free wormhole routing algorithms from two well-known store-and-forward algorithms: the positive- and negative-hop algorithms, which are based on the number of hops taken by messages. They compare the negative-hop algorithm with the commonly used non-adaptive e-cube algorithm and the recently proposed, partially adaptive north-last algorithm.
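The hop-based idea can be pictured in a few lines. This is a hedged simplification with invented names, not the paper's construction: in a negative-hop scheme, nodes are leveled so that neighbors get different levels, a hop down a level is "negative", and a message that has taken k negative hops uses buffer/virtual-channel class k. Classes never decrease along a route, which is what rules out cyclic waiting:

```python
# Sketch: negative-hop channel classes on a 2-D mesh with a checkerboard
# (two-level) node labeling. Neighboring mesh nodes always differ in level,
# so every hop is either "up" (free) or "down" (negative, bumping the class).

def level(x, y):
    return (x + y) % 2                  # checkerboard labeling of the mesh

def channel_classes(route):
    """Virtual-channel class used on each hop of a route of (x, y) nodes."""
    classes, k = [], 0
    for a, b in zip(route, route[1:]):
        classes.append(k)               # class = negative hops taken so far
        if level(*b) < level(*a):       # negative hop: descend a level
            k += 1
    return classes

# A 3-hop route alternates levels 0,1,0,1: only the middle hop is negative.
hops = channel_classes([(0, 0), (1, 0), (2, 0), (2, 1)])
# hops == [0, 0, 1]
```

Because the class sequence is monotone along every route, a message can never wait (directly or transitively) on a buffer class it has already left behind, which is the standard deadlock-freedom argument for this family of schemes.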