Matrix transpose on meshes: theory and practice
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580918
M. Kaufmann, U. Meyer, J. F. Sibeyn
Matrix transpose is a fundamental communication operation that is not handled optimally by general-purpose routing schemes. For two-dimensional meshes, the first optimal routing schedule is given. The strategy is simple enough to be implemented, but the details of the available hardware are not favorable. However, alternative algorithms, designed along the same lines, give an improvement on the Intel Paragon.
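As a concrete point of reference for the communication pattern discussed here (a minimal sketch of our own, not the paper's optimal schedule; the XY-routing baseline and function name are illustrative assumptions), the following code counts the hops generated by the transpose permutation on an n x n mesh when every packet simply travels row-first and then column-wise:

```python
# Illustrative sketch only -- not the paper's routing schedule. It shows the
# transpose pattern on an n x n mesh: the packet held by processor (i, j)
# must reach processor (j, i). As a naive baseline we count the hops taken
# when every packet follows dimension-order (XY) routing.

def transpose_hop_count(n):
    """Total hop count of the transpose permutation on an n x n mesh under
    row-first, column-second (XY) routing."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += 2 * abs(i - j)   # horizontal + vertical distance
    return total

if __name__ == "__main__":
    for n in (4, 8, 16):
        print(f"{n} x {n} mesh: {transpose_hop_count(n)} hops in total")
```

Any schedule must move each off-diagonal packet at least |i - j| hops in each dimension, which is why the transpose permutation stresses a mesh far more than nearest-neighbour traffic.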
{"title":"Matrix transpose on meshes: theory and practice","authors":"M. Kaufmann, U. Meyer, J. F. Sibeyn","doi":"10.1109/IPPS.1997.580918","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580918","url":null,"abstract":"Matrix transpose is a fundamental communication operation which is not dealt with optimally by general purpose routing schemes. For two dimensional meshes, the first optimal routing schedule is given. The strategy is simple enough to be implemented, but details of the available hardware are not favorable. However, alternative algorithms, designed along the same lines, give an improvement on the Intel Paragon.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121087019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gracefully degradable pipeline networks
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580847
R. Cypher, Ambrose Kofi Laing
A pipeline is a linear array of processors with an input node at one end and an output node at the other end. This paper presents k-gracefully-degradable graphs which, given any set of up to k faults, contain a pipeline that uses all the healthy processor nodes. Our constructions are designed to tolerate faulty input and output nodes, but they can be adapted to provide solutions when the input and output nodes are guaranteed to be healthy. All of our constructions are optimal in terms of the number of nodes and the maximum degree of the processor nodes.
{"title":"Gracefully degradable pipeline networks","authors":"R. Cypher, Ambrose Kofi Laing","doi":"10.1109/IPPS.1997.580847","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580847","url":null,"abstract":"A pipeline is a linear array of processors with an input node at one end and an output node at the other end. This paper presents k-gracefully-degradable graphs which, given any set of up to k faults, contain a pipeline that uses all the healthy processor nodes. Our constructions are designed to tolerate faulty input and output nodes, but they can be adapted to provide solutions when the input and output nodes are guaranteed to be healthy. All of our constructions are optimal in terms of the number of nodes and the maximum degree of the processor nodes.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"61 19","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134411649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High performance computational steering of physical simulations
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580866
J. Vetter, K. Schwan
Computational steering allows researchers to monitor and manage long-running, resource-intensive applications at runtime. Limited research has addressed high-performance computational steering, yet high performance is necessary for three reasons. First, a computational steering system must act intelligently at runtime in order to minimize its perturbation of the target application. Second, monitoring information extracted from the target must be analyzed and forwarded to the user in a timely fashion to allow fast decision making. Finally, steering actions must be executed with low latency to prevent undesirable feedback. The paper describes the use of language constructs, termed ACSL, within a system for computational steering. The steering system interprets ACSL statements and optimizes the requests for steering and monitoring. Specifically, the steering system, called Magellan, uses ACSL to intelligently control multithreaded, asynchronous steering servers that cooperatively steer applications. These results compare favorably to our previous Progress steering system.
{"title":"High performance computational steering of physical simulations","authors":"J. Vetter, K. Schwan","doi":"10.1109/IPPS.1997.580866","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580866","url":null,"abstract":"Computational steering allows researchers to monitor and manage long running, resource intensive applications at runtime. Limited research has addressed high performance computational steering. High performance in computational steering is necessary for three reasons. First, a computational steering system must act intelligently at runtime in order to minimize its perturbation of the target application. Second, monitoring information extracted from the target must be analyzed and forwarded to the user in a timely fashion to allow fast decision making. Finally, steering actions must be executed with low latency to prevent undesirable feedback. The paper describes the use of language constructs, coined ACSL, within a system for computational steering. The steering system interprets ACSL statements and optimizes the requests for steering and monitoring. Specifically, the steering system, called Magellan, utilizes ACSL to intelligently control multithreaded, asynchronous steering servers that cooperatively steer applications. These results compare favorably to our previous Progress steering system.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132858858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient parallel strategy for computing K-terminal reliability and finding most vital edge in 2-trees and partial 2-trees
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580963
Chin-Wen Ho, S. Hsieh, Gen-Huey Chen
The authors develop a parallel strategy to compute K-terminal reliability in 2-trees and partial 2-trees. They also solve the problem of finding the most vital edge with respect to K-terminal reliability in partial 2-trees. The algorithms take O(log n) time with C(m,n) processors on a CRCW PRAM, where C(m,n) is the number of processors required to find connected components of a graph with m edges and n vertices in logarithmic time.
{"title":"An efficient parallel strategy for computing K-terminal reliability and finding most vital edge in 2-trees and partial 2-trees","authors":"Chin-Wen Ho, S. Hsieh, Gen-Huey Chen","doi":"10.1109/IPPS.1997.580963","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580963","url":null,"abstract":"The authors develop a parallel strategy to compute K-terminal reliability in 2-trees and partial 2-trees. They also solve the problem of finding the most vital edge with respect to K-terminal reliability in partial 2-trees. The algorithms take O(log n) time with C(m,n) processors on a CRCW PRAM, where C(m,n) is the number of processors required to find connected components of a graph with m edges and n vertices in logarithmic time.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131102900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel inference on a linguistic knowledge base
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580892
S. Harabagiu, D. Moldovan
This paper presents a possible solution to the text inference problem: extracting information that is not stated explicitly in a text but is implied by it. The inference algorithm consists of a set of highly parallel search methods that, when applied to the knowledge base, find contexts of sentences that reveal information relevant to the text. Implementation, results, and an analysis of the parallelism are discussed.
{"title":"Parallel inference on a linguistic knowledge base","authors":"S. Harabagiu, D. Moldovan","doi":"10.1109/IPPS.1997.580892","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580892","url":null,"abstract":"This paper presents a possible solution for the text inference problem extracting information unstated in a text, but implied. The inference algorithm consists of a set of highly parallel search methods that when applied to the knowledge base find contexts of sentences that reveal information relevant to the text. Implementation, results and parallelism analysis are discussed.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133499764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nearly optimal one-to-many parallel routing in star networks
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580987
Chi-Chang Chen, Jianer Chen
Star networks were recently proposed as an attractive alternative to the well-known hypercube model for interconnection networks, and extensive research has shown that star networks are as versatile as hypercubes. This paper is an effort in the same direction. Building on well-known paradigms, we study the one-to-many parallel routing problem on star networks and develop an improved routing algorithm that finds n-1 node-disjoint paths between one node and a set of n-1 other nodes in the n-star network. These parallel paths are proven to be of minimum length to within a small additive constant, and the algorithm has optimal time complexity. This result significantly improves on previously known algorithms for the problem. Moreover, the algorithm illustrates an application of the orthogonal partition of star networks, which was observed by the original inventors of the star network but seems to have been generally overlooked in subsequent work. We also point out that similar problems have already been studied for hypercubes and have proven useful in designing efficient, fault-tolerant routing algorithms on hypercube networks.
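For readers unfamiliar with the topology, the sketch below (our own illustration of the standard star-graph definition, not the authors' routing algorithm) enumerates the structure the abstract assumes: the nodes of the n-star are the n! permutations of {1, ..., n}, and each node is adjacent to the n-1 permutations obtained by swapping its first symbol with the symbol in position i:

```python
from itertools import permutations

def star_neighbors(perm):
    """Neighbors of a node in the n-star graph: swap the first symbol with the
    symbol in position i, for i = 2, ..., n, giving degree n - 1."""
    neighbors = []
    for i in range(1, len(perm)):
        p = list(perm)
        p[0], p[i] = p[i], p[0]
        neighbors.append(tuple(p))
    return neighbors

if __name__ == "__main__":
    n = 4
    nodes = list(permutations(range(1, n + 1)))
    print(len(nodes), "nodes in the 4-star")     # n! = 24 nodes
    print(star_neighbors((1, 2, 3, 4)))          # its n - 1 = 3 neighbors
```

The n-1 node-disjoint paths constructed in the paper connect a source to n-1 target nodes without sharing any intermediate node, which is exactly the property that makes such routings useful for fault tolerance.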
{"title":"Nearly optimal one-to-many parallel routing in star networks","authors":"Chi-Chang Chen, Jianer Chen","doi":"10.1109/IPPS.1997.580987","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580987","url":null,"abstract":"Star networks were proposed recently as an attractive alternative to the well-known hypercube models for interconnection networks. Extensive research has been performed that shows that star networks are as versatile as hypercubes. This paper is an effort in the same direction. Based on the well-known paradigms, we study the one-to-many parallel routing problem on star networks and develop an improved routing algorithm that finds n-1 node-disjoint paths between one node and a set of other n-1 nodes in the n-star network. These parallel paths are proven of minimum length within a small additive constant, and our algorithm has an optimal time complexity. This result significantly improves the previous known algorithms for the problem. Moreover, the algorithm well illustrates an application of the orthogonal partition of star networks, which was observed by the original inventors of the star networks but seems generally overlooked in the subsequent study. We should also point out that similar problems are already studied for hypercubes and have proven useful in designing efficient and fault tolerant routing algorithms on hypercube networks.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132211428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time-stamping algorithms for parallelization of loops at run-time
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580939
Chengzhong Xu, V. Chaudhary
In this paper we present two new run-time algorithms for the parallelization of loops that have indirect access patterns. The algorithms can handle any type of loop-carried dependencies. They follow the INSPECTOR/EXECUTOR scheme and improve upon previous algorithms with the same generality by allowing concurrent reads of the same location and by increasing the overlap of dependent iterations. The algorithms are based on time-stamping rules and implemented using multithreading tools. The experimental results on an SMP server with four processors show that our schemes are efficient and outperform their competitors consistently in all test cases. The difference between the two proposed algorithms is that one allows partially concurrent reads without causing extra overhead in its inspector while the other allows fully concurrent reads at a slight overhead in the dependence analysis. The algorithm allowing fully concurrent reads obtains up to an 80% improvement over its competitor.
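To make the INSPECTOR/EXECUTOR scheme concrete, here is a minimal shared-memory sketch under simplifying assumptions (one indirect read and one indirect write per iteration, wavefronts executed sequentially); it is not the authors' time-stamping algorithm, but it shows how an inspector can scan the index arrays once and place every iteration into a wavefront so that iterations within a wavefront are independent, while reads of the same location may still share a wavefront:

```python
# Simplified inspector/executor sketch (not the authors' algorithm).
# The inspector scans the index arrays of the loop
#     for i in range(n): A[w[i]] = g(A[r[i]])
# and assigns each iteration a wavefront number; the executor then runs the
# wavefronts in order, and the iterations inside one wavefront are independent.

def inspector(w, r):
    last_write = {}   # location -> wavefront of the most recent write
    last_read = {}    # location -> latest wavefront in which it was read
    wavefront = []
    for i in range(len(w)):
        stage = 0
        if r[i] in last_write:
            stage = max(stage, last_write[r[i]] + 1)   # read-after-write
        if w[i] in last_write:
            stage = max(stage, last_write[w[i]] + 1)   # write-after-write
        if w[i] in last_read:
            stage = max(stage, last_read[w[i]] + 1)    # write-after-read
        wavefront.append(stage)
        last_write[w[i]] = stage
        last_read[r[i]] = max(last_read.get(r[i], -1), stage)
    return wavefront

def executor(A, w, r, g, wavefront):
    # Iterations sharing a wavefront only ever read common locations, so they
    # could be handed to separate threads; here they run sequentially.
    for stage in range(max(wavefront) + 1):
        for i in (i for i, s in enumerate(wavefront) if s == stage):
            A[w[i]] = g(A[r[i]])
    return A

if __name__ == "__main__":
    A = [0, 1, 2, 3, 4]
    w = [2, 3, 4, 2]            # indirect write indices
    r = [0, 0, 0, 3]            # indirect read indices
    stages = inspector(w, r)    # [0, 0, 0, 1]: three reads of A[0] share wavefront 0
    print(stages, executor(A, w, r, lambda x: x + 10, stages))
```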
{"title":"Time-stamping algorithms for parallelization of loops at run-time","authors":"Chengzhong Xu, V. Chaudhary","doi":"10.1109/IPPS.1997.580939","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580939","url":null,"abstract":"In this paper we present two new run-time algorithms for the parallelization of loops that have indirect access patterns. The algorithms can handle any type of loop-carried dependencies. They follow the INSPECTOR/EXECUTOR scheme and improve upon previous algorithms with the same generality by allowing concurrent reads of the same location and by increasing the overlap of dependent iterations. The algorithms are based on time-stamping rules and implemented using multithreading tools. The experimental results on an SMP server with four processors show that our schemes are efficient and outperform their competitors consistently in all test cases. The difference between the two proposed algorithms is that one allows partially concurrent reads without causing extra overhead in its inspector while the other allows fully concurrent reads at a slight overhead in the dependence analysis. The algorithm allowing fully concurrent reads obtains up to an 80% improvement over its competitor.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114716711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Crossbar analysis for optimal deadlock recovery router architecture
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580960
Yungho Choi, T. Pinkston
We explore the design of optimal deadlock-recovery-based fully adaptive routers by evaluating promising internal router crossbar designs. Unified and decoupled crossbar designs aimed at exploiting the full capabilities of adaptive routing are evaluated by analyzing their effect on overall network performance. We show that an enhanced hierarchical crossbar design that supports routing locality within a virtual network class achieves the highest performance at relatively low cost.
{"title":"Crossbar analysis for optimal deadlock recovery router architecture","authors":"Yungho Choi, T. Pinkston","doi":"10.1109/IPPS.1997.580960","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580960","url":null,"abstract":"We explore the design of optimal deadlock recovery-based fully adaptive routers by evaluating promising internal router crossbar designs. Unified and decoupled crossbar designs aimed at exploiting the full capabilities of adaptive routing are evaluated by analyzing their effect on overall network performance. We show that an enhanced hierarchical crossbar design that supports routing locality in virtual network class achieves highest performance with relatively low cost.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114765305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Designing efficient distributed algorithms using sampling techniques
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580932
S. Rajasekaran, David S. L. Wei
We show the power of sampling techniques in designing efficient distributed algorithms. In particular, we show that, by using sampling techniques, selection can be done on some networks in such a way that the message complexity is independent of the cardinality of the set (file), provided the file size is polynomial in the network size. For example, given a file F of size n and an integer k (1 ≤ k ≤ n), on a p-processor de Bruijn network our deterministic selection algorithm can find the kth smallest key in F using O(p log^3 p) messages with a communication delay of O(log^3 p), and our randomized selection algorithm can finish the same task using only O(p) messages and a communication delay of O(log p) with high probability. Our randomized selection outperforms existing approaches in terms of both message complexity and communication delay. The fact that the number of messages and the communication delay are independent of the file size makes our distributed selection schemes extremely attractive in domains such as very large database systems. Using our selection algorithms to select pivot elements, we also develop a near-optimal quicksort-based sorting scheme and a nearly optimal enumeration sorting scheme for sorting large distributed files on hypercube and de Bruijn networks. Our algorithms are fully distributed, without any a priori central control.
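The paper's algorithms are distributed, but the core sampling idea can be sketched on a single machine (our own illustration; the constants sample_size and slack are arbitrary assumptions): a random sample brackets the rank-k key between two pivots, everything outside the bracket is discarded, and the process repeats until the candidate set is small. In the distributed setting only the sample and a few counts cross the network, which is why the message complexity can be made independent of the file size.

```python
import random

def sample_select(keys, k, sample_size=1024, slack=50):
    """Return the k-th smallest key (1-based): repeatedly bracket the answer
    between two pivots drawn from a random sample and discard everything else."""
    keys = list(keys)
    while len(keys) > sample_size:
        sample = sorted(random.sample(keys, sample_size))
        pos = (k - 1) * sample_size // len(keys)     # estimated rank inside the sample
        lo = sample[max(0, pos - slack)]
        hi = sample[min(sample_size - 1, pos + slack)]
        below = sum(1 for x in keys if x < lo)
        inside = [x for x in keys if lo <= x <= hi]
        if below < k <= below + len(inside):         # the bracket caught the answer
            if len(inside) == len(keys):             # too few distinct keys to shrink further
                break
            keys, k = inside, k - below
        # on the (unlikely) bracket miss, simply retry with a fresh sample
    return sorted(keys)[k - 1]

if __name__ == "__main__":
    data = [random.randrange(10**9) for _ in range(200_000)]
    k = 123_456
    assert sample_select(data, k) == sorted(data)[k - 1]
```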
{"title":"Designing efficient distributed algorithms using sampling techniques","authors":"S. Rajasekaran, David S. L. Wei","doi":"10.1109/IPPS.1997.580932","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580932","url":null,"abstract":"Shows the power of sampling techniques in designing efficient distributed algorithms. In particular, we show that, by using sampling techniques, selection can be done on some networks in such a way that the message complexity is independent of the cardinality of the set (file), provided the file size is polynomial in the network size. For example, given a file F of size n and an integer k (1/spl les/k/spl les/n), on a p-processor de Bruijn network our deterministic selection algorithm can find the kth smallest key from F using O(p log/sup 3/p) messages and with a communication delay of O(log/sup 3/p), and our randomized selection algorithm can finish the same task using only O(p) messages and a communication delay of O(log p) with high probability, provided the file size is polynomial in network size. Our randomized selection outperforms the existing approaches in terms of both message complexity and communication delay. The property that the number of messages needed and the communication delay are independent of the size of the file makes our distributed selection schemes extremely attractive in such domains as very large database systems. Making use of our selection algorithms to select pivot element(s), we also develop a near-optimal quicksort-based sorting scheme and a nearly-optimal enumeration sorting scheme for sorting large distributed files on the hypercube and de Bruijn networks. Our algorithms are fully distributed without any a priori central control.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122209449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An evaluation of a commercial CC-NUMA architecture: the CONVEX Exemplar SPP1200
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580831
R. Thekkath, A. Singh, J. Singh, S. John, J. Hennessy
Studies done with academic CC-NUMA machines and simulators indicate good potential for application performance. Our goal, therefore, is to investigate whether the CONVEX Exemplar, a commercial distributed shared memory machine, lives up to the expected potential of CC-NUMA machines and, if not, to understand what architectural or implementation decisions make it less efficient. Evaluating the delivered performance of the Exemplar, we find that, while a moderate-scale Exemplar machine works well for several applications, it does not for some important classes. Furthermore, performance was affected by several fundamental characteristics of the machine, all of which stem from basic implementation and design choices made on the Exemplar: the effect of processor clustering together with limited node-to-network bandwidth, the effect of tertiary caches, the limited user control over data placement, the sequential memory consistency model together with a cache-based cache coherence protocol, and, lastly, longer remote latencies.
{"title":"An evaluation of a commercial CC-NUMA architecture-the CONVEX Exemplar SPP1200","authors":"R. Thekkath, A. Singh, J. Singh, S. John, J. Hennessy","doi":"10.1109/IPPS.1997.580831","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580831","url":null,"abstract":"Studies done with academic CC-NUMA machines and simulators indicate a good potential for application performance. Our goal therefore, is to investigate whether the CONVEX Exemplar a commercial distributed shared memory machine, lives up to the expected potential of CC-NUMA machines. If not, we would like to understand what architectural or implementation decisions make it less efficient. On evaluating the delivered performance on the Exemplar we find that, while a moderate-scale Exemplar machine works well for several applications, it does not for some important classes. Further performance was affected by four fundamental characteristics of the machine, all of which are due to basic implementation and design choices made on the Exemplar. These are: the effect of processor clustering together with limited node-to-network bandwidth, the effect of tertiary caches, the limited user control over data placement, the sequential memory consistency model together with a cache-based cache coherence protocol, and lastly, longer remote latencies.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123541512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}