Platform-independent runtime optimizations using OpenThreads
Pub Date: 1997-04-01 · DOI: 10.1109/IPPS.1997.580941
M. Haines, K. Langendoen
Although platform-independent runtime systems for parallel programming languages are desirable, the need for low-level optimizations usually precludes their existence. This is because most optimizations involve some combination of low-level communication and low-level threading, the product of which is almost always platform-dependent. We propose a solution to the threading half of this dilemma by using a thread package that allows fine-grain control over the behaviour of the threads while still providing performance comparable to hand-tuned, machine-dependent thread packages. This makes it possible to construct platform-independent thread modules for parallel runtime systems and, more importantly, to optimize them.
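A minimal sketch, in Python for brevity, of what a platform-independent thread module with this shape might look like. The abstract does not give the OpenThreads API, so every name below is hypothetical; the point is only the interface a runtime system could target while backends map it onto machine-tuned packages.

```python
# Hypothetical portable thread-module interface (not the paper's actual
# OpenThreads API): a runtime system programs against spawn()/barrier(),
# while a backend is free to map them onto a hand-tuned native package.
import threading
import queue

class ThreadModule:
    def __init__(self, num_workers=4):
        self.tasks = queue.Queue()
        self.workers = [threading.Thread(target=self._run, daemon=True)
                        for _ in range(num_workers)]
        for w in self.workers:
            w.start()

    def _run(self):
        # worker loop: pull queued tasks and execute them
        while True:
            fn, args = self.tasks.get()
            fn(*args)
            self.tasks.task_done()

    def spawn(self, fn, *args):
        # fine-grain control point: a real implementation could expose
        # priorities, stack sizes, or scheduling policy here
        self.tasks.put((fn, args))

    def barrier(self):
        # wait until all spawned tasks have completed
        self.tasks.join()

if __name__ == "__main__":
    tm = ThreadModule()
    for i in range(8):
        tm.spawn(print, "task", i)
    tm.barrier()
```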
{"title":"Platform-independent runtime optimizations using OpenThreads","authors":"M. Haines, K. Langendoen","doi":"10.1109/IPPS.1997.580941","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580941","url":null,"abstract":"Although platform-independent runtime systems for parallel programming languages are desirable, the need for low-level optimizations usually precludes their existence. This is because most optimizations involve some combination of low-level communication and low-level threading the product of which is almost always platform-dependent. We propose a solution to the threading half of this dilemma by using a thread package, that allows fine-grain control over the behaviour of the threads while still providing performance comparable to hand-tuned, machine-dependent thread packages. This makes it possible to construct platform-independent thread modules for parallel runtime systems and, more importantly, to optimize them.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132601294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S-Check: a tool for tuning parallel programs
Pub Date: 1997-04-01 · DOI: 10.1109/IPPS.1997.580861
R. Snelick
We present a novel tool, called S-Check, for identifying performance bottlenecks in parallel and networked programs. S-Check is a highly automated sensitivity-analysis tool that extends benchmarking and conventional profiling. It predicts how refinements in parts of a program will affect performance by making focal changes in code efficiencies and correlating these against overall program performance. This analysis is a sophisticated comparison that catches interactions arising from shared resources or communication links. S-Check's performance assessment ranks code segments as "bottlenecks" according to their sensitivity to these code-efficiency changes. The rank-ordered list serves as a guide for tuning applications. In practice, S-Check analysis yields faster parallel programs. A case study compares and contrasts sensitivity analyses of the same program on different architectures and offers solutions for performance improvement. An initial implementation of S-Check runs on Silicon Graphics multiprocessors and IBM SP machines. Particulars of the underlying methodology are only sketched; the main emphasis is on the details of the tool and its use.
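The sensitivity-analysis idea is easy to illustrate. A toy sketch (our construction, not S-Check's interface or methodology): slow one code segment at a time by a focal factor and rank segments by the effect on total runtime.

```python
# Illustration of sensitivity analysis via focal efficiency changes:
# perturb each segment individually and compare against the baseline.
import time

def program(slowdown):
    # two stand-in code segments; slowdown maps a segment name to a
    # multiplicative "efficiency change" applied to that segment only
    start = time.perf_counter()
    for _ in range(10):
        time.sleep(0.05 * slowdown.get("a", 1.0))   # segment a
        time.sleep(0.01 * slowdown.get("b", 1.0))   # segment b
    return time.perf_counter() - start

base = program({})
for seg in ("a", "b"):
    t = program({seg: 1.5})        # 50% focal slowdown of one segment
    # larger relative impact on total runtime = higher bottleneck rank
    print(f"segment {seg}: sensitivity {(t - base) / base:+.2f}")
```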
{"title":"S-Check: a tool for tuning parallel programs","authors":"R. Snelick","doi":"10.1109/IPPS.1997.580861","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580861","url":null,"abstract":"We present a novel tool, called S-Check, for identifying performance bottlenecks in parallel and networked programs. S-Check is a highly-automated sensitivity analysis tool for programs that extends benchmarking and conventional profiling. It predicts how refinements in parts of a program are going to affect performance by making focal changes in code efficiencies and correlating these against overall program performance. This analysis is a sophisticated comparison that catches interactions arising from shared resources or communication links. S-Check's performance assessment ranks code segments \"bottleneck\" according to their sensitivity to the code efficiency changes. This rank-ordered list serves as a guide for tuning applications. In practice, S-Check code analysis yields faster parallel programs. A case study compares and contrasts sensitivity analyses of the same program on different architectures and offers solutions for performance improvement. An initial implementation of S-Check runs on Silicon Graphics multiprocessors and IBM SP machines. Particulars of the underlying methodology are only sketched with main emphasis given to details of the tool S-Check and its use.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132157255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A comparison of parallel approaches for algebraic factorization in logic synthesis
Pub Date: 1997-04-01 · DOI: 10.1109/IPPS.1997.580973
Subhasish Subhasish, P. Banerjee
Algebraic factorization is an extremely important part of any logic synthesis system, but it is computationally expensive. Hence, it is important to look to parallel processing to speed up the procedure. This paper presents three different parallel algorithms for algebraic factorization. The first algorithm replicates the circuit and uses a divide-and-conquer strategy. The second performs totally independent factorization on different circuit partitions, with no interaction among the partitions. The third represents a compromise between the two: it uses a novel L-shaped partitioning strategy that provides some interaction among the rectangles obtained in the various partitions. For a large circuit like ex1010, the last algorithm runs 11.5 times faster than the sequential kernel-extraction algorithms of the SIS sequential circuit synthesis system on six processors, with less than 0.2% degradation in the quality of the results.
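The second algorithm's pattern, factoring partitions with no interaction, is straightforward to sketch. The factorization body below is a stub (real kernel extraction as in SIS is far more involved), and the partition contents are invented for illustration.

```python
# Independent-partitions pattern: each circuit partition is factored in
# parallel with no communication between workers.
from multiprocessing import Pool

def factor_partition(cubes):
    # stub for algebraic factorization of one partition: we merely find
    # the literals shared by all cubes, standing in for kernel extraction
    shared = set.intersection(*(set(c) for c in cubes))
    return sorted(shared)

if __name__ == "__main__":
    partitions = [["ab", "ac"], ["bd", "cd"], ["ab", "bd"]]
    with Pool(processes=3) as pool:
        factors = pool.map(factor_partition, partitions)  # no interaction
    print(factors)  # common literals per partition: ['a'], ['d'], ['b']
```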
{"title":"A comparison of parallel approaches for algebraic factorization in logic synthesis","authors":"Subhasish Subhasish, P. Banerjee","doi":"10.1109/IPPS.1997.580973","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580973","url":null,"abstract":"Algebraic factorization is an extremely important part of any logic synthesis system, but it is computationally expensive. Hence, it is important to look at parallel processing to speed up the procedure. This paper presents three different parallel algorithms for algebraic factorization. The first algorithm uses circuit replication and uses a divide-and-conquer strategy. A second algorithm uses totally independent factorization on different circuit partitions with no interactions among the partitions. A third algorithm represents a compromise between the two approaches. It uses a novel L-shaped partitioning strategy which provides some interaction among the rectangles obtained in various partitions. For a large circuit like ex1010, the last algorithm runs 11.5 times faster over the sequential kernel extraction algorithms of the SIS sequential circuit synthesis system on six processors with less than 0.2% degradation in quality of the results.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132184696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysis of several scheduling algorithms under the nano-threads programming model
Pub Date: 1997-04-01 · DOI: 10.1109/IPPS.1997.580909
X. Martorell, Jesús Labarta, N. Navarro, E. Ayguadé
The authors present an analysis, in a dynamic processor-allocation environment, of four scheduling algorithms running on top of the nano-threads programming model. Three of them are well known: uniform-sized chunking, guided self-scheduling and trapezoid self-scheduling. The fourth is their own proposal: adaptable-size chunking. In this environment, applications are automatically decomposed into tasks by a parallelizing compiler, which uses the hierarchical task graph to represent the source application. The parallel code is an executable representation of this graph, supported by a user-level library (the nano-threads library). The execution environment includes a user-level process (the CPU manager) which controls the allocation of processors to applications. The analysis of the scheduling algorithms shows that it is possible to provide enough information to the library to allow fast adaptation to dynamic changes in the processors allocated to an application.
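The three well-known rules can be stated precisely as chunk-size sequences. A sketch using the standard published formulas for these schemes (the paper's own adaptable-size chunking is not specified in the abstract and is omitted here):

```python
# Chunk-size sequences for the three classical self-scheduling rules,
# for n loop iterations on p processors.
import math

def fixed_chunks(n, chunk):
    # uniform-sized chunking: constant chunk size
    while n > 0:
        c = min(chunk, n); n -= c; yield c

def guided(n, p):
    # guided self-scheduling: each grab takes ceil(remaining / p)
    while n > 0:
        c = max(1, math.ceil(n / p)); n -= c; yield c

def trapezoid(n, p):
    # trapezoid self-scheduling: chunks shrink linearly from f = n/(2p)
    # down to l = 1
    f, l = max(1, n // (2 * p)), 1
    steps = max(1, math.ceil(2 * n / (f + l)))   # number of chunks
    d = (f - l) / max(1, steps - 1)              # linear decrement
    c = f
    while n > 0:
        take = min(n, max(1, round(c))); n -= take; yield take
        c -= d

for name, gen in [("fixed", fixed_chunks(100, 10)),
                  ("guided", guided(100, 4)),
                  ("trapezoid", trapezoid(100, 4))]:
    print(name, list(gen))
```

Smaller final chunks are what let these schemes balance load when the processors available to the application change at run time.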
{"title":"Analysis of several scheduling algorithms under the nano-threads programming model","authors":"X. Martorell, Jesús Labarta, N. Navarro, E. Ayguadé","doi":"10.1109/IPPS.1997.580909","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580909","url":null,"abstract":"The authors present the analysis, in a dynamic processor allocation environment, of four scheduling algorithms running on top of the nano-threads programming model. Three of them are well-known: uniform-sized chunking, guided self-scheduling and trapezoid self-scheduling. The fourth is their proposal: adaptable size chunking. In that environment, applications are automatically decomposed into tasks by a parallelizing compiler which uses the hierarchical task graph to represent the source application. The parallel code is an executable representation of this graph with the support of a user-level library (the nano-threads library). The execution environment includes a user-level process (CPU manager) which controls the allocation of processors to applications. The analysis of the scheduling algorithms shows it is possible to provide enough information to the library to allow a fast adaptation to dynamic changes in the processors allocated to the application.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132194915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal wormhole routing in the (n,d)-torus
Pub Date: 1997-04-01 · DOI: 10.1109/IPPS.1997.580921
Stefan Bock, F. Heide, C. Scheideler
The authors consider wormhole routing in a d-dimensional torus of side length n. In particular, they present an optimal randomized algorithm for routing worms of length up to O(n/(d log n)^2), one per node, to random destinations. Previous algorithms only work optimally for two dimensions, or are a factor of log n away from the optimal running time. As a by-product, they develop an algorithm for the 2-dimensional torus that guarantees an optimal runtime for worms of length up to O(n/(log n)^2) with much higher probability than all previous algorithms.
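For readability, the two length bounds set out in LaTeX, under our reading of the extraction residue "/sup 2/" as a square:

```latex
% Maximum worm length (one worm per node, random destinations) for which
% the randomized algorithms achieve optimal routing time:
\[
  L_{d\text{-dim}} = O\!\left(\frac{n}{(d \log n)^{2}}\right),
  \qquad
  L_{2\text{-dim}} = O\!\left(\frac{n}{(\log n)^{2}}\right).
\]
```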
{"title":"Optimal wormhole routing in the (n,d)-torus","authors":"Stefan Bock, F. Heide, C. Scheideler","doi":"10.1109/IPPS.1997.580921","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580921","url":null,"abstract":"The authors consider wormhole routing in a d-dimensional torus of side length n. In particular they present an optimal randomized algorithm for routing worms of length up to O(n/(d log n)/sup 2/), one per node, to random destinations. Previous algorithms only work optimally for two dimensions, or are a factor of log n away from the optimal running time. As a by-product they develop an algorithm for the 2-dimensional torus that guarantees an optimal runtime for worms of length up to O(n/(log n)/sup 2/) with much higher probability than all previous algorithms.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127052040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extensible message passing application development and debugging with Python
Pub Date: 1997-04-01 · DOI: 10.1109/IPPS.1997.580971
D. Beazley, P. Lomdahl
The authors describe how they have parallelized Python, an interpreted object-oriented scripting language, and used it to build an extensible message-passing molecular dynamics application for the CM-5, the Cray T3D, and Sun multiprocessor servers running MPI. This allows one to interact with large-scale message-passing applications, rapidly prototype new features, and perform application-specific debugging. It is even possible to write message-passing programs in Python itself. They describe some of the tools they have developed to extend Python, and the results of this approach.
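The authors' 1997 Python/MPI extension itself is not shown in the abstract. As a modern analogue of "message-passing programs in Python itself", here is the same idea with today's mpi4py package (run with `mpiexec -n 2 python script.py`):

```python
# Message passing written directly in Python: rank 0 steers, rank 1
# computes, in the interactive style the paper describes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # interactively steering a simulation could start from a loop like this
    comm.send({"cmd": "step", "dt": 0.5}, dest=1, tag=0)
    result = comm.recv(source=1, tag=1)
    print("worker replied:", result)
elif rank == 1:
    msg = comm.recv(source=0, tag=0)
    comm.send({"status": "ok", "echo": msg["cmd"]}, dest=0, tag=1)
```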
{"title":"Extensible message passing application development and debugging with Python","authors":"D. Beazley, P. Lomdahl","doi":"10.1109/IPPS.1997.580971","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580971","url":null,"abstract":"The authors describe how they have parallelized Python, an interpreted object oriented scripting language, and used it to build an extensible message-passing molecular dynamics application for the CM-5, Cray T3D, and Sun multiprocessor servers running MPI. This allows one to interact with large-scale message-passing applications, rapidly prototype new features, and perform application specific debugging. It is even possible to write message passing programs in Python itself. They describe some of the tools they have developed to extend Python and results of this approach.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115649930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modeling compiled communication costs in multiplexed optical networks
Pub Date: 1997-04-01 · DOI: 10.1109/IPPS.1997.580850
C. Salisbury, R. Melhem
Improvements in optical technology will enable the construction of high-bandwidth, low-latency switching networks. These networks have many applications in massively parallel processing. However, current circuit-switching and packet-switching techniques are not well suited to controlling such networks. Time-division multiplexing (TDM) schemes can improve the performance of circuit-switched optical interconnection networks by taking advantage of the locality of reference present in the communication patterns. In this paper we construct a model for the cost of compiled communications in circuit-switched networks. We show how the cost is affected by the characteristics of the network and by the application's communication locality, and how a compiler can use this information to choose the most appropriate multiplexing degree.
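The trade-off the compiler faces can be made concrete with a toy cost function. All terms and constants below are illustrative, not the paper's model: a higher multiplexing degree k amortizes circuit establishment over more connection patterns, but each time-multiplexed circuit then sees only 1/k of the link bandwidth.

```python
# Toy compiled-communication cost model: pick the multiplexing degree k
# that balances circuit-setup cost against reduced per-circuit bandwidth.
import math

def phase_cost(patterns, k, setup=100.0, msg_bytes=16.0, bw=1.0):
    reconfigs = math.ceil(patterns / k)          # circuit establishments
    transfer = patterns * msg_bytes / (bw / k)   # each pattern sees bw/k
    return reconfigs * setup + transfer

patterns = 8  # distinct connection patterns in one communication phase
best = min(range(1, patterns + 1), key=lambda k: phase_cost(patterns, k))
print("best multiplexing degree:", best)
```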
{"title":"Modeling compiled communication costs in multiplexed optical networks","authors":"C. Salisbury, R. Melhem","doi":"10.1109/IPPS.1997.580850","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580850","url":null,"abstract":"Improvements in optical technology will enable the construction of high bandwidth, low latency switching networks. These networks have many applications in massively parallel processing. However current circuit switching and packet switching techniques are not quite suitable for controlling such networks. Time division multiplexing (TDM) schemes can improve the performance of circuit switched optical interconnection networks by taking advantage of the locality of references present in the communication patterns. In this paper we construct a model for the cost of compiled communications in circuit switched networks. We show how the cost is affected by the characteristics of the network and by the application's communication locality of references. We show how a compiler can use this information to choose the most appropriate multiplexing degree.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114425522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient sorting and routing on reconfigurable meshes using restricted bus length
Pub Date: 1997-04-01 · DOI: 10.1109/IPPS.1997.580985
M. Kunde, Kay Guertzig
Sorting and balanced routing problems for synchronous mesh-like processor networks with reconfigurable buses are considered. Motivated by the argument that broadcasting along buses of arbitrary length within unit time is rather unrealistic, we consider basic problems on reconfigurable meshes that can be solved efficiently even with restricted bus length. It is shown that on r-dimensional reconfigurable meshes of side length n with bus length bounded by a constant l, the h-h sorting and routing problem can be solved within hn+o(hrn) steps in any case, and in hn/2+o(hrn) steps with high probability, provided that hl ≥ 4r. This result rests on a data-concentration method explained in the paper, and it holds even for certain very light loadings, i.e. with significantly fewer than one element per processor on average. Extensions to two-dimensional reconfigurable meshes with diagonal links are considered.
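The bounds restated cleanly in LaTeX, reading the extraction residue "/spl ges/" as the relation ≥:

```latex
% Step counts for h-h sorting/routing on an r-dimensional reconfigurable
% mesh of side length n with bus length bounded by the constant l:
\[
  T_{\text{worst}} = hn + o(hrn), \qquad
  T_{\text{w.h.p.}} = \tfrac{1}{2}hn + o(hrn),
  \qquad \text{provided } hl \ge 4r .
\]
```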
{"title":"Efficient sorting and routing on reconfigurable meshes using restricted bus length","authors":"M. Kunde, Kay Guertzig","doi":"10.1109/IPPS.1997.580985","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580985","url":null,"abstract":"Sorting and balanced routing problems for synchronous mesh-like processor networks with reconfigurable buses are considered. Induced by the argument that broadcasting along buses of arbitrary length within unit time seems rather non-realistic, we consider basic problems on reconfigurable meshes that can be solved efficiently even with restricted bus length. It is shown that on r-dimensional reconfigurable meshes of side length n with bus length bounded to a constant l the h-h sorting and routing problem can be solved within hn+o(hrn) steps in any case and in hn/2+o(hrn) steps with high probability, provided that hl/spl ges/4r. This result is due to a data concentration method that is explained in the paper and it will hold even for certain very light loadings, i.e. with significantly less than one elements per processor on average. Extensions to two-dimensional reconfigurable meshes with diagonal links are considered.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"12368 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123547457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SuperWeb: towards a global Web-based parallel computing infrastructure
Pub Date: 1997-04-01 · DOI: 10.1109/IPPS.1997.580858
K. Schauser, C. Scheiman, G. Park, B. Shirazi, J. Marquis
The Internet, best known to most users as the World Wide Web, continues to expand at an amazing pace. We propose a new infrastructure to harness its combined resources, such as CPU cycles or disk storage, and make them available to everyone interested. This infrastructure has the potential to support parallel supercomputing applications involving thousands of cooperating components. Our approach is based on recent advances in Internet connectivity and on the implementation of safe distributed computing embodied in languages such as Java. We have developed a prototype of a global computing infrastructure, called SuperWeb, that consists of hosts, brokers and clients. Hosts register a fraction of their computing resources (CPU time, memory, bandwidth, disk space) with resource brokers. Client computations are then mapped by the broker onto the registered resources. We examine an economic model for trading computing resources, and discuss several technical challenges associated with such a global computing environment.
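A hypothetical sketch of the host/broker/client division of labor described above. SuperWeb's real interfaces are not given in the abstract, so all names and the first-fit policy below are ours.

```python
# Broker that accepts resource offers from hosts and maps client
# computations onto them (first-fit, for illustration only).
class Broker:
    def __init__(self):
        self.hosts = []  # registered resource offers

    def register(self, host_id, cpu_ms, mem_mb):
        # a host donates a fraction of its resources
        self.hosts.append({"id": host_id, "cpu_ms": cpu_ms, "mem_mb": mem_mb})

    def map_job(self, need_cpu_ms, need_mem_mb):
        # map a client computation onto the first host that can hold it
        for h in self.hosts:
            if h["cpu_ms"] >= need_cpu_ms and h["mem_mb"] >= need_mem_mb:
                h["cpu_ms"] -= need_cpu_ms
                return h["id"]
        return None  # no capacity registered

broker = Broker()
broker.register("host-a", cpu_ms=5000, mem_mb=64)
broker.register("host-b", cpu_ms=2000, mem_mb=32)
print(broker.map_job(need_cpu_ms=1500, need_mem_mb=16))  # -> host-a
```

An economic model like the one the paper examines would replace the first-fit rule with pricing of the offered resources.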
{"title":"SuperWeb: towards a global Web-based parallel computing infrastructure","authors":"K. Schauser, C. Scheiman, G. Park, B. Shirazi, J. Marquis","doi":"10.1109/IPPS.1997.580858","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580858","url":null,"abstract":"The Internet, best known by most users as the World-Wide-Web, continues to expand at an amazing pace. We propose a new infrastructure to harness the combined resources, such as CPU cycles or disk storage, and make them available to everyone interested. This infrastructure has the potential for solving parallel supercomputing applications involving thousands of cooperating components. Our approach is based on recent advances in Internet connectivity and the implementation of safe distributed computing embodied in languages such as Java. We developed a prototype of a global computing infrastructure, called SuperWeb, that consists of hosts, brokers and clients. Hosts register a fraction of their computing resources (CPU time, memory, bandwidth, disk space) with resource brokers. Client computations are then mapped by the broker onto the registered resources. We examine an economic model for trading computing resources, and discuss several technical challenges associated with such a global computing environment.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128479331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal scheduling for UET-UCT generalized n-dimensional grid task graphs
Pub Date: 1997-04-01 · DOI: 10.1109/IPPS.1997.580872
T. Andronikos, N. Koziris, G. Papakonstantinou, P. Tsanakas
The n-dimensional grid is one of the most representative patterns of data flow in parallel computation. The most frequently used scheduling model for grids is unit execution time-unit communication time (UET-UCT). We enhance the n-dimensional grid model by adding extra diagonal edges. First, we calculate the optimal makespan for the generalized UET-UCT grid topology, and then we establish the minimum number of processors required to achieve that optimal makespan. Furthermore, we solve the scheduling problem for generalized n-dimensional grids by proposing an optimal time-and-space scheduling strategy. We thus prove that UET-UCT scheduling of generalized n-dimensional grids is tractable with low complexity.
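The UET-UCT timing rule itself is compact: every task runs for one unit, a result crossing processors costs one extra unit, and co-locating a task with its latest-finishing parent avoids that parent's communication delay. A sketch of this rule on a small diagonal-enhanced 2D grid (a standard UET-UCT device, not the paper's optimal strategy, which is not reproduced in the abstract):

```python
# Earliest-finish times under UET-UCT on a 2D grid with diagonal edges:
# task (i,j) depends on (i-1,j), (i,j-1) and the diagonal (i-1,j-1).
from itertools import product

def grid_dag(rows, cols):
    nodes = list(product(range(rows), range(cols)))
    preds = {n: [p for p in ((n[0]-1, n[1]), (n[0], n[1]-1),
                             (n[0]-1, n[1]-1))
                 if p[0] >= 0 and p[1] >= 0] for n in nodes}
    return nodes, preds

def uet_uct_makespan(nodes, preds):
    finish = {}
    for n in nodes:  # lexicographic order is topological for the grid
        ps = preds[n]
        if not ps:
            finish[n] = 1  # unit execution time, no inputs
        else:
            latest = max(ps, key=lambda p: finish[p])
            # co-locate with the latest parent (no delay on that edge);
            # all other incoming results pay one unit of communication
            finish[n] = 1 + max(finish[p] + (0 if p == latest else 1)
                                for p in ps)
    return max(finish.values())

nodes, preds = grid_dag(4, 4)
print("makespan estimate:", uet_uct_makespan(nodes, preds))
```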
{"title":"Optimal scheduling for UET-UCT generalized n-dimensional grid task graphs","authors":"T. Andronikos, N. Koziris, G. Papakonstantinou, P. Tsanakas","doi":"10.1109/IPPS.1997.580872","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580872","url":null,"abstract":"The n-dimensional grid is one of the most representative patterns of data flow in parallel computation. The most frequently used scheduling models for grids is the unit execution-unit communication time (UET-UCT). We enhance the model of n-dimensional grid by adding extra diagonal edges. First, we calculate the optimal makespan for the generalized UET-UCT grid topology and then we establish the minimum number of processors required, to achieve the optimal makespan. Furthermore, we solve the scheduling problem for generalized n-dimensional grids by proposing an optimal time and space scheduling strategy. We thus prove that UET-UCT scheduling of generalized n-dimensional grids is low complexity tractable.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128755370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}