A compiler-directed cache coherence scheme using data prefetching
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580970
Hock-Beng Lim, P. Yew
Enforcing cache coherence and reducing or hiding memory latency are key problems in the design of large-scale shared-memory multiprocessors. The authors propose a compiler-directed cache coherence scheme that makes use of data prefetching. The cache coherence with data prefetching (CCDP) scheme uses compiler analysis to identify potentially stale data references, i.e., references that may access invalid copies of cached data. The key idea of the CCDP scheme is to enforce cache coherence by prefetching the up-to-date data corresponding to these potentially stale references from main memory. Application case studies were conducted to quantify the performance potential of the CCDP scheme on a real system: the authors applied the scheme to four benchmark programs from the SPEC CFP95 and CFP92 suites and executed them on the Cray T3D. The experimental results show that, for the programs studied, the scheme provides significant performance improvements by caching shared data and reducing the remote shared-memory access penalty incurred by the programs.
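The transformation at the heart of the scheme can be pictured as follows. This is a minimal sketch of our own, assuming a hypothetical prefetch_coherent() intrinsic that stands in for the Cray T3D's remote-access primitives; it is not the authors' compiler output.

```c
/* Sketch of a CCDP-style transformation (illustrative, not the authors'
 * compiler output). prefetch_coherent() is a hypothetical intrinsic that
 * fetches the up-to-date value from main memory, bypassing any stale
 * cached copy; on the Cray T3D it would map onto remote-read primitives. */
#define N    1024
#define DIST 8                        /* prefetch distance, tuned to latency */

static double a[N], b[N];

static inline void prefetch_coherent(volatile const double *p)
{
    (void)*p;                         /* stub standing in for the intrinsic */
}

void ccdp_loop(int lo, int hi)
{
    /* Compiler analysis has marked reads of a[] as potentially stale:
     * another processor may have updated a[] since it was cached here. */
    for (int i = lo; i < hi; i++) {
        if (i + DIST < hi)
            prefetch_coherent(&a[i + DIST]);  /* fetch coherent data early */
        b[i] = 2.0 * a[i];            /* the read now sees up-to-date data */
    }
}
```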
{"title":"A compiler-directed cache coherence scheme using data prefetching","authors":"Hock-Beng Lim, P. Yew","doi":"10.1109/IPPS.1997.580970","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580970","url":null,"abstract":"Cache coherence enforcement and memory latency reduction and hiding are very important problems in the design of large-scale shared-memory multiprocessors. The authors propose a compiler-directed cache coherence scheme which makes use of data prefetching. The cache coherence with data prefetching (CCDP) scheme uses compiler analysis techniques to identify potentially-stale data references, which are references to invalid copies of cached data. The key idea of the CCDP scheme is to enforce cache coherence by prefetching the up-to-date data corresponding to these potentially-stale references from the main memory. Application case studies were conducted to gain a quantitative idea of the performance potential of the CCDP scheme on a real system. They applied the CCDP scheme on four benchmark programs from the SPEC CFP95 and CFP92 suites, and executed them on the Cray T3D. The experimental results show that for the programs studied, the scheme provides significant performance improvements by caching shared data and reducing the remote shared-memory access penalty incurred by the programs.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129655744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Broadcasting and multicasting in cut-through routed networks
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580989
Johanne Cohen, P. Fraigniaud, J. König, A. Raspaud
This paper addresses the one-to-all and one-to-many broadcasting problems, usually called broadcasting and multicasting, respectively. We study these problems under both the line model and the cut-through model. The former assumes long-distance calls between non-neighboring processors; the latter extends the line model by taking into account the use of a routing function. It is known that time-optimal broadcast and multicast protocols in the line model can be found in polynomial time. We present a new time-optimal broadcasting and multicasting algorithm in the line model that uses the bandwidth of the network efficiently. Moreover, it also applies to the cut-through model provided the routing function generates only shortest paths.
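For intuition: in the line model an informed node can call any other node in one round, so the number of informed nodes can double each round and broadcasting completes in ceil(log2 n) rounds. The sketch below (ours, not the paper's algorithm) prints such a recursive-doubling schedule; the paper's contribution lies in choosing the calls so that the underlying paths use network bandwidth efficiently, which this naive schedule ignores.

```c
/* Naive time-optimal broadcast schedule in the line model: every informed
 * node calls one uninformed node per round (recursive doubling). */
#include <stdio.h>

void schedule_broadcast(int n)            /* nodes 0..n-1, source is node 0 */
{
    int informed = 1;
    for (int round = 0; informed < n; round++) {
        int newly = 0;
        for (int s = 0; s < informed && informed + newly < n; s++, newly++)
            printf("round %d: node %d -> node %d\n",
                   round, s, informed + newly);
        informed += newly;                /* doubles until all are informed */
    }
}

int main(void) { schedule_broadcast(10); return 0; }  /* 4 = ceil(log2 10) rounds */
```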
{"title":"Broadcasting and multicasting in cut-through routed networks","authors":"Johanne Cohen, P. Fraigniaud, J. König, A. Raspaud","doi":"10.1109/IPPS.1997.580989","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580989","url":null,"abstract":"This paper addresses the one-to-all broadcasting problem, and the one-to-many broadcasting problem, usually simply called broadcasting and multicasting, respectively. In this paper, we study these problems under both line model, and cut-through model. The former assumes long distance calls between non neighboring processors. The latter completes the line model by taking into account the use of a routing function. It is known that one can find time optimal broadcast and multicast protocols in the line model in polynomial time. We present a new time optimal broadcasting and multicasting algorithm in the line model. This algorithm efficiently uses the bandwidth of the network. Moreover, it also applies to the cut-through model as soon as the routing function generates shortest paths only.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127547804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing parallel bitonic sort
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580914
M. Ionescu, K. Schauser
Sorting is an important component of many applications, and parallel sorting algorithms have been studied extensively over the last three decades. One of the earliest parallel sorting algorithms is bitonic sort, represented by a sorting network consisting of multiple butterfly stages. The paper studies bitonic sort on modern parallel machines, which are relatively coarse-grained and consist of only a modest number of nodes, thus requiring many data elements to be mapped to each processor. In such a setting, optimizing bitonic sort becomes a question of mapping data elements to processing nodes (data layout) so that communication is minimized. The authors developed a bitonic sort algorithm that minimizes the number of communication steps and optimizes the local computation. The resulting algorithm is faster than previous implementations, as experimental results collected on a 64-node Meiko CS-2 show.
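In the coarse-grained setting, each compare-exchange of the bitonic network acts on whole blocks of n/p keys: the two partner processors merge their sorted blocks, and one keeps the lower half while the other keeps the upper half ("merge-split"). Below is a sketch of that building block, under our simplifying assumption that both blocks are locally visible; in the real algorithm one block arrives via a message exchange.

```c
/* Merge-split: merge two sorted blocks of 'len' keys; block a keeps the
 * smaller half, block b the larger half. How often this step must cross
 * the network is exactly the data-layout question the paper optimizes. */
#include <stdlib.h>
#include <string.h>

void merge_split(int *a, int *b, int len)
{
    int *tmp = malloc(2 * len * sizeof *tmp);
    int i = 0, j = 0;
    for (int k = 0; k < 2 * len; k++)       /* standard two-way merge */
        tmp[k] = (j >= len || (i < len && a[i] <= b[j])) ? a[i++] : b[j++];
    memcpy(a, tmp,       len * sizeof *a);  /* keep-low side  */
    memcpy(b, tmp + len, len * sizeof *b);  /* keep-high side */
    free(tmp);
}
```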
{"title":"Optimizing parallel bitonic sort","authors":"M. Ionescu, K. Schauser","doi":"10.1109/IPPS.1997.580914","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580914","url":null,"abstract":"Sorting is an important component of many applications, and parallel sorting algorithms have been studied extensively in the last three decades. One of the earliest parallel sorting algorithms is bitonic sort, which is represented by a sorting network consisting of multiple butterfly stages. The paper studies bitonic sort on modern parallel machines which are relatively coarse grained and consist of only a modest number of nodes, thus requiring the mapping of many data elements to each processor. Under such a setting optimizing the bitonic sort algorithm becomes a question of mapping the data elements to processing nodes (data layout) such that communication is minimized. The authors developed a bitonic sort algorithm which minimizes the number of communication steps and optimizes the local computation. The resulting algorithm is faster than previous implementations, as experimental results collected on a 64 node Meiko CS-2 show.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122350813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low latency MPI for Meiko CS/2 and ATM clusters
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580929
Chris R. Jones, Ambuj K. Singh, D. Agrawal
MPI (Message Passing Interface) is a proposed message-passing standard for the development of efficient and portable parallel programs. An implementation of MPI is presented and evaluated for the Meiko CS/2, a 64-node parallel computer, and a network of 8 SGI workstations connected by an ATM switch and an Ethernet.
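Latency figures for such an implementation are typically obtained with a ping-pong microbenchmark of the following shape (our sketch using only standard MPI calls; it is not the authors' measurement harness):

```c
/* Two-rank ping-pong: halve the round-trip time to estimate one-way latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, reps = 1000;
    char byte = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) / (2.0 * reps) * 1e6);
    MPI_Finalize();
    return 0;
}
```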
{"title":"Low latency MPI for Meiko CS/2 and ATM clusters","authors":"Chris R. Jones, Ambuj K. Singh, D. Agrawal","doi":"10.1109/IPPS.1997.580929","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580929","url":null,"abstract":"MPI (Message Passing Interface) is a proposed message-passing standard for the development of efficient and portable parallel programs. An implementation of MPI is presented and evaluated for the Meiko CS/2, a 64-node parallel computer, and a network of 8 SGI workstations connected by an ATM switch and an Ethernet.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"275 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133268019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantics and implementation of a generalized forall statement for parallel languages
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580953
P. Dechering, L. Breebaart, F. Kuijlman, K. V. Reeuwijk, H. Sips
In this paper we present a generalized forall statement for parallel languages. The forall statement occurs in many (data-)parallel languages and specifies which computations can be performed independently. Many different definitions of such a construct can be found in the literature, with different conditions and execution models. We show how the forall constructs of a wide class of parallel languages can be mapped onto this generalized forall statement. In addition, the forall statement we propose can spawn more complex independent activities than those found in these languages. Denotational semantics are used to define the meaning of the forall and to ensure it admits only one possible program state change. We show that the construct is easy to use and that it can be implemented efficiently.
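The independence requirement pins down the semantics: every iteration must observe the state as it was before the forall began, so the statement effects a single well-defined state change regardless of execution order. A minimal sketch of that semantics in C, realized by double buffering (our illustration, not the paper's implementation):

```c
/* forall (i in 1..N-2)  a[i] = a[i-1] + a[i+1];
 * Each iteration reads the pre-state, so iterations are independent and
 * may execute in any order or in parallel. */
#include <string.h>

#define N 100

void forall_stencil(double a[N])
{
    double old[N];
    memcpy(old, a, sizeof old);          /* snapshot the pre-state */
    for (int i = 1; i < N - 1; i++)
        a[i] = old[i - 1] + old[i + 1];  /* all reads hit the snapshot */
}
```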
{"title":"Semantics and implementation of a generalized forall statement for parallel languages","authors":"P. Dechering, L. Breebaart, F. Kuijlman, K. V. Reeuwijk, H. Sips","doi":"10.1109/IPPS.1997.580953","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580953","url":null,"abstract":"In this paper we present a generalized forall statement for parallel languages. The forall statement occurs in many (data) parallel languages and specifies which computations can be performed independently. Many different definitions of such a construct can be found in literature, with different conditions and execution models. We will show how forall constructs of a wide class of parallel languages can be mapped to this generalized forall statement. In addition, the forall statement we propose has the ability to spawn more complex independent activities than can be found in these languages. Denotational semantics are used to define the meaning of the forall and define only one possible program state change. It is shown that it is easy to use and that it is feasible to implement this forall efficiently.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131772276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparing gang scheduling with dynamic space sharing on symmetric multiprocessors using automatic self-allocating threads (ASAT)
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580911
C. Severance, R. Enbody
The work considers how best to handle a diverse mix of multi-threaded and single-threaded jobs running on a single symmetric multiprocessing system. The traditional approaches to this problem are free scheduling, gang scheduling, and space sharing. The paper examines a less common technique called dynamic space sharing. One approach to dynamic space sharing, automatic self-allocating threads (ASAT), is compared to the traditional approaches to scheduling a mixed load of jobs. Performance results for ASAT scheduling, gang scheduling, and free scheduling are presented; ASAT scheduling is shown to be the superior approach to mixing multi-threaded and single-threaded work.
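As we read the abstract, the core of ASAT is that each job periodically resizes its own thread team against the observed system load, so a mixed workload converges toward one runnable thread per processor without central coordination. A hedged sketch of that check; runnable_threads() and set_team_size() are hypothetical hooks, not an actual ASAT API:

```c
/* Self-allocation check, called at convenient points such as the top of a
 * parallel loop. A real implementation would query the OS run queue and
 * resize the team via the threads library. */
extern int  runnable_threads(void);   /* hypothetical: system-wide runnable count */
extern void set_team_size(int n);     /* hypothetical: resize this job's team     */

void asat_adjust(int nprocs, int my_team)
{
    int load = runnable_threads();
    if (load > nprocs && my_team > 1)
        set_team_size(my_team - 1);   /* system oversubscribed: give a CPU back */
    else if (load < nprocs)
        set_team_size(my_team + 1);   /* idle processors: claim one more */
}
```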
{"title":"Comparing gang scheduling with dynamic space sharing on symmetric multiprocessors using automatic self-allocating threads (ASAT)","authors":"C. Severance, R. Enbody","doi":"10.1109/IPPS.1997.580911","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580911","url":null,"abstract":"The work considers the best way to handle a diverse mix of multi-threaded and single-threaded jobs running on a single symmetric parallel processing system. The traditional approaches to this problem are free scheduling, gang scheduling, or space sharing. The paper examines a less common technique called dynamic space sharing. One approach to dynamic space sharing, automatic self allocating threads (ASAT), is compared to all of the traditional approaches to scheduling a mixed load of jobs. Performance results for ASAT scheduling, gang scheduling, and free scheduling are presented. ASAT scheduling is shown to be the superior approach to mixing multi-threaded work with single threaded work.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"82 1-2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114121537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On privatization of variables for data-parallel execution
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580952
Manish Gupta
Privatization of data is an important technique that has been used by compilers to parallelize loops by eliminating storage-related dependences. When a compiler partitions computations based on the ownership of data, selecting a proper mapping of privatizable data is crucial to obtaining the benefits of privatization. This paper presents a novel framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. We show that there are numerous alternatives available for mapping privatized variables and the choice of mapping can significantly affect the performance of the program. We present an algorithm that attempts to preserve parallelism and minimize communication overheads. We also introduce the concept of partial privatization of arrays that combines data partitioning and privatization, and enables efficient handling of a class of codes with multi-dimensional data distribution that was not previously possible. Finally, we show how the ideas of privatization apply to the execution of control flow statements as well. An implementation of these ideas in the pHPF prototype compiler for High Performance Fortran on the IBM SP2 machine has shown impressive results.
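The kind of dependence that privatization removes can be seen in a loop that uses a temporary array wholly within each outer iteration; once each processor holds a private copy, the outer loop parallelizes. This is a generic illustration of the technique, not pHPF output; how the private copies are then mapped across processors is the choice the paper's algorithm addresses.

```c
/* t[] and s are written before being read in every outer iteration, so
 * they are privatizable: with one shared copy the outer loop would be
 * serialized by storage reuse; with per-processor copies it is parallel. */
#define N 64

void scale_rows(double a[N][N], double rowsum[N])
{
    double t[N];                      /* privatizable temporary */
    for (int i = 0; i < N; i++) {     /* parallel after privatization */
        double s = 0.0;               /* privatizable scalar */
        for (int j = 0; j < N; j++) {
            t[j] = 2.0 * a[i][j];     /* write t[] ... */
            s += t[j];
        }
        rowsum[i] = s;                /* ... and read it in the same iteration */
    }
}
```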
{"title":"On privatization of variables for data-parallel execution","authors":"Manish Gupta","doi":"10.1109/IPPS.1997.580952","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580952","url":null,"abstract":"Privatization of data is an important technique that has been used by compilers to parallelize loops by eliminating storage-related dependences. When a compiler partitions computations based on the ownership of data, selecting a proper mapping of privatizable data is crucial to obtaining the benefits of privatization. This paper presents a novel framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. We show that there are numerous alternatives available for mapping privatized variables and the choice of mapping can significantly affect the performance of the program. We present an algorithm that attempts to preserve parallelism and minimize communication overheads. We also introduce the concept of partial privatization of arrays that combines data partitioning and privatization, and enables efficient handling of a class of codes with multi-dimensional data distribution that was not previously possible. Finally, we show how the ideas of privatization apply to the execution of control flow statements as well. An implementation of these ideas in the pHPF prototype compiler for High Performance Fortran on the IBM SP2 machine has shown impressive results.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114272923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A hybrid interconnection network for integrated communication services
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580924
Yilong Chen, Jyh-Charn S. Liu
This paper presents a hybrid interconnection network architecture to support integrated communication services for multicomputer-based database and multimedia systems. Our study shows that existing wormhole-routing networks are inefficient at transferring long files. We demonstrate the feasibility of integrating different network techniques based on virtual channels and flexible routing mechanisms.
{"title":"A hybrid interconnection network for integrated communication services","authors":"Yilong Chen, Jyh-Charn S. Liu","doi":"10.1109/IPPS.1997.580924","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580924","url":null,"abstract":"This paper presents a hybrid interconnection network architecture to support integrated communication services for multicomputer-based database and multimedia systems. Our study shows that existing wormhole routing networks are inefficient in transfer of long files. We demonstrate the feasibility of integrating different network techniques based on virtual channels and flexible routing mechanisms.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"521 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124483496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A tool for on-line visualization and interactive steering of parallel HPC applications
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580882
S. Rathmayer
Tools for parallel systems today range from specification through debugging to performance analysis and beyond. Typically, they help the programmers of parallel algorithms from the early development stages up to a certain level of program optimization. In HPC (High Performance Computing), however, the end-users of massively parallel CFD (Computational Fluid Dynamics) programs have little or no support in their work. The scientific engineer who runs an application on a parallel computer somewhere in the WAN (Wide Area Network) and visualizes the enormous amounts of simulation data on a graphical workstation in the LAN (Local Area Network) has needs that are far from covered by state-of-the-art visualization systems. The tool proposed here departs completely from the existing batch-oriented, strictly sequential working process in the application cycle of parallel HPC applications: it allows both on-line visualization and interactive steering of massively parallel CFD applications. The parameters of the mathematical model and the numerical methods form objects of a database that can be accessed by an object-oriented graphical user interface via visualization and modification operators. Experiences with this new tool concept, VIPER (VIsualization of Parallel numerical simulation algorithms for Extended Research), applied to a real-world industrial scientific application are reported.
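The steering pattern the abstract describes can be sketched as follows: the simulation registers its steerable parameters, polls for GUI modifications between timesteps, and exports fields for on-line visualization. All names here (steer_publish, steer_poll, steer_export) are hypothetical stand-ins, not VIPER's actual interface.

```c
#include <stdbool.h>

/* Hypothetical steering hooks; a real system backs these with a database
 * of parameter objects shared with the graphical front end. */
extern void steer_publish(const char *name, double *value);
extern bool steer_poll(void);   /* apply pending GUI edits; true if any */
extern void steer_export(const char *name, const double *field, int n);

void simulate(double *field, int n, int steps)
{
    double relax = 1.2;                       /* a steerable solver knob */
    steer_publish("relaxation", &relax);
    for (int t = 0; t < steps; t++) {
        steer_poll();                         /* pick up interactive changes */
        /* ... one solver timestep using 'relax' ... */
        steer_export("field", field, n);      /* feed on-line visualization */
    }
}
```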
{"title":"A tool for on-line visualization and interactive steering of parallel HPC applications","authors":"S. Rathmayer","doi":"10.1109/IPPS.1997.580882","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580882","url":null,"abstract":"Tools for parallel systems today range from specification over debugging to performance analysis and more. Typically, they help the programmers of parallel algorithms from the early development stages to a certain level of program optimization. However in HPC (High Performance Computing) today the end-user of massively parallel CFD (Computational Fluid Dynamics)-programs has little or no support in his work. The scientific engineer who often runs his application on a parallel computer somewhere in the WAN (Wide Area Network) and visualizes the enormous amounts of simulation data on a graphical workstation in his LAN (Local Area Network) has needs which are by far not covered by state of the art visualization systems. The tool proposed here follows a strategy which differs completely from existing, batch-oriented and strictly sequential methods of the working process in the application cycle of parallel HPC applications. It allows both on-line visualization and interactive program steering of massively parallel CFD-applications. The parameters of the mathematical model and the numerical methods build objects of a database which can be accessed by an object-oriented graphical user interface via visualization and modification operators. Experiences with this new tool concept VIPER (VIsualization of Parallel numerical simulation algorithms for Extended Research) applied on a real-world and industrial scientific application will be shown.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123529551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alias analysis for Fortran90 array slices
Pub Date: 1997-04-01 | DOI: 10.1109/IPPS.1997.580967
K. Gopinath, R. Seshadri
Most alias analyses produce approximate results in the presence of array slices. This can lead to inefficient code, which is of particular concern in languages like Fortran90. The authors present an overview of a static alias analysis that gives accurate results in the presence of array slices in Fortran90.
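For single-strided slices the underlying question is whether two arithmetic progressions intersect, which can be answered exactly rather than approximately. A sketch of the classic GCD-based disjointness test for positive strides (our illustration of the flavor of analysis involved, not the authors' full algorithm):

```c
/* Slices a(l1:u1:s1) and a(l2:u2:s2), strides positive. Returns 0 when the
 * slices are provably disjoint; 1 when a common element may exist. The GCD
 * condition is necessary for l1 + i*s1 == l2 + j*s2 to have a solution. */
static int gcd(int a, int b) { while (b) { int t = a % b; a = b; b = t; } return a; }

int slices_may_alias(int l1, int u1, int s1, int l2, int u2, int s2)
{
    int lo = l1 > l2 ? l1 : l2;       /* the index ranges must meet at all */
    int hi = u1 < u2 ? u1 : u2;
    if (lo > hi)
        return 0;
    if ((l2 - l1) % gcd(s1, s2) != 0) /* no integer solution exists */
        return 0;
    return 1;
}
```

For example, a(1:9:2) and a(2:10:2) touch odd and even indices respectively: gcd(2,2) = 2 does not divide 2 - 1 = 1, so the test proves them disjoint where a coarser analysis would conservatively merge them.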
{"title":"Alias analysis for Fortran90 array slices","authors":"K. Gopinath, R. Seshadri","doi":"10.1109/IPPS.1997.580967","DOIUrl":"https://doi.org/10.1109/IPPS.1997.580967","url":null,"abstract":"Most alias analyses produce approximate results in the presence of array slices. This may lead to inefficient code which is of concern, especially, in languages like Fortran90. The authors present an overview of a static alias analysis that gives accurate results in the presence of array slices in Fortran90.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"301 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129315292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}