Pub Date: 1997-03-19; DOI: 10.1109/APDC.1997.574063
Qiang Liu, Zhaoqing Zhang, Xiaomei Ji
Programming languages that make sophisticated use of pointers, such as C, are hard to analyze. Recent research on pointer analysis has focused on tracking the possible values of pointers at each program point, and great progress has been achieved. However, how to apply the results of pointer analysis to dataflow analysis and to other program optimizations and parallelization has not been well studied. This paper presents an efficient interprocedural framework based on two insights about real C programs, and uses it to derive a context-sensitive pointer analysis algorithm and an accurate interprocedural modification side effects (MOD) computation. Based on the results of the pointer analysis, the inaccuracy induced by merging aliasing information is also studied.
Title: Eliminating two kinds of data flow inaccuracy in the presence of pointer aliasing. Published in: Proceedings. Advances in Parallel and Distributed Computing.
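To make the MOD problem in this abstract concrete, the following is a minimal sketch of a flow-insensitive MOD computation that consumes points-to sets. It is not the paper's framework: the function `compute_mod`, the toy procedures `main` and `f`, and the variable names are all hypothetical, and scoping of locals is ignored. The two calls show how merged aliasing information enlarges the MOD sets.

```python
def compute_mod(direct, deref_writes, calls, points_to):
    """Return {procedure: set of variables it may modify}.

    direct       -- variables each procedure assigns by name
    deref_writes -- pointers each procedure writes through (*p = ...)
    calls        -- callees of each procedure
    points_to    -- possible targets of each pointer (from a pointer analysis)
    """
    mod = {}
    for proc in direct:
        targets = set(direct[proc])
        # A write through *p may modify anything p can point to.
        for ptr in deref_writes.get(proc, ()):
            targets |= points_to.get(ptr, set())
        mod[proc] = targets
    # Propagate callee side effects to callers until nothing changes.
    changed = True
    while changed:
        changed = False
        for proc, callees in calls.items():
            for callee in callees:
                if not mod[callee] <= mod[proc]:
                    mod[proc] |= mod[callee]
                    changed = True
    return mod


# Hypothetical program: main assigns a and calls f; f executes *p = ...
direct = {"main": {"a"}, "f": set()}
deref_writes = {"f": {"p"}}
calls = {"main": {"f"}, "f": set()}

precise = compute_mod(direct, deref_writes, calls, {"p": {"x"}})
merged = compute_mod(direct, deref_writes, calls, {"p": {"x", "y"}})
print(sorted(precise["main"]))  # ['a', 'x']
print(sorted(merged["main"]))   # ['a', 'x', 'y']
```

With the merged points-to set, `y` is reported as modified even if `p` never points to `y` in the calling context that matters, which is the kind of inaccuracy the abstract refers to.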
Pub Date: 1997-03-19; DOI: 10.1109/APDC.1997.574050
Jean-Noel Colin
In this paper, we present an original approach to the design and execution of distributed applications that require numerous tasks of variable grain. The approach is based on the concept of a task cluster, an entity that groups tasks with strong logical interaction and guarantees efficient communication between them. We describe the implementation of the model, which relies mainly on lightweight processes as support for the distributed tasks. We also illustrate the use of the proposed approach on real-size applications, where it has improved both the ease of design and the performance.
Title: An environment for the parallel execution of multigrain clustered tasks. Published in: Proceedings. Advances in Parallel and Distributed Computing.
Pub Date: 1997-03-19; DOI: 10.1109/APDC.1997.574015
K. T. Au, M. Chakravarty, J. Darlington, Yike Guo, Stefan Jähnichen, Martin Köhler, G. Keller, W. Pfannenstiel, M. Simons
This paper describes the integration of nested data parallelism into Fortran 90. Unlike flat data parallelism, nested data parallelism directly provides means for handling irregular data structures and certain forms of control parallelism, such as divide-and-conquer algorithms, thus enabling the programmer to express such algorithms far more naturally. Existing work deals with nested data parallelism in a functional environment, which helps avoid a set of problems but makes efficient implementations more complicated. Moreover, functional languages are not readily accepted by programmers used to languages such as Fortran and C, which currently predominate in programming parallel machines. In this paper, we introduce the imperative data-parallel language Fortran 90V and give an overview of its implementation.
Title: Enlarging the scope of vector-based computations: extending Fortran 90 by nested data parallelism. Published in: Proceedings. Advances in Parallel and Distributed Computing.
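As an illustration of what nested data parallelism over irregular data can look like once it is mapped to flat vectors, here is a small sketch of the widely used flattening idea: a nested sequence becomes one flat value vector plus a vector of segment lengths. The abstract does not say whether Fortran 90V uses this exact representation, and the helper names `flatten` and `segmented_sum` are hypothetical.

```python
def flatten(nested):
    """Turn a list of lists into (flat values, segment lengths)."""
    values = [x for segment in nested for x in segment]
    lengths = [len(segment) for segment in nested]
    return values, lengths


def segmented_sum(values, lengths):
    """Sum each segment of the flat vector; segments may be empty."""
    sums, start = [], 0
    for n in lengths:
        sums.append(sum(values[start:start + n]))
        start += n
    return sums


# Irregular (ragged) data: rows of different lengths.
rows = [[1, 2, 3], [], [4, 5]]
values, lengths = flatten(rows)
print(values, lengths)                 # [1, 2, 3, 4, 5] [3, 0, 2]
print(segmented_sum(values, lengths))  # [6, 0, 9]
```

The point of the flat form is that every element of `values` can be processed by an ordinary flat data-parallel operation, while `lengths` preserves the nesting.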
Pub Date: 1997-03-19; DOI: 10.1109/APDC.1997.574064
Chuanqi Zhu, B. Zang, Tong Chen
An effective automatic parallelizer is critical for users to exploit the resources of parallel computers, and research in this area has made much progress in recent years. This paper introduces AFT, a new-generation parallelizing compiler that we have developed. It integrates many advanced techniques into an effective and efficient system. The experimental results show that AFT is able to achieve notable parallelization on many programs.
Title: The design considerations and test results of AFT-a new generation parallelizing compiler. Published in: Proceedings. Advances in Parallel and Distributed Computing.
Pub Date: 1997-03-19; DOI: 10.1109/APDC.1997.574012
H. Mierendorff, Helmut Schwamborn
Automatic model generation is studied as part of a hybrid modeling strategy that uses simulation for performance analysis. Two major steps have to be carried out in this context. First, the program under investigation has to be translated into a model. During the translation, the runtime has to be estimated for numerous computational blocks of statements, which are replaced by simple delays. Second, for performance estimation, the model has to be analyzed by an evaluation tool. Both model evaluation and runtime estimation of computational blocks require the values of some variables, the control variables. We discuss the problem of automatic definition of control variables in general and consider some important cases. For the implementation of a model-generating tool, we concentrate on parallel Fortran programs that use message-passing primitives for process communication.
Title: Definition of control variables for automatic performance modeling. Published in: Proceedings. Advances in Parallel and Distributed Computing.
Pub Date: 1997-03-19; DOI: 10.1109/APDC.1997.574035
Liquan Xiao, Weixia Xu, Xingming Zhou
Software overhead, which includes interprocess communication latency and the overhead of managing processes or threads, is a crucial factor affecting the performance of massively parallel processor systems. A multithreaded architecture can effectively reduce and hide this overhead. Many such models need to be implemented inside a microprocessor. In contrast, this paper addresses a multithreaded architecture adopted for current microprocessors and implements the architecture using a hardware description language. Furthermore, the paper presents its driven execution model and evaluates the performance of the presented multithreading system using a trace-driven simulator.
Title: A dual-processors multithreaded architecture and its driven execution model. Published in: Proceedings. Advances in Parallel and Distributed Computing.
Pub Date: 1997-03-19; DOI: 10.1109/APDC.1997.574027
L. Grabowsky, W. Rehm
In the field of parallel FEM methods, a number of highly efficient solutions for distributed memory systems exist, but the move to adaptive parallel FEM simulations leads, in all probability, to more dynamic behaviour with respect to data placement and load balancing. A shared-memory architecture therefore seems to be a more appropriate basis for efficient implementations. This paper presents a parallelized CG method for shared memory systems, which was implemented on a 4-processor SMP system and makes explicit use of shared memory to enhance the communication between different domains. It is based on an idea for implementing parallelization on distributed memory systems and represents an appropriate modification of this method. The results show that increased synchronization cost can partially offset the advantages of shared memory communication, depending on the levels of refinement and the number of processors.
Title: Efficiency issues of a parallel FEM implementation on shared memory computers. Published in: Proceedings. Advances in Parallel and Distributed Computing.
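For readers unfamiliar with the CG method mentioned in this abstract, the following is a plain, serial sketch of the textbook conjugate gradient iteration for a symmetric positive definite system. It is not the paper's parallel, domain-based implementation; the function names and the 2x2 test system are illustrative assumptions. The comments mark the dot products, which in a parallel version become global reductions and therefore synchronization points.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A (dense lists)."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the zero start vector
    p = list(r)              # first search direction
    rs_old = dot(r, r)       # reduction: needs all parts of r
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)   # reduction
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)            # reduction
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(A, b)   # exact answer is [1/11, 7/11]
print(solution)
```

Each iteration contains two or three reductions, which is one reason synchronization cost can grow with the number of processors.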
Pub Date: 1997-03-19; DOI: 10.1109/APDC.1997.574030
Tao Li, Ben-Wei Rong
Cache coherence and synchronization between processors are two critical issues in designing a shared memory multiprocessor system. From the perspective of hardware design, a directory-based cache coherence protocol and a lock mechanism are employed to prevent cache inconsistency and guarantee atomic memory accesses. The BY91-1 multiprocessor efficiently integrates support for cache coherence and hardware-based primitives by using a uniform directory scheme dubbed Dir₂NB+L. This integration allows for low hardware overhead while maintaining both a coherent cache system and indivisible memory accesses in a scalable and cohesive fashion. This paper describes the design and rationale of this versatile directory scheme. An evaluation of different directory schemes on a preliminary simulator, CASIMU, demonstrates that the Dir₂NB+L scheme is cost-effective. We also report on the experience gained by implementing this directory scheme on the BY91-1 multiprocessor system. We believe that this scheme is well suited to CC-NUMA architectures.
Title: A versatile directory scheme (Dir₂NB+L) and its implementation on BY91-1 multiprocessors system. Published in: Proceedings. Advances in Parallel and Distributed Computing.
Pub Date: 1997-03-19; DOI: 10.1109/APDC.1997.574056
J. Sogno
A parallelizing compiler based mainly on loop transformations requires dependence information that is as complete and precise as possible. In this paper, we propose a generalized method for computing, in any multi-dimensional loop, information that has proved useful in the case of irregular dependences. First, we solve the basic problem of the existence of a dependence with an algorithm composed of a preprocessing reduction phase followed by an integer simplex resolution. If a solution exists, we compute by integer simplex the bounds of the distances associated with the loop indices. Depending on the values of these bounds, we finally define problems that consist of evaluating the bounds of the slopes of dependence vectors, which we solve by integer linear fractional programming. The amount of computation for each new problem is very low. This algorithm has been implemented as an extension of the Janus Test, which was presented in previous work.
Title: Analysis of multidimensional loops with non-uniform dependences. Published in: Proceedings. Advances in Parallel and Distributed Computing.
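The quantities this abstract computes (existence of a dependence, bounds on the distances, bounds on the slopes of dependence vectors) can be illustrated on a tiny loop nest by brute-force enumeration. This sketch deliberately swaps exhaustive search in for the paper's integer simplex and integer linear fractional programming, so it only works for very small bounds. The function name, the subscript functions, and the loop bound `n = 6` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def dependence_bounds(write_sub, read_sub, n):
    """Enumerate a 2-deep loop nest (1..n in both loops) and return
    bounds on dependence distances and slopes, or None if independent."""
    iterations = list(product(range(1, n + 1), repeat=2))
    distances = set()
    for w in iterations:
        for r in iterations:
            if w != r and write_sub(*w) == read_sub(*r):
                # Orient the distance from the earlier to the later
                # iteration (tuple comparison is lexicographic).
                src, dst = (w, r) if w < r else (r, w)
                distances.add((dst[0] - src[0], dst[1] - src[1]))
    if not distances:
        return None
    di = [d[0] for d in distances]
    dj = [d[1] for d in distances]
    slopes = [Fraction(d[1], d[0]) for d in distances if d[0] != 0]
    return {
        "di": (min(di), max(di)),
        "dj": (min(dj), max(dj)),
        "slope": (min(slopes), max(slopes)) if slopes else None,
    }


# Hypothetical loop body: A[2*i][j] = ... A[i][j+1] ...
bounds = dependence_bounds(lambda i, j: (2 * i, j),
                           lambda i, j: (i, j + 1), n=6)
print(bounds["di"], bounds["dj"])  # (1, 3) (-1, -1)
print(bounds["slope"])             # (Fraction(-1, 1), Fraction(-1, 3))
```

Here the distance in `i` grows with `i` itself, so the dependence is non-uniform: no single distance vector describes it, but the distance and slope bounds still do.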
Pub Date: 1997-03-19; DOI: 10.1109/APDC.1997.574053
Tao Yu, Zhizhong Tang, Chihong Zhang, Jun Luo
ILSP (Interlaced inner and outer Loop Software Pipelining) is an efficient algorithm for optimizing operations in nested loops. To ensure that ILSP achieves good time and space efficiency, an efficient nested-loop control mechanism is needed to support the algorithm. Our control mechanism is realized in hardware; it avoids adding many extra instructions and minimises the II (initiation interval) of each loop in the nest. In cooperation with the compiler, our nested-loop control mechanism can efficiently support software pipelining of nested loops and can ensure that ILSP has a high speedup and a low space cost.
Title: Control mechanism for software pipelining on nested loop. Published in: Proceedings. Advances in Parallel and Distributed Computing.
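Since the abstract turns on minimising the II of each loop, a short sketch of the standard lower bound on the initiation interval used in modulo scheduling may help. This is general background, not the ILSP algorithm or its hardware control mechanism; the function names, operation counts, unit counts, and dependence cycles below are made-up illustrative values.

```python
from math import ceil


def resource_mii(op_counts, unit_counts):
    """Bound from resources: each unit class must fit its ops into II cycles."""
    return max(ceil(op_counts[kind] / unit_counts[kind]) for kind in op_counts)


def recurrence_mii(cycles):
    """Bound from loop-carried dependence cycles.

    cycles -- list of (total latency, total iteration distance) pairs.
    A loop with no recurrence imposes no bound (0).
    """
    return max((ceil(latency / distance) for latency, distance in cycles),
               default=0)


def minimum_ii(op_counts, unit_counts, cycles):
    """The initiation interval can be no smaller than either bound."""
    return max(resource_mii(op_counts, unit_counts), recurrence_mii(cycles))


# Hypothetical loop body: 5 ALU ops on 2 ALUs, 3 memory ops on 1 port,
# and two dependence cycles.
ops = {"alu": 5, "mem": 3}
units = {"alu": 2, "mem": 1}
cycles = [(4, 1), (6, 2)]
print(resource_mii(ops, units))        # 3
print(recurrence_mii(cycles))          # 4
print(minimum_ii(ops, units, cycles))  # 4
```

Extra loop-control instructions add to the operation counts and so can raise the resource bound, which is one way to read the abstract's motivation for moving loop control into hardware.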