Pub Date: 1997-03-19 DOI: 10.1109/APDC.1997.574011
Yang Shi, Chenxi Zhang, Chunyuan Zhang
Computer numerical simulation is widely applied in engineering and social fields, where it has shown great value. Small-scale simulation applications can be processed on a traditional simulation computer, but as problem size increases, sequential processing can no longer meet the requirements. Dynamic real-time simulation and super-real-time simulation require high-performance simulation computers. In this paper we first analyse the structure of a classical simulation computer, the AD-100 developed by ADI Inc., and then propose a novel structure for a simulation computer that adopts MPP technology. Finally, an experimental result is given to test the feasibility of parallel simulation processing.
Title: The study of parallel simulation processing based on MPP technology. In: Proceedings. Advances in Parallel and Distributed Computing, 1997.
Pub Date: 1997-03-19 DOI: 10.1109/APDC.1997.574052
Guangzuo Cui, Mingzeng Hu, Xiaoming Li
This paper presents a new rapid thread replacement mechanism, which is important in multithreading technology. Analysis of the memory system indicates that memory utilization decreases as the cache hit ratio increases. The parallelism between thread computation and thread replacement is found by analyzing their working processes. Based on these observations, we propose a rapid multithread replacement mechanism that overlaps thread replacement with thread computation. In particular, with a finite number of hardware contexts, this mechanism can play the same role as infinite contexts by tolerating the replacement overhead. By modifying the general thread switching model, we build a thread replacement model and evaluate the mechanism both theoretically and experimentally. Finally, we discuss the hardware implementation and put forward problems to be resolved in the future.
Title: Parallel replacement mechanism for multithread. In: Proceedings. Advances in Parallel and Distributed Computing, 1997.
Pub Date: 1997-03-19 DOI: 10.1109/APDC.1997.574026
Jörn Eisenbiegler, Welf Löwe, A. Wehrenpfennig
We present a strategy for optimizing parallel algorithms by introducing redundant computations. To calculate the optimal amount of redundancy, we generalize the LogP model to capture messages of varying sizes, using functions instead of constants for the machine parameters. We validate our method for a wave simulation algorithm on a Parsytec PowerXplorer with eight processors and on a workstation cluster with four workstations.
Title: On the optimization by redundancy using an extended LogP model. In: Proceedings. Advances in Parallel and Distributed Computing, 1997.
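The idea of replacing the constant LogP parameters by functions of the message size can be illustrated with a small sketch. This is not the authors' model: the linear parameter functions, the halo-exchange cost formula, the one-value-per-point 1D stencil, and all names (`ExtendedLogP`, `time_per_step`, `best_redundancy`) are assumptions made for illustration. Each processor owns a block of points and keeps a halo of width `r`, so it exchanges data only every `r` steps at the price of recomputing a shrinking halo.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ExtendedLogP:
    """LogP-style machine model whose parameters depend on message size n."""
    latency: Callable[[int], float]   # L(n)
    overhead: Callable[[int], float]  # o(n), charged on send and on receive
    gap: Callable[[int], float]       # g(n), minimum spacing between sends

    def message_time(self, n: int) -> float:
        # Send overhead + network latency + receive overhead.
        return self.overhead(n) + self.latency(n) + self.overhead(n)

    def exchange_time(self, n: int) -> float:
        # Two messages (left and right neighbour); the second send is
        # delayed by the gap or by the send overhead, whichever is larger.
        return max(self.gap(n), self.overhead(n)) + self.message_time(n)


def time_per_step(model: ExtendedLogP, points: int, r: int, t_point: float) -> float:
    """Average time per simulation step with halo width r.

    Over r steps the processor updates points + 2*(r - k) values in step k,
    i.e. r*points + r*(r - 1) values in total, then exchanges r values
    with each neighbour once.
    """
    compute = t_point * (r * points + r * (r - 1))
    return (compute + model.exchange_time(r)) / r


def best_redundancy(model: ExtendedLogP, points: int, t_point: float, r_max: int) -> int:
    """Halo width in 1..r_max that minimises the average time per step."""
    return min(range(1, r_max + 1),
               key=lambda r: time_per_step(model, points, r, t_point))


# Hypothetical machine: every parameter grows linearly with message size.
example = ExtendedLogP(
    latency=lambda n: 50.0 + 0.1 * n,
    overhead=lambda n: 10.0 + 0.2 * n,
    gap=lambda n: 12.0 + 0.2 * n,
)
```

With these made-up numbers, `best_redundancy(example, 1000, 1.0, 50)` balances the fixed per-message cost against the quadratic growth of redundant work; with constant parameters the size-dependent terms would not enter the trade-off at all.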
Pub Date: 1997-03-19 DOI: 10.1109/APDC.1997.574063
Qiang Liu, Zhaoqing Zhang, Xiaomei Ji
Programming languages that make sophisticated use of pointers, such as C, are hard to analyze. Recent research on pointer analysis focuses on tracking the possible values of pointers when a program point is reached, and great progress has been achieved. However, how to apply the results of pointer analysis to dataflow analysis and other program optimization/parallelization has not been well studied. This paper presents an efficient interprocedural framework based on two insights into real C programs, and its use in deriving a context-sensitive pointer analysis algorithm and an accurate interprocedural modification side effects (MOD) computation. Based on the results of the pointer analysis, the inaccuracy induced by merging aliasing information is also studied.
Title: Eliminating two kinds of data flow inaccuracy in the presence of pointer aliasing. In: Proceedings. Advances in Parallel and Distributed Computing, 1997.
Pub Date: 1997-03-19 DOI: 10.1109/APDC.1997.574050
Jean-Noel Colin
In this paper, we present an original approach for the design and execution of distributed applications that require numerous tasks of variable grain. The approach is based on the concept of a task cluster, an entity that groups tasks with strong logical interaction and guarantees efficient communication between them. We describe the implementation of the model, which relies mainly on the use of lightweight processes as support for the distributed tasks. We also illustrate the use of the proposed approach on real-size applications, where it has improved both the ease of design and the performance.
Title: An environment for the parallel execution of multigrain clustered tasks. In: Proceedings. Advances in Parallel and Distributed Computing, 1997.
Pub Date: 1997-03-19 DOI: 10.1109/APDC.1997.574027
L. Grabowsky, W. Rehm
In the field of parallel FEM methods, a number of highly efficient solutions for distributed-memory systems exist, but the passage to adaptive parallel FEM simulations leads, in all probability, to more dynamic behaviour with respect to data placement and load balancing. A shared-memory architecture therefore seems to be a more appropriate basis for efficient implementations. This paper presents a parallelized CG method for shared-memory systems, which was implemented on a 4-processor SMP system and makes explicit use of shared memory to enhance the communication between different domains. It is based on an idea for implementing parallelization on distributed-memory systems and represents an appropriate modification of that method. The results show that increased synchronization expense can partially offset the advantages of shared-memory communication, depending on the levels of refinement and the number of processors.
Title: Efficiency issues of a parallel FEM implementation on shared memory computers. In: Proceedings. Advances in Parallel and Distributed Computing, 1997.
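To make the trade-off between shared-memory communication and synchronization concrete, here is a minimal conjugate gradient sketch in which contiguous row blocks stand in for subdomains and all vectors live in memory shared by the workers. It is an illustration only, not the paper's implementation: the function name `cg_shared`, the block partitioning, and the use of a Python thread pool are assumptions, and Python threads show the structure rather than real speedup. Every `pool.map` call acts as a barrier, which is where synchronization cost accumulates as the number of workers grows.

```python
from concurrent.futures import ThreadPoolExecutor


def cg_shared(A, b, n_workers=4, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A (list of rows)."""
    n = len(b)
    # Contiguous row blocks play the role of subdomains.
    blocks = [(w * n // n_workers, (w + 1) * n // n_workers)
              for w in range(n_workers)]
    x = [0.0] * n
    r = list(b)      # residual for the initial guess x = 0
    p = list(b)      # search direction, read by all workers
    q = [0.0] * n    # q = A p, each worker writes only its own rows

    def matvec_block(block):
        lo, hi = block
        for i in range(lo, hi):
            q[i] = sum(A[i][j] * p[j] for j in range(n))

    def dot_block(u, v, block):
        lo, hi = block
        return sum(u[i] * v[i] for i in range(lo, hi))

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        def pdot(u, v):
            # Partial sums per block, reduced after all workers finish.
            return sum(pool.map(lambda blk: dot_block(u, v, blk), blocks))

        rr = pdot(r, r)
        for _ in range(max_iter):
            if rr ** 0.5 < tol:
                break
            list(pool.map(matvec_block, blocks))   # barrier 1: q = A p
            alpha = rr / pdot(p, q)                # barrier 2: reduction
            for i in range(n):
                x[i] += alpha * p[i]
                r[i] -= alpha * q[i]
            rr_new = pdot(r, r)                    # barrier 3: reduction
            beta = rr_new / rr
            for i in range(n):
                p[i] = r[i] + beta * p[i]
            rr = rr_new
    return x
```

Because `p` is read directly from shared memory, no halo data is copied between blocks; the price is three synchronization points per iteration.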
Pub Date: 1997-03-19 DOI: 10.1109/APDC.1997.574035
Liquan Xiao, Weixia Xu, Xingming Zhou
The software overhead, which includes interprocess communication latency and the overhead of managing processes or threads, is a crucial factor affecting the performance of massively parallel processor systems. A multithreaded architecture can effectively reduce and hide the software overhead. Many models need to be implemented inside a microprocessor. In contrast, this paper addresses a multithreaded architecture adopted for current microprocessors and implements the architecture using a hardware description language. Furthermore, the paper presents its driven execution model and evaluates the performance of the presented multithreading system using a trace-driven simulator.
Title: A dual-processors multithreaded architecture and its driven execution model. In: Proceedings. Advances in Parallel and Distributed Computing, 1997.
Pub Date: 1997-03-19 DOI: 10.1109/APDC.1997.574056
J. Sogno
A parallelizing compiler based mainly on loop transformations requires dependence information that is as complete and precise as possible. In this paper, we propose a generalized method for computing, in any multi-dimensional loop, information that has proved useful in the case of irregular dependences. First, we solve the basic problem of the existence of a dependence with an algorithm composed of a preprocessing reduction phase and an integer simplex resolution. If a solution exists, we compute by integer simplex the bounds of the distances associated with the loop indices. Depending on the values of these bounds, we finally define problems that consist of evaluating the bounds of the slopes of dependence vectors, which we solve by integer linear fractional programming. The amount of computation for each new problem is very low. This algorithm has been implemented as an extension of the Janus Test, which was presented in previous work.
Title: Analysis of multidimensional loops with non-uniform dependences. In: Proceedings. Advances in Parallel and Distributed Computing, 1997.
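What "bounds of the distances associated with loop indices" means can be shown on a tiny example. The sketch below does not use integer simplex or the Janus Test; it simply enumerates the iteration space of a two-deep loop, which is only feasible for small bounds, and it ignores execution order, so it reports the raw solution set of the dependence equation. The function name `distance_bounds` and the example subscripts are assumptions chosen for illustration.

```python
from itertools import product


def distance_bounds(write_idx, read_idx, n_i, n_j):
    """Bounds of (read iteration - write iteration) over all dependences.

    write_idx and read_idx map an iteration (i, j) to the array cell it
    touches. Loop indices run from 1 to n_i and from 1 to n_j inclusive.
    Returns None when no two iterations touch the same cell, otherwise
    ((min_di, max_di), (min_dj, max_dj)).
    """
    iterations = list(product(range(1, n_i + 1), range(1, n_j + 1)))

    # Index every written cell by the iterations that write it.
    writers = {}
    for i, j in iterations:
        writers.setdefault(write_idx(i, j), []).append((i, j))

    # A dependence exists whenever a read touches a written cell.
    distances = [(ri - wi, rj - wj)
                 for ri, rj in iterations
                 for wi, wj in writers.get(read_idx(ri, rj), [])]
    if not distances:
        return None

    di = [d[0] for d in distances]
    dj = [d[1] for d in distances]
    return (min(di), max(di)), (min(dj), max(dj))


# Hypothetical loop body: A[2*i][j] = ... A[i][j + 1] ...
bounds = distance_bounds(lambda i, j: (2 * i, j),
                         lambda i, j: (i, j + 1),
                         4, 4)
```

For this loop the distance on `i` varies between iterations while the distance on `j` is constant, which is the kind of non-uniform dependence that a single distance vector cannot describe.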
Pub Date: 1997-03-19 DOI: 10.1109/APDC.1997.574053
Tao Yu, Zhizhong Tang, Chihong Zhang, Jun Luo
ILSP (Interlaced inner and outer Loop Software Pipelining) is an efficient algorithm for optimizing operations in nested loops. To ensure that ILSP has good time and space efficiency, there must be an efficient nested-loop control mechanism to support the algorithm. Our control mechanism is realized in hardware; it avoids adding many extra instructions and minimises the II (Initialization Interval) of each loop in the nested loop. In cooperation with the compiler, our nested-loop control mechanism can efficiently support the software pipelining of the nested loop and can ensure that ILSP has a high speedup and a low space cost.
Title: Control mechanism for software pipelining on nested loop. In: Proceedings. Advances in Parallel and Distributed Computing, 1997.
Pub Date: 1997-03-19 DOI: 10.1109/APDC.1997.574030
Tao Li, Ben-Wei Rong
Cache coherence and synchronization between processors have been two critical issues in designing a shared-memory multiprocessor system. From the perspective of hardware design, a directory-based cache coherence protocol and a lock mechanism are employed to prevent cache inconsistency and guarantee atomic memory accesses. The BY91-1 multiprocessor efficiently integrates support for cache coherence and hardware-based primitives by using a uniform directory scheme dubbed Dir₂NB+L. This integration allows for low hardware overhead while maintaining both a coherent cache system and indivisible memory accesses in a scalable and cohesive fashion. This paper describes the design and rationale of this versatile directory scheme. Results from the evaluation of different directory schemes on a preliminary simulator, CASIMU, demonstrate that the Dir₂NB+L scheme is cost-effective. We also report on the experience gained by implementing this directory scheme on the BY91-1 multiprocessor system. We believe that this scheme is well suited for CC-NUMA architectures.
Title: A versatile directory scheme (Dir₂NB+L) and its implementation on BY91-1 multiprocessors system. In: Proceedings. Advances in Parallel and Distributed Computing, 1997.
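A limited-pointer directory of this kind can be sketched in a few lines. The code below follows the conventional reading of the notation (two sharer pointers per memory block, no broadcast, so a third reader forces one existing sharer to be invalidated); treating "+L" as a per-entry lock field is an assumption, and the names `DirEntry`, `read_miss`, `write_miss`, `try_lock`, and `unlock` are hypothetical. It models only the directory bookkeeping, not the BY91-1 protocol messages.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_POINTERS = 2  # the "2" in the scheme's name: two sharer pointers per entry


@dataclass
class DirEntry:
    """Directory state for one memory block."""
    sharers: List[int] = field(default_factory=list)  # node ids, at most MAX_POINTERS
    dirty: bool = False                                # True if one node holds it modified
    lock_owner: Optional[int] = None                   # assumed meaning of "+L"


def read_miss(entry: DirEntry, node: int) -> List[int]:
    """Record a new reader; return the nodes whose copies must be invalidated."""
    invalidated = []
    if entry.dirty:
        # The owner writes the block back and keeps a shared copy.
        entry.dirty = False
    if node in entry.sharers:
        return invalidated
    if len(entry.sharers) == MAX_POINTERS:
        # No broadcast: free a pointer by invalidating the oldest sharer.
        invalidated.append(entry.sharers.pop(0))
    entry.sharers.append(node)
    return invalidated


def write_miss(entry: DirEntry, node: int) -> List[int]:
    """Give one node exclusive ownership; return the nodes to invalidate."""
    invalidated = [s for s in entry.sharers if s != node]
    entry.sharers = [node]
    entry.dirty = True
    return invalidated


def try_lock(entry: DirEntry, node: int) -> bool:
    """Acquire the entry's lock if it is free or already held by this node."""
    if entry.lock_owner is None:
        entry.lock_owner = node
    return entry.lock_owner == node


def unlock(entry: DirEntry, node: int) -> None:
    """Release the lock; only the current holder can do so."""
    if entry.lock_owner == node:
        entry.lock_owner = None
```

Keeping the lock field in the same entry as the sharer pointers is what lets one directory lookup serve both coherence and synchronization requests, at the cost of a few extra bits per block.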