Title: Benchmarking parallel simulation algorithms
Pub Date: 1995-04-19 | DOI: 10.1109/ICAPP.1995.472248
L. Barriga, R. Rönngren, R. Ayani
Parallel simulation has been an active research area for more than a decade. The parallel simulation community needs a common benchmark suite for performance evaluation of parallel simulation environments. Evaluating a parallel simulation environment is harder than evaluating a parallel processing system, since the underlying system consists not only of the architecture and operating system but also of the simulation kernel. Simulation kernel designers therefore face a twofold task: (i) to evaluate how efficiently their simulation kernel runs on certain architectures; and (ii) to evaluate how simulation problems scale using this kernel. In this paper we advocate an incremental benchmarking methodology focused on the evaluation of a parallel simulation system based on Time Warp. We start from a reduced set of ping models that can effectively estimate the various overheads, contention and latencies of Time Warp running on a multiprocessor. The benchmark suite has been used to locate several sources of overhead in an existing Time Warp implementation. Using this benchmark suite we also compare the performance of the improved version of the Time Warp implementation with the original one.
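To make the notion of a ping model concrete, here is a minimal sequential sketch (hypothetical Python; the function name and parameters are illustrative, not part of the paper's benchmark suite): two logical processes bounce a single timestamped event back and forth, so the measured event rate isolates the simulator's per-event overhead rather than model computation.

```python
import heapq
import time

def ping_model(num_events=100_000, lookahead=1.0):
    """Two LPs exchange one timestamped message back and forth.

    Because each event does no model computation, the events/second
    figure mostly reflects the simulator's per-event overhead
    (queue operations, message handling), which is what a ping
    benchmark is meant to isolate.
    """
    event_queue = []                      # (timestamp, destination LP)
    heapq.heappush(event_queue, (0.0, 0))
    processed = 0

    wall_start = time.perf_counter()
    while processed < num_events:
        timestamp, lp = heapq.heappop(event_queue)
        # The only "model" work: bounce the event to the other LP.
        heapq.heappush(event_queue, (timestamp + lookahead, 1 - lp))
        processed += 1
    wall_elapsed = time.perf_counter() - wall_start

    return processed / wall_elapsed       # committed events per second

if __name__ == "__main__":
    print(f"{ping_model():,.0f} events/s (sequential baseline)")
```

In a Time Warp setting the same model would additionally exercise rollback, state saving and GVT computation, which is how the individual overheads can be teased apart.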
{"title":"Benchmarking parallel simulation algorithms","authors":"L. Barriga, R. Rönngren, R. Ayani","doi":"10.1109/ICAPP.1995.472248","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472248","url":null,"abstract":"Parallel simulation has been an active research area for more than a decade. The parallel simulation community needs a common benchmark suite for performance evaluation of parallel simulation environments. Performance evaluation of a parallel simulation environment is harder than evaluating a parallel processing system, since the underlying system is nor only composed of architecture and operating system, but also of simulation kernel. Thus, simulation kernel designers often confront a twofold task: (i) to evaluate how efficiently their simulation kernel runs on certain architectures; and (ii) to evaluate how simulation problems scale using this kernel In this paper we advocate an incremental benchmarking methodology that focuses on the evaluation of a parallel simulation system which is based on Time Warp. We start from a reduced set of ping models that can effectively estimate the various overheads, contention and latencies of Time Warp running on a multiprocessor. The benchmark suite has been used to locate several sources of overhead in an existing Time Warp implementation. Using this benchmark suite we also compare the performance of the improved version of the Time Warp implementation with the original one.","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129181572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: An introduction to the analysis and debug of distributed computations
Pub Date: 1995-04-19 | DOI: 10.1109/ICAPP.1995.472239
E. Fromentin, N. Plouzeau, M. Raynal
Distributed programs are much more difficult to design, understand and implement than sequential or parallel ones. This is mainly due to the uncertainty created by the asynchrony inherent in distributed machines. Appropriate concepts and tools therefore have to be devised to help the programmer of distributed applications. This paper is motivated by the practical problem of distributed debugging. It presents concepts and tools that help the programmer analyze distributed executions. Two basic problems are addressed: the replay of a distributed execution (how to reproduce an equivalent execution despite asynchrony) and the detection of stable or unstable properties of a distributed execution. The concepts and tools presented are fundamental when designing an environment for distributed program development. This paper is essentially a survey presenting the state of the art in replay mechanisms and in the detection of unstable properties on global states of distributed executions.
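Most replay and detection techniques of this kind rely on logical (vector) clocks to order events and to decide which local states can belong to the same consistent global state. A minimal sketch of the standard vector-clock rules, in hypothetical Python (process and function names are illustrative):

```python
# Vector-clock bookkeeping: the usual basis for ordering the events of a
# distributed execution and for testing whether two local states are
# concurrent, i.e. may belong to the same consistent global state.

def local_event(clock, pid):
    """Process pid performs an internal event."""
    clock = clock.copy()
    clock[pid] += 1
    return clock

def send_event(clock, pid):
    """Send: tick the local component and piggyback the clock."""
    clock = local_event(clock, pid)
    return clock, clock.copy()            # (new local clock, timestamp on message)

def receive_event(clock, pid, msg_clock):
    """Receive: component-wise max with the piggybacked clock, then tick."""
    merged = [max(a, b) for a, b in zip(clock, msg_clock)]
    merged[pid] += 1
    return merged

def concurrent(c1, c2):
    """Two events are concurrent iff neither clock dominates the other."""
    return (any(a < b for a, b in zip(c1, c2)) and
            any(a > b for a, b in zip(c1, c2)))

# Example with two processes P0 and P1:
p0 = [0, 0]
p1 = [0, 0]
p0, msg = send_event(p0, 0)               # P0 sends to P1
p1 = receive_event(p1, 1, msg)            # P1 receives
print(concurrent(p0, p1))                 # False: the send happens before the receive
```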
{"title":"An introduction to the analysis and debug of distributed computations","authors":"E. Fromentin, N. Plouzeau, M. Raynal","doi":"10.1109/ICAPP.1995.472239","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472239","url":null,"abstract":"Distributed programs are much more difficult to design, understand and implement than sequential or parallel ones. This is mainly due to the uncertainty created by the asynchrony inherent to distributed machines. So appropriate concepts and tools have to be devised to help the programmer of distributed applications in his task. This paper is motivated by the practical problem called distributed debugging. It presents concepts and tools that help the programmer to analyze distributed executions. Two basic problems are addressed: replay of a distributed execution (how to reproduce an equivalent execution despite of asynchrony) and the detection of a stable or unstable property of a distributed execution. Concepts and tools presented are fundamental when designing an environment for distributed program development. This paper is essentially a survey presenting a state of the art in replay mechanisms and detection of unstable properties on global states of distributed executions.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124526956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Integrating memory consistency models and communication systems
Pub Date: 1995-04-19 | DOI: 10.1109/ICAPP.1995.472235
F. Schon
The shared memory paradigm offers a well-known programming model for parallel systems, but conventional implementations perform poorly when it is used in large-grain or page-based systems. The main problems are (1) the transparent view at the system level, (2) the false sharing caused by locating several consistency units in the same transportation unit, and (3) the fact that high-level software implementations are not integrated within the system architecture. The first point is addressed by annotating programming objects and deriving a specific configuration of system functionalities. The second point is solved by GAME, the General and Autonomous Merging Environment, which allows a multiple-reader, multiple-writer approach. The third point is addressed by three implementation models of GAME. A hardware-based implementation, and even a software-based one, can hide the cost of GAME's local activities behind the network latency.
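For context only: one widely used way to realize a multiple-reader, multiple-writer protocol is twin/diff merging, where each writer records only the words it changed and disjoint updates to the same page are folded together at synchronization, which removes false sharing. The sketch below (hypothetical Python) illustrates that generic idea; it is not a description of GAME's actual mechanism.

```python
# Twin/diff merging for a multiple-writer shared page.
# Each writer diffs its modified copy against the twin it took when it
# started writing; applying the diffs to the master page merges disjoint
# updates without transferring or invalidating the whole page.

PAGE_WORDS = 8

def make_twin(page):
    return list(page)

def make_diff(twin, modified):
    """Record only the words this writer actually changed."""
    return {i: new for i, (old, new) in enumerate(zip(twin, modified)) if old != new}

def apply_diffs(master, *diffs):
    merged = list(master)
    for diff in diffs:
        for index, value in diff.items():
            merged[index] = value
    return merged

master = [0] * PAGE_WORDS

# Two writers modify disjoint words of the same page concurrently.
twin_a, copy_a = make_twin(master), list(master)
twin_b, copy_b = make_twin(master), list(master)
copy_a[1] = 11                             # writer A touches word 1
copy_b[6] = 66                             # writer B touches word 6

master = apply_diffs(master, make_diff(twin_a, copy_a), make_diff(twin_b, copy_b))
print(master)                              # [0, 11, 0, 0, 0, 0, 66, 0]
```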
{"title":"Integrating memory consistency models and communication systems","authors":"F. Schon","doi":"10.1109/ICAPP.1995.472235","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472235","url":null,"abstract":"The shared memory paradigm offers a well known programming model for parallel systems. But it lacks from its bad performance in conventional implementations if it is used in large grain or page based systems. The main problems are (1) the transparent view on the system level, (2) the false sharing caused by locating several consistency units into the same transportation unit, and that (3) high level software implementations are not integrated within the system architecture. The first point is addressed by annotating programming objects and deriving a specific configuration of system functionalities. The second point is solved by GAME, the General and Autonomous Merging Environment which allows a multiple reader, multiple writer approach. The third point is directed by three implementation models of GAME. A hardware based implementation and even a software based implementation are able to hide the costs of the local activities to perform GAME by the network latency.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116225618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Synthesis of systolic arrays from single assignment algorithm
Pub Date: 1995-04-19 | DOI: 10.1109/ICAPP.1995.472164
A. Al-Khalili
A systematic method for mapping single-assignment algorithms onto systolic arrays is presented. The method is based on a space-time mapping of the index sets. We present a method for generating and selecting a valid transform dependency matrix that yields an optimal or near-optimal systolic array once it is mapped. The proposed method increases the visibility of the architecture in terms of processor delay and interprocessor communication at the algorithmic level, so that the designer is able to select a desired array at early stages of the design. An example of the proposed method is given.
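The space-time mapping can be summarized as follows: a transformation matrix maps every index point of the single-assignment algorithm to a (time, processor) pair, and the mapping is valid if every dependency advances time by at least one step. A small illustrative example (hypothetical Python with an assumed schedule and allocation, not taken from the paper):

```python
import numpy as np

# Space-time mapping of index points: row 0 of T is the schedule vector
# (when an index point is computed), row 1 is the allocation vector
# (which processor computes it).
T = np.array([[1, 1],      # schedule   lambda = (1, 1)
              [0, 1]])     # allocation sigma  = (0, 1)

# Dependency vectors of the single-assignment algorithm (columns).
D = np.array([[1, 0],
              [0, 1]])

# Validity: every dependency must advance time by at least one step.
assert np.all(T[0] @ D >= 1), "schedule violates a dependency"

# Map a few index points (i, j) to (time, processor).
for i in range(3):
    for j in range(3):
        t, p = T @ np.array([i, j])
        print(f"index ({i},{j}) -> time {t}, processor {p}")
```

Enumerating candidate matrices and keeping only those that pass the validity test, then scoring them by total execution time and processor count, is the generic selection procedure the paper's method refines.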
{"title":"Synthesis of systolic arrays from single assignment algorithm","authors":"A. Al-Khalili","doi":"10.1109/ICAPP.1995.472164","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472164","url":null,"abstract":"A systematic method of mapping algorithms from single assignment algorithms into systolic arrays is presented. The method is based on a space-time mapping technique of the index sets. We present a method of generation and selection of a valid transform dependency matrix that will yield an optimal or near optimal systolic array once it is mapped. The proposed method increases the visibility of the architecture in terms of processor delay and communication between processors at the algorithmic level, so that the designer is able to select a desired array at early stages of the design. An example of the proposed method is given.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125834884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Handling data skew in parallel hash join computation using two-phase scheduling
Pub Date: 1995-04-19 | DOI: 10.1109/ICAPP.1995.472237
Xiaofang Zhou, M. Orlowska
A large number of parallel join algorithms have been proposed to maintain load balancing in the presence of data skew. However, one important type of data skew, join product skew (JPS), has been little studied. In this paper, a dynamic parallel join algorithm, which employs a two-phase scheduling procedure, is designed to handle the JPS problem. Two sets of scheduling heuristics are studied against various parameters. It is shown that many of the existing algorithms can be regarded as special cases of our algorithm, whose cost depends on the nature of the data skew. The algorithm can cope with JPS, which other algorithms cannot handle, yet it is as efficient as most existing algorithms when JPS is absent.
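As a rough illustration of the two-phase idea (hypothetical Python; the cost model and heuristic are assumptions, not the paper's actual heuristics): phase one estimates each hash partition's join output, which is where join product skew shows up, and phase two assigns partitions to processors largest-first so that heavily skewed partitions are spread across nodes.

```python
import heapq

def two_phase_schedule(build_sizes, probe_sizes, num_procs):
    """Phase 1: estimate per-partition join output (a proxy for JPS).
    Phase 2: longest-processing-time assignment of partitions to processors."""
    # Phase 1: estimated cost of partition i ~ build_i * probe_i.
    costs = sorted(
        ((b * p, i) for i, (b, p) in enumerate(zip(build_sizes, probe_sizes))),
        reverse=True,
    )
    # Phase 2: greedy LPT -- always give the next-largest partition to the
    # currently least-loaded processor.
    loads = [(0, proc, []) for proc in range(num_procs)]
    heapq.heapify(loads)
    for cost, part in costs:
        load, proc, parts = heapq.heappop(loads)
        heapq.heappush(loads, (load + cost, proc, parts + [part]))
    return sorted(loads, key=lambda x: x[1])

# One heavily skewed partition (0) plus several small ones.
schedule = two_phase_schedule(
    build_sizes=[5000, 100, 120, 90, 110, 95],
    probe_sizes=[8000, 110, 100, 95, 105, 90],
    num_procs=3,
)
for load, proc, parts in schedule:
    print(f"processor {proc}: partitions {parts}, estimated cost {load}")
```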
{"title":"Handling data skew in parallel hash join computation using two-phase scheduling","authors":"Xiaofang Zhou, M. Orlowska","doi":"10.1109/ICAPP.1995.472237","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472237","url":null,"abstract":"A large number of parallel join algorithms has been proposed to maintain load-balancing in the presence of data skew. However, one important type of data skew-join product skew (JPS)-has been little studied. In this paper, a dynamic parallel join algorithm, which employs a two-phase scheduling procedure, is designed to handle the JPS problem. Two sets of scheduling heuristics are studied against various parameters. It is shown that many of the existing algorithms can be regarded as a special case of our algorithm, whose cost is based on the nature of data skew. While it can cope with JPS which other algorithms cannot approach, it can be as efficient as most existing algorithms when JPS does not exist.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127987246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: On the acceleration of stencil operations in the data-parallel solution of PDEs
Pub Date: 1995-04-19 | DOI: 10.1109/ICAPP.1995.472192
D. Harrar
We propose some non-standard yet straightforward and highly effective alternative modes of data assignment which significantly reduce communication volume, and hence execution time, for stencil operations (local iterative updates) implemented within a data-parallel programming environment. Performance results obtained in the solution of two three-dimensional elliptic partial differential equations (PDEs) using iterative methods entailing such updates indicate that substantial performance gains can be realized with these alternative data assignment schemes.
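A simple way to see why data assignment matters for stencils: for a 7-point stencil, the per-sweep communication (halo) volume is proportional to the surface area of each processor's block, so the shape of the assignment, not just its size, determines the communication cost. The sketch below (hypothetical Python/numpy; the decompositions are illustrative and unrelated to the paper's CM-specific layouts) compares a slab and a cubic block of equal volume and shows the local update they surround.

```python
import numpy as np

def halo_volume(block_shape):
    """Words exchanged per sweep for a 7-point stencil: one face layer
    in each of the six directions (interior partitions)."""
    nx, ny, nz = block_shape
    return 2 * (nx * ny + ny * nz + nx * nz)

# Same number of grid points per processor, different shapes
# (256^3 grid split over 64 processors).
slab = (256, 256, 4)      # 1-D (slab) decomposition
cube = (64, 64, 64)       # 3-D (block) decomposition
print("slab halo:", halo_volume(slab))    # 135168 words per sweep
print("cube halo:", halo_volume(cube))    #  24576 words per sweep

def jacobi_sweep(u):
    """One 7-point Jacobi update on the interior of a 3-D array."""
    return (u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1] +
            u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1] +
            u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:]) / 6.0

u = np.zeros((66, 66, 66))
u[0, :, :] = 1.0                           # a simple boundary condition
u[1:-1, 1:-1, 1:-1] = jacobi_sweep(u)      # local work between halo exchanges
```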
{"title":"On the acceleration of stencil operations in the data-parallel solution of PDEs","authors":"D. Harrar","doi":"10.1109/ICAPP.1995.472192","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472192","url":null,"abstract":"We propose some non-standard, yet straightforward, and highly efficacious alternative modes of data assignment which induce a significant reduction in communication volume and hence in execution time for stencil operations, i.e. local iterative updates, implemented within a data-parallel programming environment. Performance results obtained in the solution of two three-dimensional elliptic partial differential equations (PDEs) using iterative methods entailing such updates indicate that substantial performance increases can be realized using these alternative data assignment schemes.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"518 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133825483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Block-level prediction for wide-issue superscalar processors
Pub Date: 1995-04-19 | DOI: 10.1109/ICAPP.1995.472179
S. Dutta, M. Franklin
Changes in control flow, caused primarily by conditional branches, are a prime impediment to the performance of wide-issue superscalar processors. This paper investigates a block-level prediction scheme to mitigate the effects of control flow changes caused by conditional branches. Instead of predicting the outcome of each conditional branch individually, this scheme predicts the target of a sequential block of instructions, thereby allowing the superscalar processor to go past multiple branches per cycle. The approach is evaluated on the MIPS architecture for 8-way and 12-way superscalar processors, and an improvement in effective fetch size of approximately 15% and 25%, respectively, is observed over identical processors that use branch prediction. No appreciable difference in prediction accuracy was observed, even though block-level prediction must select one out of four outcomes.
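A toy software model of the idea (hypothetical Python; the table organization is assumed and is not the paper's hardware design): rather than a taken/not-taken predictor per branch, keep a table indexed by the address of a fetch block that predicts which of up to four successor blocks comes next, so one prediction covers all the branches inside the block.

```python
# Toy block-level predictor: for each fetch block we remember up to four
# successor blocks and predict the one seen most often, updating counts
# as the real outcome becomes known.
from collections import defaultdict, Counter

class BlockPredictor:
    def __init__(self, max_targets=4):
        self.table = defaultdict(Counter)   # block address -> successor counts
        self.max_targets = max_targets

    def predict(self, block_addr):
        counts = self.table[block_addr]
        return counts.most_common(1)[0][0] if counts else None

    def update(self, block_addr, actual_next_block):
        counts = self.table[block_addr]
        if actual_next_block in counts or len(counts) < self.max_targets:
            counts[actual_next_block] += 1

# Simulated trace of fetch-block transitions: block 0x100 usually falls
# through to 0x140, occasionally branches to 0x200.
trace = [(0x100, 0x140)] * 8 + [(0x100, 0x200)] * 2
predictor, hits = BlockPredictor(), 0
for block, next_block in trace:
    hits += predictor.predict(block) == next_block
    predictor.update(block, next_block)
print(f"accuracy: {hits}/{len(trace)}")
```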
{"title":"Block-level prediction for wide-issue superscalar processors","authors":"S. Dutta, M. Franklin","doi":"10.1109/ICAPP.1995.472179","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472179","url":null,"abstract":"Changes in control flow, caused primarily by conditional branches, are a prime impediment to the performance of wide-issue superscalar processors. This paper investigates a block-level prediction scheme to mitigate the effects of control flow changes caused by conditional branches. Instead of predicting the outcome of each conditional branch individually, this scheme predicts the target of a sequential block of instructions, thereby allowing the superscalar processor to go past multiple branches per cycle. This approach is evaluated using the MIPS architecture, for 8-way and 12-way superscalar processors, and an improvement in effective fetch size of approximately 15% and 25%, respectively, over identical processors that use branch prediction is observed. No appreciable difference in the prediction accuracy was observed, although block-level prediction predicted one out of four outcomes.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132844176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Designing a new encryption method for optimum parallel performance
Pub Date: 1995-04-19 | DOI: 10.1109/ICAPP.1995.472276
K. C. Posch, R. Posch
This paper describes the design process, from algorithm design to the chip level, for a parallel implementation of a modified version of the RSA encryption method. The final system consists of several dozen custom chips computing modular exponentiation based on residue number system coding. Emphasis is put on the hierarchical design view, its benefits and its shortcomings.
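The attraction of residue number system (RNS) coding is that a large operand is represented by its residues modulo several pairwise-coprime word-sized moduli, so the long multiplications inside a modular exponentiation split into independent channel operations that can run on separate chips. The sketch below (hypothetical Python with toy moduli; it omits the RSA-specific modular reduction that the hardware performs) shows the principle:

```python
from math import prod

MODULI = (251, 241, 239, 233)             # pairwise coprime, toy-sized
M = prod(MODULI)

def to_rns(x):
    """Represent x by its residues; each channel can live on its own chip."""
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    """Channel-wise multiplication: no carries between channels."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(residues):
    """Chinese Remainder Theorem reconstruction (needed only at the end)."""
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

def rns_pow(base, exponent):
    """Square-and-multiply where every multiply is a parallel RNS multiply."""
    result, acc = to_rns(1), to_rns(base)
    while exponent:
        if exponent & 1:
            result = rns_mul(result, acc)
        acc = rns_mul(acc, acc)
        exponent >>= 1
    return from_rns(result)

# Valid as long as the true result stays below M; interleaving a modular
# reduction into the RNS domain is exactly the hard part the chips solve.
assert rns_pow(7, 9) == 7 ** 9
print(rns_pow(7, 9))
```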
{"title":"Designing a new encryption method for optimum parallel performance","authors":"K. C. Posch, R. Posch","doi":"10.1109/ICAPP.1995.472276","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472276","url":null,"abstract":"This paper describes the design process from algorithm design to the chip level for a parallel implementation of a modified version of the RSA encryption method. The final system consists of several dozens of custom chips computing module exponentiation based on residue number system coding. Emphasis is put on the hierarchical design view, its benefits and ifs shortcomings.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133864644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Vectoring the N-body problem on the CM-5
Pub Date: 1995-04-19 | DOI: 10.1109/ICAPP.1995.472279
F. Wang, Young-il Choo
We develop an optimized program for the N-body problem on the CM-5 with vector units. The work aims to make full use of the vector pipelines provided by the CM-5's vector units to improve computational performance. Some development issues in using the vector units are discussed. The code is written in CDPEAC, an assembly-like language that can be called from C. Performance data and some analysis results are given.
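The computation being vectorized is the all-pairs force evaluation. A rough numpy sketch (hypothetical; it stands in for, and is not equivalent to, the CDPEAC vector code) shows the structure: the pairwise interactions become long, regular array operations, which is exactly the kind of work the CM-5 vector units stream through.

```python
import numpy as np

def accelerations(pos, mass, softening=1e-3):
    """Direct-sum gravitational accelerations for N bodies.

    The pairwise distance and force computations are expressed as whole-array
    operations; on the CM-5 the analogous inner loops are what the vector
    units execute.
    """
    # pos: (N, 3), mass: (N,)
    delta = pos[np.newaxis, :, :] - pos[:, np.newaxis, :]       # (N, N, 3)
    dist2 = np.sum(delta ** 2, axis=-1) + softening ** 2        # (N, N)
    inv_d3 = dist2 ** -1.5
    np.fill_diagonal(inv_d3, 0.0)                               # no self-force
    return np.einsum("ijk,ij,j->ik", delta, inv_d3, mass)       # (N, 3)

rng = np.random.default_rng(0)
pos = rng.standard_normal((256, 3))
mass = rng.uniform(0.5, 1.5, size=256)
acc = accelerations(pos, mass)
print(acc.shape)                                                # (256, 3)
```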
{"title":"Vectoring the N-body problem on the CM-5","authors":"F. Wang, Young-il Choo","doi":"10.1109/ICAPP.1995.472279","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472279","url":null,"abstract":"We develop an optimized program for the N-body problem on the CM-5 with vector units. The work is intended to make full use of the power of the vector pipelines provided by the CM-5 equipped with vector units to improve the computation performance. Some development issues using the vector units are discussed. The code is written in CDPEAC, an assembly-like language which can be called from C. Performance data and some analysis results are given.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"127 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113996850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Dynamic bandwidth allocation for VBR video sources in ATM based BISDN
Pub Date: 1995-04-19 | DOI: 10.1109/ICAPP.1995.472175
Young-Chon Kim, Pal-Jin Lee, D. Choi, Byung-Ok Kim, Sungwan Park, Young-sun Kim
With variable bit rate (VBR) video sources, adjacent slices in a frame are strongly correlated with each other; adjacent frames are likewise correlated (frame correlation). VBR video sources can be statistically characterized by the peak rate, average rate, and standard deviation of the rate of generated cells. By taking these correlative and statistical properties into account, VBR video sources can be transmitted more efficiently by estimating the required bandwidth. In this paper, we propose a scheme that dynamically predicts and allocates transmission bandwidth for VBR video sources in ATM-based B-ISDN. The performance of the proposed scheme is evaluated through simulations. Simulation results show that the proposed scheme is superior to conventional schemes in terms of bandwidth utilization and cell loss rate.
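A much-simplified version of such an allocation rule (hypothetical Python; the window-based estimator and safety factor are assumptions, not the paper's predictor): per renegotiation interval, estimate the next rate from a sliding window of recent cell counts and reserve the mean plus a margin proportional to the standard deviation, capped at the declared peak rate.

```python
from collections import deque
from statistics import mean, stdev

class BandwidthAllocator:
    """Window-based predictor: reserve mean + k * sigma of recent cell rates,
    never more than the declared peak rate."""

    def __init__(self, peak_rate, window=10, k=2.0):
        self.peak_rate = peak_rate
        self.window = deque(maxlen=window)
        self.k = k

    def observe(self, cells_in_interval):
        self.window.append(cells_in_interval)

    def allocate(self):
        if len(self.window) < 2:
            return self.peak_rate              # no history yet: be conservative
        estimate = mean(self.window) + self.k * stdev(self.window)
        return min(estimate, self.peak_rate)

# Frame-by-frame cell counts of a bursty VBR source (scene change at the end).
source = [300, 320, 310, 290, 305, 315, 300, 900]
allocator = BandwidthAllocator(peak_rate=1000)
for cells in source:
    reserved = allocator.allocate()
    allocator.observe(cells)
    print(f"observed {cells:4d} cells, had reserved {reserved:7.1f}")
```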
{"title":"Dynamic bandwidth allocation for VBR video sources in ATM based BISDN","authors":"Young-Chon Kim, Pal-Jin Lee, D. Choi, Byung-Ok Kim, Sungwan Park, Young-sun Kim","doi":"10.1109/ICAPP.1995.472175","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472175","url":null,"abstract":"With variable bit rate (VBR) video sources, adjacent slices in a frame are strongly correlated with each other. This is also the case for the frame represented by frame correlation. VBR video sources can be statistically characterized by peak rate, average rate, and standard deviation of the rate of generated cells. Taking account of each correlative and statistical properties, VBR video sources can be more efficiently transmitted by estimating the required bandwidth. In this paper, we propose a scheme that predicts and allocates dynamically transmission bandwidth for VBR video sources in ATM based BISDN. The performance of the proposed scheme is evaluated through simulations. Simulation results show that the proposed scheme is superior to the conventional ones in terms of bandwidth utilization and cell loss rate.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124094099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}