This paper presents a solution for analyzing the performance of Grid scheduling algorithms for tasks with dependencies. Finding optimal procedures for DAG scheduling in Grid systems is important because of current computing needs: large-scale distributed computing and complex applications in different research areas. We propose evaluating DAG scheduling algorithms by simulation, an approach that makes it possible to compare different scheduling algorithms under various task dependencies and across a wide range of Grid system architectures. The proposed solution is based on MONARC, a generic simulation framework designed for modeling large-scale distributed systems. We present our results in extending the simulation platform to accommodate various DAG scheduling procedures and, as a case study, a critical analysis of four well-known DAG scheduling strategies: CCF (Cluster ready Children First), ETF (Earliest Time First), HLFET (Highest Level First with Estimated Times) and Hybrid Remapper. The results show that the proposed solution is an effective instrument for evaluating the performance of a wide range of DAG scheduling algorithms.
{"title":"Performance Analysis of Grid DAG Scheduling Algorithms using MONARC Simulation Tool","authors":"Florin Pop, C. Dobre, V. Cristea","doi":"10.1109/ISPDC.2008.15","DOIUrl":"https://doi.org/10.1109/ISPDC.2008.15","journal":{"name":"2008 International Symposium on Parallel and Distributed Computing"},"publicationDate":"2008-07-01","platform":"Semanticscholar","paperid":"130238315"}
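To make the HLFET strategy named in the abstract above more concrete, here is a minimal Python sketch of the textbook list-scheduling idea behind it: each task gets a static level (its own cost plus the largest static level among its successors), and ready tasks are placed, highest level first, on the processor that lets them start earliest. This is not the MONARC implementation. The function names, the example DAG, and the simplifying assumptions (identical processors, zero communication cost) are ours, for illustration only.

```python
from collections import defaultdict


def static_levels(cost, succ):
    """Static level of a task: its cost plus the largest static level of any successor."""
    memo = {}

    def level(task):
        if task not in memo:
            below = max((level(s) for s in succ.get(task, ())), default=0)
            memo[task] = cost[task] + below
        return memo[task]

    return {task: level(task) for task in cost}


def hlfet(cost, succ, n_procs):
    """List-schedule a DAG on n_procs identical processors (communication cost assumed 0)."""
    levels = static_levels(cost, succ)
    preds = defaultdict(list)
    for task, children in succ.items():
        for child in children:
            preds[child].append(task)

    finish = {}                  # task -> finish time
    proc_free = [0] * n_procs    # time at which each processor becomes idle
    schedule = []                # (task, processor, start, finish)
    remaining = set(cost)
    while remaining:
        # A task is ready once all of its predecessors have been scheduled.
        ready = [t for t in remaining if all(p in finish for p in preds[t])]
        task = max(ready, key=lambda t: (levels[t], t))  # highest level first
        data_ready = max((finish[p] for p in preds[task]), default=0)
        proc = min(range(n_procs), key=lambda p: max(proc_free[p], data_ready))
        start = max(proc_free[proc], data_ready)
        finish[task] = start + cost[task]
        proc_free[proc] = finish[task]
        schedule.append((task, proc, start, finish[task]))
        remaining.remove(task)
    return schedule, max(finish.values())


if __name__ == "__main__":
    # Hypothetical diamond-shaped DAG: a -> {b, c} -> d
    cost = {"a": 2, "b": 3, "c": 1, "d": 2}
    succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
    plan, makespan = hlfet(cost, succ, n_procs=2)
    for entry in plan:
        print(entry)
    print("makespan:", makespan)
```

A simulator-based evaluation such as the one the paper describes would additionally model heterogeneous resources and network transfers, which this sketch deliberately leaves out.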
Carlos Castañeda Marroquín, C. Navarrete, A. Ortega, M. Alfonseca, E. Anguiano
In recent years, the computing capacity of computers has increased, and the networks that interconnect them have improved to reach today's high data transfer rates. Programs that try to take advantage of these new technologies cannot be written using traditional programming techniques, since most algorithms were designed to be executed on a single processor, non-concurrently, rather than concurrently on a set of processors working and communicating through a network. This work presents the ongoing development of a new method to simulate the ferromagnetic Potts model that takes these new technologies into account.
{"title":"Parallel Metropolis-Montecarlo Simulation for Potts Model using an Adaptable Network Topology based on Dynamic Graph Partitioning","authors":"Carlos Castañeda Marroquín, C. Navarrete, A. Ortega, M. Alfonseca, E. Anguiano","doi":"10.1109/ISPDC.2008.51","DOIUrl":"https://doi.org/10.1109/ISPDC.2008.51","journal":{"name":"2008 International Symposium on Parallel and Distributed Computing"},"publicationDate":"2008-07-01","platform":"Semanticscholar","paperid":"130933118"}
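For readers unfamiliar with the underlying computation, the following is a minimal sequential sketch of a Metropolis sweep for the ferromagnetic q-state Potts model on a periodic square lattice, with energy H = -J * (number of equal nearest-neighbour pairs). It illustrates only the standard single-site update that a parallel scheme would distribute. It is not the authors' parallel method, and it does not attempt dynamic graph partitioning. The function names and parameters (`L`, `q`, `T`, `J`) are our own choices.

```python
import math
import random


def potts_energy(spins, J=1.0):
    """Total energy: -J for every pair of equal nearest neighbours (periodic lattice, L >= 3)."""
    L = len(spins)
    energy = 0.0
    for i in range(L):
        for j in range(L):
            s = spins[i][j]
            # Count each bond once by looking only at the lower and right neighbours.
            energy -= J * ((s == spins[(i + 1) % L][j]) + (s == spins[i][(j + 1) % L]))
    return energy


def metropolis_sweep(spins, q, T, rng, J=1.0):
    """Attempt L*L single-site updates using the Metropolis acceptance rule."""
    L = len(spins)
    for _ in range(L * L):
        i, j = rng.randrange(L), rng.randrange(L)
        old, new = spins[i][j], rng.randrange(q)
        if new == old:
            continue
        nbrs = [spins[(i + 1) % L][j], spins[(i - 1) % L][j],
                spins[i][(j + 1) % L], spins[i][(j - 1) % L]]
        # Energy change caused by replacing `old` with `new` at site (i, j).
        dE = -J * (nbrs.count(new) - nbrs.count(old))
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            spins[i][j] = new


if __name__ == "__main__":
    rng = random.Random(0)
    L, q = 8, 3
    lattice = [[rng.randrange(q) for _ in range(L)] for _ in range(L)]
    for _ in range(50):
        metropolis_sweep(lattice, q, T=0.5, rng=rng)
    print("energy after 50 sweeps:", potts_energy(lattice))
```

Because each update reads only the four neighbours of a site, lattice regions that do not share a boundary can in principle be updated concurrently, which is the property parallel implementations exploit.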
Computers execute different applications in different ways. The usual way to characterize an application's performance on a machine is to execute it in full. This work is a step towards a synthetic probe able to characterize a master-worker application's performance in a fraction of the time required to run it entirely. This is especially important for CPU-intensive scientific applications, which run for a very long time and should therefore run as efficiently (and as fast) as possible. Knowing how, and for how long, a master-worker application is going to run can guide the decision whether to use a given machine. Our software probe takes into account only the performance-relevant parts of the application, which it identifies by discovering the program's relevant phases. Running solely these significant phases is a powerful way to quickly characterize the application's performance on a machine. It can help to select the best computing nodes in a grid or in a multi-cluster to run the application, and even to quickly predict the total execution time for a given application and data set on the machine analyzed. We also present ongoing work on a fully synthetic probe generated from programs' phases.
{"title":"Software probes: towards a quick method for machine characterization and application performance prediction","authors":"A. Strube, Dolores Rexachs, E. Luque","doi":"10.1109/ISPDC.2008.40","DOIUrl":"https://doi.org/10.1109/ISPDC.2008.40","journal":{"name":"2008 International Symposium on Parallel and Distributed Computing"},"publicationDate":"2008-07-01","platform":"Semanticscholar","paperid":"128929439"}
In this work we present the runtime architecture of the OMPi OpenMP compiler. OMPi is a source-to-source C translator featuring a portable, modular and extensible runtime system. It allows OpenMP threads to be mapped to different execution entities, ranging from kernel-level or user-level threads to processes, and so provides transparent support for OpenMP applications on both SMP machines and clusters of SMPs. When operating within an SMP machine, arbitrary threading libraries can be employed; several such libraries are currently available, including one based on portable user-level threading for high-performance nested parallelism support. When operating on a cluster, processes are used as the execution entities and different software DSM cores can be utilized under a unified interface; the runtime system uses a hybrid approach whereby its internal bookkeeping is done through explicit message passing, while user-program shared variables are handled by the DSM core.
{"title":"A Runtime System Architecture for Ubiquitous Support of OpenMP","authors":"G. C. Philos, V. Dimakopoulos, P. Hadjidoukas","doi":"10.1109/ISPDC.2008.49","DOIUrl":"https://doi.org/10.1109/ISPDC.2008.49","journal":{"name":"2008 International Symposium on Parallel and Distributed Computing"},"publicationDate":"2008-07-01","platform":"Semanticscholar","paperid":"114238169"}
A. Gamatie, É. Rutten, Huafeng Yu, Pierre Boulet, J. Dekeyser
This paper presents an approach for the modeling and formal validation of high-performance systems. The approach relies on the repetitive model of computation used to express the parallelism of such systems within the Gaspard framework, which is dedicated to the codesign of high-performance system-on-chip. The system descriptions obtained with this model are then projected on the synchronous model of computation. The result of this projection consists of an equational model that allows one to formally analyze clock synchronizability issues so as to guarantee the reliable deployment of systems on platforms.
{"title":"Modeling and Formal Validation of High-Performance Embedded Systems","authors":"A. Gamatie, É. Rutten, Huafeng Yu, Pierre Boulet, J. Dekeyser","doi":"10.1109/ISPDC.2008.28","DOIUrl":"https://doi.org/10.1109/ISPDC.2008.28","journal":{"name":"2008 International Symposium on Parallel and Distributed Computing"},"publicationDate":"2008-07-01","platform":"Semanticscholar","paperid":"121850699"}
This paper presents a package, called Heterogeneous PBLAS (HeteroPBLAS), which is built on top of PBLAS and provides optimized parallel basic linear algebra subprograms for heterogeneous computational clusters. We present the user interface and the software hierarchy of the first research implementation of HeteroPBLAS. This is the first step towards the development of a parallel linear algebra package for heterogeneous computational clusters. We demonstrate the efficiency of the HeteroPBLAS programs on a homogeneous computing cluster and a heterogeneous computing cluster.
{"title":"Heterogeneous PBLAS: Optimization of PBLAS for Heterogeneous Computational Clusters","authors":"Ravi Reddy, Alexey L. Lastovetsky, P. Alonso","doi":"10.1109/ISPDC.2008.9","DOIUrl":"https://doi.org/10.1109/ISPDC.2008.9","journal":{"name":"2008 International Symposium on Parallel and Distributed Computing"},"publicationDate":"2008-07-01","platform":"Semanticscholar","paperid":"132942756"}
In this work we present an all-optical hypercube architecture and a systolic routing protocol for it. An r-dimensional optical hypercube network (OHC) consists of N = 2^r processing nodes and r·2^r optical links. We study a systolic routing protocol that is based on cyclic changes of router states and scheduled sending of packets. The protocol ensures that no electro-optical conversions are needed in the intermediate routing nodes and that all packets injected into the routing machinery reach their targets without collisions. Work-optimal routing of an h-relation is achieved with a reasonable size of h in omega(N log N).
{"title":"Scheduled Routing in an Optical Hypercube","authors":"Risto T. Honkanen","doi":"10.1109/ISPDC.2008.16","DOIUrl":"https://doi.org/10.1109/ISPDC.2008.16","journal":{"name":"2008 International Symposium on Parallel and Distributed Computing"},"publicationDate":"2008-07-01","platform":"Semanticscholar","paperid":"116117505"}
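The node and link counts quoted for the optical hypercube can be checked with a few lines of Python. The sketch below builds the standard r-dimensional hypercube, in which the neighbours of a node are obtained by flipping one bit of its r-bit address, and counts links per direction. Treating each direction as a separate link is our assumption, chosen because it gives the r·2^r figure. The dimension-order path function shows only plain hypercube routing, not the systolic scheduled protocol of the paper. All function names are hypothetical.

```python
def neighbors(node, r):
    """Neighbours of `node` in an r-dimensional hypercube: flip one address bit each."""
    return [node ^ (1 << d) for d in range(r)]


def directed_links(r):
    """All (source, target) pairs; each physical connection is counted once per direction."""
    return [(u, v) for u in range(1 << r) for v in neighbors(u, r)]


def dimension_order_path(src, dst, r):
    """Correct the differing address bits from dimension 0 upwards."""
    path, cur = [src], src
    for d in range(r):
        if ((cur ^ dst) >> d) & 1:
            cur ^= 1 << d
            path.append(cur)
    return path


if __name__ == "__main__":
    r = 3
    print("nodes:", 1 << r)                        # 2^r
    print("directed links:", len(directed_links(r)))  # r * 2^r
    print("path 0 -> 5:", dimension_order_path(0, 5, r))
```

The path length equals the Hamming distance between the two addresses, so no route needs more than r hops.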
We study the online file allocation problem on ring networks. In this paper, we present a 7-competitive randomized algorithm against an adaptive online adversary on uniform ring networks. The algorithm is deterministic if the file size is 1. Moreover, we obtain lower bounds of 4.25 and 3.833 for a deterministic algorithm and a randomized algorithm against an adaptive online adversary, respectively, on ring networks.
{"title":"Randomized Online File Allocation on Uniform Ring Networks","authors":"Akira Matsubayashi, Y. Kawamura","doi":"10.1109/ISPDC.2008.27","DOIUrl":"https://doi.org/10.1109/ISPDC.2008.27","journal":{"name":"2008 International Symposium on Parallel and Distributed Computing"},"publicationDate":"2008-07-01","platform":"Semanticscholar","paperid":"125099638"}
Multi-comparand associative processors are efficient in parallel processing of complex search problems that arise from many application areas including computational geometry, graph theory and list/matrix computations. In this paper we report new FPGA implementations of a multi-comparand multi-search associative processor. The architecture of the processor working in a combined bit-serial/bit-parallel word-parallel mode and its functions are described. Then, several implementations of associative processors in VHDL, using Xilinx Foundation ISE software and Digilent development boards with Xilinx FPGA devices are reported. Parameters of the implemented FPGA processors are presented and discussed.
{"title":"FPGA Implementations of a Parallel Associative Processor with Multi-Comparand Multi-Search Operations","authors":"Zbigniew Kokosinski, Bartlomiej Malus","doi":"10.1109/ISPDC.2008.42","DOIUrl":"https://doi.org/10.1109/ISPDC.2008.42","journal":{"name":"2008 International Symposium on Parallel and Distributed Computing"},"publicationDate":"2008-07-01","platform":"Semanticscholar","paperid":"126761063"}
K. Konwar, Peter M. Musial, Alexander A. Shvartsman
Quorum systems (collections of sets with pairwise nonempty intersections) are used in distributed settings to implement services such as consensus and consistent memory. Quorums have been substantially studied in static settings; however, the design and analysis of quorum-based distributed services in resource-limited ad hoc networks is a relatively unexplored area. The pioneering work of Chockler, Gilbert, and Patt-Shamir considers such networks and proposes an implementation of probabilistic quorum systems with per-node communication bit complexity of O(log^2 n), where n is the number of nodes. The authors assume a priori knowledge of the node failure probability p, where 0 ≤ p < 1/4. Additionally, their work overlooks the cost of gathering responses from quorum members by the client. We present a new probabilistic quorum construction with a lower communication bit complexity of O(log n) per quorum access for multi-hop networks. Our quorum access algorithm is based on self-sampling by the nodes themselves, in a way that is equivalent, with high probability, to accessing a quorum set. In addition, we provide a novel on-line algorithm to estimate the node failure probability parameter p, thus removing the assumption that it is known a priori. This is accomplished with per-node communication bit complexity of O(log^2 n). We demonstrate the utility of our construction by presenting a single-writer, multi-reader algorithm that uses our probabilistic quorums to implement atomic objects in ad hoc networks, where consistency is guaranteed with high probability. We include simulation results illustrating the high probability guarantee for our atomic memory service.
{"title":"Spontaneous, Self-Sampling Quorum Systems for Ad Hoc Networks","authors":"K. Konwar, Peter M. Musial, Alexander A. Shvartsman","doi":"10.1109/ISPDC.2008.61","DOIUrl":"https://doi.org/10.1109/ISPDC.2008.61","journal":{"name":"2008 International Symposium on Parallel and Distributed Computing"},"publicationDate":"2008-07-01","platform":"Semanticscholar","paperid":"126145177"}
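As background for the "with high probability" guarantee mentioned in the abstract above, the sketch below illustrates the generic idea of a probabilistic quorum system: if every quorum is a uniformly random set of k out of n nodes, two independently chosen quorums fail to intersect with probability C(n-k, k) / C(n, k), which is at most exp(-k^2 / n). With k proportional to sqrt(n) the miss probability is therefore a small constant that shrinks quickly as the proportionality factor grows. This is a textbook-style illustration under our own assumptions. It is not the self-sampling construction or the failure-probability estimator of the paper, and the function names are hypothetical.

```python
import math
import random


def miss_probability(n, k):
    """Exact probability that two independent uniform k-subsets of n nodes are disjoint."""
    return math.comb(n - k, k) / math.comb(n, k)


def sample_quorum(n, k, rng):
    """Pick a quorum as a uniformly random set of k node identifiers out of 0..n-1."""
    return set(rng.sample(range(n), k))


def empirical_miss_rate(n, k, trials, rng):
    """Fraction of trials in which two freshly sampled quorums do not intersect."""
    misses = sum(
        1 for _ in range(trials)
        if not (sample_quorum(n, k, rng) & sample_quorum(n, k, rng))
    )
    return misses / trials


if __name__ == "__main__":
    n = 400
    rng = random.Random(42)
    for factor in (1, 2, 3):
        k = factor * math.isqrt(n)  # quorum size proportional to sqrt(n)
        print(k, miss_probability(n, k), math.exp(-k * k / n),
              empirical_miss_rate(n, k, trials=2000, rng=rng))
```

The exact value and the exponential bound printed side by side show how loose the bound is for a given n. The empirical column is a sanity check only and varies with the seed.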