Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663756
Daniel Becker, R. Rabenseifner, F. Wolf
To support the development of efficient parallel codes on cluster systems, event tracing is a widely used technique with a broad spectrum of applications, ranging from performance analysis, performance prediction, and modeling to debugging. Usually, events are recorded along with the time of their occurrence to measure the temporal distance between them and/or to establish a total event ordering. Obviously, measuring the time between concurrent events requires a global clock, which, however, is often not available on clusters. Assuming that the potentially different drifts of local clocks remain constant over time, linear offset interpolation can be applied postmortem to map local timestamps onto global ones. In this study, we investigate the robustness of this assumption using different timers and show that the error of timestamps derived in this way can easily lead to a misrepresentation of the logical event order imposed by the semantics of the underlying communication substrate. We conclude that linear offset interpolation alone may be insufficient for many applications of event tracing, and we discuss further options.
Title: Implications of non-constant clock drifts for the timestamps of concurrent events
Venue: 2008 IEEE International Conference on Cluster Computing
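The linear offset interpolation the abstract refers to can be sketched as follows: two offset measurements against a designated master clock, one near trace start and one near trace end, define a line that maps every local timestamp to global time. This is a minimal illustration with made-up measurement values, not the authors' implementation.

```python
def linear_offset_map(t_local, t0_local, t0_global, t1_local, t1_global):
    """Map a local timestamp to global time, assuming the local clock's
    drift relative to the global clock is constant: interpolate linearly
    between two offset measurements (trace start and trace end)."""
    # Slope: how fast the local clock runs relative to the global clock.
    slope = (t1_global - t0_global) / (t1_local - t0_local)
    return t0_global + slope * (t_local - t0_local)

# Illustrative measurements: when the local clock read 0 s the master
# read 10 s, and when it read 100 s the master read 110 s (a constant
# +10 s offset, zero drift).
print(linear_offset_map(50.0, 0.0, 10.0, 100.0, 110.0))  # 60.0
```

If the drift is in fact non-constant, the real offset wanders away from this line between the two measurement points, which is exactly the error source the paper investigates.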
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663752
M. Kesavan, A. Ranadive, Ada Gavrilovska, K. Schwan
A key benefit of utility data centers and cloud computing infrastructure is the level of consolidation they can offer to arbitrary guest applications, and the substantial savings in operational costs and resources that can be derived in the process. However, significant challenges remain before it becomes possible to manage virtualized systems effectively and at low cost, particularly in the face of the increasing complexity of individual many-core platforms, and given the dynamic behaviors and resource requirements exhibited by cloud guest VMs. This paper describes the active coordination (ACT) approach, aimed at a specific issue in the management domain: management actions must (1) typically touch upon multiple resources in order to be effective, and (2) be continuously refined in order to deal with the dynamism in platform resource loads. ACT relies on the notion of class-of-service, associated with (sets of) guest VMs, based on which it maps VMs onto platform units, the latter encapsulating sets of platform resources of different types. Using these abstractions, ACT can perform active management in multiple ways, including a VM-specific approach and a black-box approach that relies on continuous monitoring of the guest VMs' runtime behavior and on an adaptive resource allocation algorithm, termed the Multiplicative Increase, Subtractive Decrease Algorithm with Wiggle Room. In addition, ACT permits explicit external events to trigger VM- or application-specific resource allocations, e.g., leveraging emerging standards such as WSDM. The experimental analysis of the ACT prototype, built for Xen-based platforms, uses industry-standard benchmarks, including RUBiS, Hadoop, and SPEC. The results demonstrate ACT's ability to efficiently manage the aggregate platform resources according to the guest VMs' relative importance (class-of-service), for both the black-box and the VM-specific approach.
Title: Active CoordinaTion (ACT) - toward effectively managing virtualized multicore clouds
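The abstract only names the "Multiplicative Increase, Subtractive Decrease with Wiggle Room" allocator without detailing it, so the following is a guessed sketch of what such a rule could look like: grow a VM's share multiplicatively when its usage presses against its allocation, shrink it by a small fixed step when there is unused slack. All parameter values and the threshold rule are illustrative assumptions, not ACT's actual tuning.

```python
def misd_step(alloc, used, cap, inc_factor=1.5, dec_step=5.0, wiggle=0.9):
    """One step of a hypothetical multiplicative-increase,
    subtractive-decrease allocator (units e.g. % of a CPU).
    All parameters are illustrative guesses, not ACT's values."""
    if used >= wiggle * alloc:
        # VM presses against its share: grow multiplicatively for a
        # fast ramp-up, capped by the platform unit's capacity.
        return min(cap, alloc * inc_factor)
    # VM has slack (its "wiggle room" is unused): reclaim gently with a
    # fixed subtractive step, never dropping below current usage.
    return max(used, alloc - dec_step)
```

The asymmetry (fast growth, slow reclamation) is the usual motivation for such schemes: bursty guests get resources quickly, while reclamation avoids oscillation.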
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663753
Paraskevas Yiapanis, D. Haglin, Anna M. Manning, K. Mayes, J. Keane
SUDA2 is a recursive search algorithm for minimal unique itemset detection. Such sets of items are formed via combinations of non-obvious attributes enabling individual record identification. The nature of SUDA2 allows work to be divided into non-overlapping tasks enabling parallel execution. Earlier work developed a parallel implementation for SUDA2 on an SMP cluster, and this was found to be several orders of magnitude faster than sequential SUDA2. However, if fixed-granularity parallel tasks are scheduled naively in the order of their generation, the system load tends to be imbalanced with little work at the beginning and end of the search. This paper investigates the effectiveness of variable-grained and dynamic work generation strategies for parallel SUDA2. These methods restrict the number of sub-tasks to be generated, based on the criterion of probable work size. The further we descend in the search recursion tree, the smaller the tasks become, thus we only select the largest tasks at each level of recursion as being suitable for scheduling. The revised algorithm runs approximately twice as fast as the existing parallel SUDA2 for finer levels of granularity when variable-grained work generation is applied. The dynamic method, performing level-wise task selection based on size, outperforms the other techniques investigated.
Title: Variable-grain and dynamic work generation for Minimal Unique Itemset mining
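The level-wise selection the abstract describes (keep only the largest tasks at each recursion level as schedulable units) can be sketched as below. The task representation and the fixed cutoff `k` are assumptions for illustration; the paper's criterion is "probable work size".

```python
def select_for_scheduling(tasks, k):
    """At one level of the search recursion, keep only the k largest
    sub-tasks (by estimated work) as independently scheduled parallel
    tasks; the smaller ones are not worth their scheduling overhead.
    Task dicts with an estimated "size" field are an assumed format."""
    ranked = sorted(tasks, key=lambda t: t["size"], reverse=True)
    return ranked[:k], ranked[k:]

big, small = select_for_scheduling(
    [{"id": 0, "size": 3}, {"id": 1, "size": 9},
     {"id": 2, "size": 1}, {"id": 3, "size": 7}], k=2)
```

Since tasks shrink as the recursion deepens, applying this filter per level naturally yields coarse tasks early (filling the initially idle system) and suppresses tiny tasks late in the search.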
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663799
H. Takizawa, Katsuto Sato, Hiroaki Kobayashi
A commodity personal computer (PC) can be seen as a hybrid computing system equipped with two different kinds of processors, i.e., a CPU and a graphics processing unit (GPU). Since the superiority of GPUs in performance and power efficiency strongly depends on the system configuration and on the data size determined at runtime, a programmer cannot always know which processor should be used to execute a certain kernel. Therefore, this paper presents a runtime environment that dynamically selects an appropriate processor so as to improve energy efficiency. The evaluation results clearly indicate that runtime processor selection when executing each kernel with given data streams is promising for energy-aware computing on a hybrid computing system.
Title: SPRAT: Runtime processor selection for energy-aware computing
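The kind of runtime decision the abstract describes can be illustrated with a toy energy model: pick whichever processor minimizes estimated energy for the given data size. The linear cost model and all numbers below are assumptions made for the sketch, not SPRAT's actual selection criterion.

```python
def pick_processor(n_elems, profiles):
    """Select the processor with the lowest estimated energy for one
    kernel launch. profiles: name -> (seconds_per_element, watts,
    fixed_overhead_seconds); a hypothetical linear model."""
    def energy(name):
        sec_per_elem, watts, overhead_sec = profiles[name]
        return (overhead_sec + sec_per_elem * n_elems) * watts
    return min(profiles, key=energy)

# Hypothetical profiles: the GPU is 10x faster per element but draws
# more power and pays a fixed launch/transfer overhead.
profiles = {"cpu": (1e-6, 30.0, 0.0), "gpu": (1e-7, 120.0, 0.01)}
```

With these numbers, small kernels stay on the CPU (the GPU's fixed overhead dominates) while large ones move to the GPU, which matches the abstract's point that the right choice depends on data size known only at runtime.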
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663755
S. Hunold, T. Rauber, F. Suter
Applications arising in many scientific fields exhibit both data and task parallelism that have to be exploited efficiently. A classic approach is to structure such applications as a task graph whose nodes represent parallel computations. Scheduling such mixed-parallel applications is challenging even on a single homogeneous platform, such as a cluster. Most mixed-parallel application scheduling algorithms rely on two decoupled steps: allocation and mapping. This separation can induce unnecessary or costly data redistributions that have an impact on overall performance, particularly for data-intensive applications. In this paper, we propose an original approach in which the allocations determined in the first step can be adapted during the second step in order to minimize the impact of these data redistributions. Two redistribution-aware mapping strategies are detailed, and their impact on schedule length is studied through a comparison with an efficient two-step algorithm over a broad range of experimental scenarios.
Title: Redistribution aware two-step scheduling for mixed-parallel applications
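The core idea (letting the mapping step override the allocation step when a data redistribution is not worth its cost) can be sketched with a toy decision rule. The cost models and the rule itself are illustrative assumptions; the paper's two strategies are more refined.

```python
def map_task(work, pred_procs, alloc_procs, runtime, redist_cost):
    """Mapping step that may adapt the allocation step's choice: if
    moving the task's input data from its predecessor's processor set
    to the allocated one costs more than the runtime it saves, keep
    the predecessor's set and skip the redistribution entirely.
    (Illustrative decision rule, not the paper's exact heuristic.)"""
    gain = runtime(work, pred_procs) - runtime(work, alloc_procs)
    if redist_cost(work, pred_procs, alloc_procs) >= gain:
        return pred_procs
    return alloc_procs

# Toy cost model: perfect speedup in the number of processors.
runtime = lambda work, procs: work / len(procs)
```

For example, doubling a task's processors may halve its runtime, but if its input then has to be scattered across the new processor set, the redistribution can eat the whole gain, which is when this rule keeps the old placement.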
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663789
Bibo Tu, Ming Zou, Jianfeng Zhan, Xiaofang Zhao, Jianping Fan
MPI collective operations on multi-core clusters should be multi-core aware. In this paper, collective algorithms use a hierarchical virtual topology to reflect the performance difference between communication levels on multi-core clusters, distinguishing simply between intra-node and inter-node communication; furthermore, selecting suitable segment sizes for intra-node collective communication caters to the cache hierarchy of multi-core processors. Building on the existing collective algorithms in MPICH2, these two techniques form a portable optimization methodology over MPICH2 for collective operations on multi-core clusters. Following this methodology, a multi-core aware broadcast algorithm has been implemented and evaluated as a case study. The performance evaluation shows that the multi-core aware optimization methodology over MPICH2 is efficient.
Title: Multi-core aware optimization for MPI collectives
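A hierarchical (multi-core aware) broadcast of the kind the abstract describes typically has two phases: an inter-node phase among one leader rank per node, then an intra-node phase from each leader to its node-local ranks. The sketch below models the data movement only (dictionaries stand in for MPI receive buffers) and assumes the root is its node's leader; it is not the paper's implementation.

```python
def hierarchical_bcast(nodes, root, data):
    """Two-level broadcast sketch: root -> per-node leaders over the
    network, then leader -> local ranks via shared memory.
    nodes: list of per-node rank lists, leader rank first; the root
    is assumed to be a leader for simplicity."""
    buf = {root: data}                  # rank -> receive buffer
    for ranks in nodes:                 # inter-node phase (network)
        buf[ranks[0]] = buf[root]
    for ranks in nodes:                 # intra-node phase (shared memory)
        for r in ranks[1:]:
            buf[r] = buf[ranks[0]]
    return buf
```

The point of the hierarchy is that only the first phase crosses the (slow) network, while the fan-out to the remaining cores stays inside each node, where different segment sizes can be chosen to fit the cache hierarchy.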
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663782
Takafumi Watanabe, M. Nakao, T. Hiroyasu, Tomohiro Otsuka, M. Koibuchi
In addition to its use in local area networks, Ethernet has been used for connecting hosts in the area of high-performance computing. Here, we investigated the impact of topology and link aggregation on a large-scale PC cluster with Ethernet. An Ethernet topology that allows loops, together with its routing, can be implemented with the VLAN routing method without creating broadcast storms. To simplify the system configuration without modifying system software, in our implementation of topologies the VLAN tag is added to a frame at the switches. Each host creates VLAN interfaces that have different local network addresses on a physical interface, so that a switch learns the MAC addresses of the hosts in a PC cluster by broadcast. Evaluation results showed that the performance characteristics of an eight-switch network are comparable to those of an ideal one-switch (full crossbar) network when executing the High-Performance LINPACK benchmark (HPL) on a 225-host PC cluster. On the other hand, evaluation results using the NAS parallel benchmarks indicated that topologies achieved by the proposed methodology yielded performance improvements of up to about 650% compared to a simple tree topology. These results indicate that topology and link aggregation have a marked impact, and that commodity switches can be used instead of expensive, high-functionality switches.
Title: Impact of topology and link aggregation on a PC cluster with Ethernet
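Why link aggregation matters in a tree can be made concrete with an oversubscription calculation: the ratio of the hosts' aggregate demand on one switch to that switch's uplink capacity. The port counts and bandwidths below are hypothetical, not the paper's cluster configuration.

```python
def oversubscription(hosts, host_gbps, uplinks, link_gbps):
    """Worst-case host demand divided by uplink capacity on one edge
    switch. 1.0 means the (possibly aggregated) uplinks can carry full
    inter-switch traffic; larger values mean the uplink is a bottleneck,
    as in a simple tree with a single uplink per switch."""
    return (hosts * host_gbps) / (uplinks * link_gbps)

# Hypothetical edge switch: 28 GbE hosts, one uplink vs. 4 aggregated.
tree = oversubscription(28, 1.0, 1, 1.0)        # 28:1, badly bottlenecked
trunked = oversubscription(28, 1.0, 4, 1.0)     # 7:1, four links trunked
```

Combining aggregation with a loop-allowing topology (so several switches are reachable over parallel paths) pushes this ratio further toward 1.0, which is consistent with the large NAS benchmark gains the paper reports over a plain tree.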
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663808
Liqiang Cao, Hongbing Luo, Baoyin Zhang
This paper proposes a new multi-pattern parallel I/O benchmark called Jetter, which evaluates parallel I/O throughput with either the contiguous or the non-contiguous I/O pattern, in either the share-one-file or the file-per-process model, through either the POSIX interface or the MPI-I/O interface. Jetter helps end users understand how the I/O pattern governs performance, and helps them develop efficient applications on a given platform. We have evaluated parallel I/O bandwidth on a 32-CPU shared-memory computer with Jetter. The results show that the I/O pattern determines throughput: optimizing the I/O model, interface, etc. within a pattern can improve bandwidth by a factor of 2 or 3.
Title: Jetter: a multi-pattern parallel I/O benchmark
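The two I/O models Jetter distinguishes can be sketched in a few lines: in the file-per-process model every rank owns its own file, while in the share-one-file model every rank writes a contiguous region at its own offset within a single file. This single-process sketch (with a serial loop standing in for parallel ranks) only illustrates the access patterns, not Jetter itself.

```python
import os
import tempfile

BLOCK = b"x" * 4096          # one illustrative contiguous write per "rank"

def file_per_process(dirname, nprocs):
    """File-per-process model: rank i writes its own file out.i."""
    for rank in range(nprocs):
        with open(os.path.join(dirname, f"out.{rank}"), "wb") as f:
            f.write(BLOCK)

def share_one_file(path, nprocs):
    """Share-one-file model: rank i writes a contiguous region at
    offset i * len(BLOCK) of one preallocated file."""
    with open(path, "wb") as f:
        f.truncate(nprocs * len(BLOCK))
        for rank in range(nprocs):
            f.seek(rank * len(BLOCK))
            f.write(BLOCK)

tmp = tempfile.mkdtemp()
file_per_process(tmp, 4)
share_one_file(os.path.join(tmp, "shared"), 4)
```

On a real parallel file system the two models stress very different code paths (per-file metadata creation versus shared-file write coordination), which is why a benchmark that varies the model, pattern, and interface independently is useful.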
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663809
Paulo Afonso Lopes, P. Medeiros
We present part of our recent work on performance enhancement of cluster file systems using shared disks over a SAN. This work is built around the proposal of pCFS, a file system specifically targeting those environments. In earlier work we presented the objectives and design principles of pCFS and a proof-of-concept implementation, carried out by modifying Red Hat's GFS, showing significant improvements in operations over files shared among processes running on different nodes. pCFS differs from GFS in two main aspects: its use of cooperative caching and a finer grain of locking. The first aspect, which used the LAN to enhance performance in write-sharing situations, was described elsewhere; we now introduce a complementary strategy, locking file regions instead of the whole file, which enables us to use the SAN while delivering a high level of performance in those same write-sharing situations. pCFS may apply inter-node locks to regions, allowing processes to operate in parallel with a minimum of coherency overhead among nodes; a process cannot access outside its region(s), and when a writer unlocks a region, others can then lock it and immediately see the modified data. Through a set of experiments in which a file is shared between processes running on different nodes, we show that the described approach yields a gain of at least an order of magnitude over plain GFS.
Title: Enhancing write performance of a shared-disk cluster filesystem through a fine-grained locking strategy
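The granularity idea (lock a byte range, not the whole file) is the same one POSIX exposes through `fcntl` record locks, which the sketch below uses for illustration. Note the limits of the analogy: POSIX locks shown here coordinate processes on one node, whereas pCFS enforces region locks across cluster nodes over the SAN.

```python
import fcntl
import os
import tempfile

def lock_region(fd, start, length, exclusive=True):
    """Take a POSIX byte-range lock on [start, start+length), leaving
    the rest of the file free for other writers; analogous in
    granularity to pCFS's region locks (single-node only here)."""
    cmd = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    fcntl.lockf(fd, cmd, length, start, os.SEEK_SET)

def unlock_region(fd, start, length):
    """Release the byte-range lock so another writer may take it."""
    fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)

fd = os.open(os.path.join(tempfile.mkdtemp(), "data"),
             os.O_CREAT | os.O_RDWR, 0o600)
lock_region(fd, 0, 4096)          # writer A claims the first region
os.pwrite(fd, b"region-0", 0)     # ...writes only inside it...
unlock_region(fd, 0, 4096)        # ...then releases it for others
```

With whole-file locks, two writers to disjoint regions serialize; with region locks, they proceed in parallel, which is the source of the order-of-magnitude gain the abstract reports.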
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663759
Tal Maoz, A. Barak, Lior Amar
The renewed interest in virtualization gives rise to new opportunities for running high-performance computing (HPC) applications on clusters and grids. These include the ability to create a uniform (virtual) run-time environment on top of a multitude of hardware and software platforms, and the possibility of dynamic resource allocation towards the improvement of process performance, e.g., by virtual machine (VM) migration as a means of load balancing. This paper deals with issues related to running HPC applications on multi-clusters and grids using VMware, a virtualization package running on Windows, Linux, and OS X. The paper presents the "Jobrun" system for transparent, on-demand VM launching upon job submission, and its integration with the MOSIX cluster and grid management system. We present a novel approach to job migration, combining VM migration with process migration using Jobrun, by which it is possible to migrate groups of processes and parallel jobs among different clusters in a multi-cluster or in a grid. We use four real HPC applications to evaluate the overheads of VMware (both on Linux and Windows), of the MOSIX cluster extensions, and of their combination, and present detailed measurements of the performance of Jobrun.
Title: Combining Virtual Machine migration with process migration for HPC on multi-clusters and Grids