Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing

2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. Pub Date: 2009-05-18. DOI: 10.1109/CCGRID.2009.21

S. Fu


Citations: 36

Abstract

In large-scale clusters and computational grids, component failures are the norm rather than the exception. The occurrence of failures and their impact on system performance and operating costs have become an increasingly important concern for system designers and administrators. In this paper, we study how to utilize system resources efficiently in high-availability clusters with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for cluster computing. We propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage proactive failure management techniques to calculate each node's reliability status, and we consider both the performance and the reliability status of compute nodes when making selection decisions. We define a capacity-reliability metric that combines the effects of both factors in node selection, and we propose Best-fit algorithms to find the best-qualified nodes on which to instantiate VMs to run parallel jobs. We have conducted experiments on a real cluster using failure traces from production clusters and the NAS Parallel Benchmark programs. The results show that the proposed strategies enhance system productivity and dependability. With the Best-fit strategies, the job completion rate is increased by 17.6% compared with that achieved in the current LANL HPC cluster, and the task completion rate reaches 91.7%.
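
To make the idea of failure-aware Best-fit selection more concrete, the following is a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the formula for the capacity-reliability metric, so the weighted score below, the `Node` fields, the `alpha` weight, and the function names are all assumptions made for illustration. The sketch treats a node as feasible if its available capacity covers the per-VM demand, and it prefers nodes that fit the demand tightly and are unlikely to fail during the job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical node record; field names are illustrative assumptions."""
    name: str
    capacity: float     # available capacity, normalised to [0, 1]
    reliability: float  # estimated probability of surviving the job, in [0, 1]


def capacity_reliability(node: Node, demand: float, alpha: float = 0.5) -> float:
    """Assumed combined score (lower is better), NOT the paper's metric.

    It mixes the capacity left unused after placing the VM (best-fit term)
    with the estimated failure probability (reliability term).
    """
    surplus = node.capacity - demand
    failure_probability = 1.0 - node.reliability
    return alpha * surplus + (1.0 - alpha) * failure_probability


def select_best_fit(nodes, demand: float, k: int, alpha: float = 0.5):
    """Return the k feasible nodes with the lowest combined score."""
    feasible = [n for n in nodes if n.capacity >= demand]
    if len(feasible) < k:
        raise ValueError("not enough nodes can satisfy the demand")
    ranked = sorted(feasible, key=lambda n: capacity_reliability(n, demand, alpha))
    return ranked[:k]


# Made-up example data, for illustration only.
nodes = [
    Node("a", capacity=0.90, reliability=0.99),
    Node("b", capacity=0.50, reliability=0.95),
    Node("c", capacity=0.30, reliability=0.99),
    Node("d", capacity=0.55, reliability=0.60),
]
chosen = select_best_fit(nodes, demand=0.5, k=2)
print([n.name for n in chosen])
```

With `alpha=1.0` the score degenerates to a capacity-only best fit, which in this made-up data would pick the unreliable node `d`; giving weight to the reliability term steers the selection toward node `a` instead. This is the kind of trade-off a combined metric is meant to capture, under the assumptions stated above.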

