Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing

S. Fu
DOI: 10.1109/CCGRID.2009.21
URL: https://doi.org/10.1109/CCGRID.2009.21
Published in: 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Publication date: 2009-05-18
Citations: 36

Abstract

In large-scale clusters and computational grids, component failures are the norm rather than the exception. Failure occurrence, together with its impact on system performance and operating costs, has become an increasingly important concern for system designers and administrators. In this paper, we study how to use system resources efficiently in high-availability clusters with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for cluster computing. We propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage proactive failure management techniques to calculate nodes' reliability status, and we consider both the performance and the reliability status of compute nodes when making selection decisions. We define a capacity-reliability metric that combines the effects of both factors in node selection, and we propose Best-fit algorithms to find the best-qualified nodes on which to instantiate VMs to run parallel jobs. We have conducted experiments using failure traces from production clusters and the NAS Parallel Benchmark programs on a real cluster. The results show that the proposed strategies enhance system productivity and dependability. With the Best-fit strategies, the job completion rate increases by 17.6% compared with that achieved in the current LANL HPC cluster, and the task completion rate reaches 91.7%.
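
The abstract does not give the formula for the capacity-reliability metric or the details of the Best-fit algorithms, so the following is only a minimal sketch of how failure-aware best-fit node selection might look. Everything in it is an assumption made for illustration: the `Node` fields, the `cr_score` formula (a weighted sum of relative capacity slack and relative reliability slack), the weight `alpha`, and the feasibility rule that a node's predicted time to failure must cover the job's duration. It is not the paper's definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    capacity: float         # available processing capacity (arbitrary units)
    time_to_failure: float  # predicted hours until next failure (assumed predictor output)


def cr_score(node: Node, demand: float, duration: float, alpha: float = 0.5) -> float:
    """Hypothetical capacity-reliability score; smaller means a tighter fit.

    Assumes the node is feasible (capacity >= demand > 0 and
    time_to_failure >= duration > 0), so neither divisor is zero.
    """
    capacity_slack = (node.capacity - demand) / node.capacity
    reliability_slack = (node.time_to_failure - duration) / node.time_to_failure
    return alpha * capacity_slack + (1.0 - alpha) * reliability_slack


def best_fit(nodes: List[Node], k: int, demand: float, duration: float,
             alpha: float = 0.5) -> Optional[List[Node]]:
    """Pick the k feasible nodes with the least combined slack, or None."""
    if demand <= 0 or duration <= 0:
        raise ValueError("demand and duration must be positive")
    # Keep only nodes that can host the VM and are predicted to outlive the job.
    feasible = [n for n in nodes
                if n.capacity >= demand and n.time_to_failure >= duration]
    if len(feasible) < k:
        return None  # not enough qualified nodes to build the virtual machine set
    return sorted(feasible, key=lambda n: cr_score(n, demand, duration, alpha))[:k]


if __name__ == "__main__":
    cluster = [
        Node("a", capacity=10, time_to_failure=100),
        Node("b", capacity=4, time_to_failure=12),
        Node("c", capacity=5, time_to_failure=5),
        Node("d", capacity=8, time_to_failure=20),
    ]
    chosen = best_fit(cluster, k=2, demand=4, duration=10)
    print([n.name for n in chosen] if chosen else "no feasible selection")
```

Choosing the tightest fit leaves larger and longer-lived nodes free for more demanding jobs, which is the usual motivation for best-fit placement. Whether the paper weighs the two factors this way is not stated in the abstract.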