VPC: Scalable, Low Downtime Checkpointing for Virtual Clusters

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing Pub Date : 2012-10-24 DOI:10.1109/SBAC-PAD.2012.31

Peng Lu, B. Ravindran, Changsoo Kim

{"title":"VPC: Scalable, Low Downtime Checkpointing for Virtual Clusters","authors":"Peng Lu, B. Ravindran, Changsoo Kim","doi":"10.1109/SBAC-PAD.2012.31","DOIUrl":null,"url":null,"abstract":"A virtual cluster (VC) consists of multiple virtual machines (VMs) running on different physical hosts, inter-connected by a virtual network. A fault-tolerant protocol and mechanism are essential to the VC's availability and usability. We present Virtual Predict Check pointing (or VPC), a lightweight, globally consistent check pointing mechanism, which checkpoints the VC for immediate restoration after VM failures. By predicting the checkpoint-caused page faults during each check pointing interval, VPC further reduces the solo VM downtime than traditional incremental check pointing approaches. Besides, VPC uses a globally consistent check-pointing algorithm, which preserves the global consistency of the VMs' execution and communication states, and only saves the updated memory pages during each check pointing interval to reduce the entire VC downtime. Our implementation reveals that, compared with past VC check pointing/migration solutions including VNsnap, VPC reduces the solo VM downtime by as much as 45%, under the NPB benchmark, and reduces the entire VC downtime by as much as 50%, under the NPB distributed program. Additionally, VPC incurs a memory overhead of no more than 9%. In all cases, VPC's performance overhead is less than 16%.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SBAC-PAD.2012.31","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 9

Abstract

A virtual cluster (VC) consists of multiple virtual machines (VMs) running on different physical hosts, inter-connected by a virtual network. A fault-tolerant protocol and mechanism are essential to the VC's availability and usability. We present Virtual Predict Check pointing (or VPC), a lightweight, globally consistent check pointing mechanism, which checkpoints the VC for immediate restoration after VM failures. By predicting the checkpoint-caused page faults during each check pointing interval, VPC further reduces the solo VM downtime than traditional incremental check pointing approaches. Besides, VPC uses a globally consistent check-pointing algorithm, which preserves the global consistency of the VMs' execution and communication states, and only saves the updated memory pages during each check pointing interval to reduce the entire VC downtime. Our implementation reveals that, compared with past VC check pointing/migration solutions including VNsnap, VPC reduces the solo VM downtime by as much as 45%, under the NPB benchmark, and reduces the entire VC downtime by as much as 50%, under the NPB distributed program. Additionally, VPC incurs a memory overhead of no more than 9%. In all cases, VPC's performance overhead is less than 16%.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

VPC:可扩展，低停机检查点的虚拟集群

虚拟集群(VC)由运行在不同物理主机上的多个虚拟机组成，这些虚拟机通过虚拟网络相互连接。容错协议和机制对VC的可用性和可用性至关重要。我们提出了虚拟预测检查指向(或VPC)，这是一种轻量级的，全局一致的检查指向机制，它在VM故障后检查点VC以便立即恢复。通过预测每个检查点间隔内检查点导致的页面错误，VPC比传统的增量检查点方法进一步减少单个虚拟机的停机时间。此外，VPC使用全局一致的检查点算法，保持虚拟机执行状态和通信状态的全局一致性，在每个检查点周期内只保存更新的内存页面，减少整个VC的停机时间。我们的实现表明，与过去的VC检查点/迁移解决方案(包括VNsnap)相比，VPC在NPB基准下将单个VM停机时间减少了45%，在NPB分布式程序下将整个VC停机时间减少了50%。另外，VPC的内存开销不超过9%。在所有情况下，VPC的性能开销都小于16%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing

自引率

0.00%

发文量

期刊最新文献

Using Heterogeneous Networks to Improve Energy Efficiency in Direct Coherence Protocols for Many-Core CMPs Cloud Workload Analysis with SWAT Energy-Performance Tradeoffs in Software Transactional Memory CSHARP: Coherence and SHaring Aware Cache Replacement Policies for Parallel Applications Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs