A Transparent Hypervisor-level Checkpoint-Restart Mechanism for a Cluster of Virtual Machines
Chayawat Pechwises, K. Chanchio
2018 15th International Joint Conference on Computer Science and Software Engineering (JCSSE), published 2018-07-01
DOI: 10.1109/JCSSE.2018.8457176 (https://doi.org/10.1109/JCSSE.2018.8457176)
Citations: 4
Abstract
A cluster of virtual machines is a common platform for running MPI applications in cloud computing environments. However, most traditional methods for providing fault tolerance to these applications are not fully transparent and require specific, checkpointing-enabled MPI software. This paper presents a novel transparent hypervisor-level checkpoint-restart mechanism, the Virtual Cluster Checkpoint-Restart (VCCR), which performs checkpoint and restart operations at the hypervisor level. VCCR is highly transparent to MPI applications and the guest OS. VCCR uses a software framework consisting of a controller and agent processes to perform checkpoint and restart operations for the entire cluster. Its checkpoint and restart protocols are based on the principles of barrier synchronization and virtual time to maintain global consistency and efficiency. We have developed a prototype of VCCR on top of the QEMU-KVM software and conducted two preliminary experiments using the NAS Parallel Benchmarks. Experimental results confirm that VCCR can correctly and efficiently checkpoint and restart a cluster of virtual machines.
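The abstract does not spell out the VCCR protocol, so the following is only an illustrative sketch of the general idea of barrier-synchronized checkpointing, not the authors' implementation. It assumes one agent per virtual machine and two barriers: no agent saves state until every VM is paused, and no agent resumes until every VM is saved. The names `Agent` and `checkpoint_cluster` are hypothetical, the pause, save and resume steps are simulated by log entries rather than real hypervisor calls, and virtual time is not modeled.

```python
import threading


class Agent(threading.Thread):
    """Hypothetical per-VM agent; real hypervisor calls are replaced by log entries."""

    def __init__(self, vm_id, barrier, log, lock):
        super().__init__()
        self.vm_id = vm_id
        self.barrier = barrier
        self.log = log
        self.lock = lock

    def record(self, event):
        # Append an event to the shared log under a lock.
        with self.lock:
            self.log.append((event, self.vm_id))

    def run(self):
        self.record("paused")    # stand-in for pausing the VM
        self.barrier.wait()      # barrier 1: every VM is paused before any is saved
        self.record("saved")     # stand-in for writing the VM state to storage
        self.barrier.wait()      # barrier 2: every VM is saved before any resumes
        self.record("resumed")   # stand-in for resuming the VM


def checkpoint_cluster(num_vms):
    """Hypothetical controller: starts one agent per VM and returns the event log."""
    barrier = threading.Barrier(num_vms)
    log, lock = [], threading.Lock()
    agents = [Agent(vm_id, barrier, log, lock) for vm_id in range(num_vms)]
    for agent in agents:
        agent.start()
    for agent in agents:
        agent.join()
    return log


if __name__ == "__main__":
    for event, vm_id in checkpoint_cluster(4):
        print(f"VM {vm_id}: {event}")
```

Because of the two barriers, the log always lists every "paused" event before any "saved" event, and every "saved" event before any "resumed" event, although the order of VMs within each phase can vary between runs. That phase ordering is the property a globally consistent checkpoint relies on: no VM runs, and so no message is in flight between running guests, while any VM's state is being captured.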