Recovery for Virtualized Environments
F. Cerveira, R. Barbosa, H. Madeira, Filipe Araújo
2015 11th European Dependable Computing Conference (EDCC), published 2015-09-01
DOI: 10.1109/EDCC.2015.26 (https://doi.org/10.1109/EDCC.2015.26)
Citation count: 12
Abstract
Cloud infrastructures provide elastic computing resources to client organizations, enabling them to build online applications while avoiding the fixed costs associated with a complete IT infrastructure. However, such organizations are unlikely to fully trust the cloud for their most critical applications. Among other threats, soft errors are expected to increase with the shrinking geometries of transistors, and many errors are left for the software layers to correct and mask. This paper characterizes the behavior of a virtualized environment, using Xen with CentOS as the hypervisor, in the presence of soft errors. One of the main threats arises from soft errors directly affecting the hypervisor, as these faults have the potential to disrupt several virtual machines at once. With this in mind, we develop a fault-tolerant architecture for cloud applications, which relies on experimental data collected using fault injection to guide its design. This architecture recovers from bit-flip errors by using a watchdog timer to securely reboot the hypervisor. Nevertheless, errors might still propagate outside the system, for example to a client in a client-server interaction. Despite this, our results suggest that our architecture and a few simple techniques, like timers on the client, can recover a very large fraction of errors in client-server applications with small hardware and performance overhead. Conversely, the fraction of errors requiring Byzantine fault-tolerant techniques is quite small, thus restricting those expensive approaches to highly critical applications.
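
To make two ideas from the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It shows the single bit-flip model commonly used to emulate soft errors, and a client-side timer that retries a request when no reply arrives in time (for instance while a hypervisor is rebooting). The names `flip_bit`, `call_with_timer`, and `flaky_server`, along with the timeout and retry values, are hypothetical and chosen only for illustration.

```python
import concurrent.futures
import time


def flip_bit(value: int, bit: int, width: int = 64) -> int:
    """Return `value` with one bit inverted (single bit-flip soft-error model)."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def call_with_timer(request, timeout_s: float, retries: int):
    """Hypothetical client-side timer: re-issue `request` if no reply arrives in time.

    `request` is any callable taking the attempt number; this stands in for a
    real network call, which the abstract does not describe in detail.
    """
    for attempt in range(retries + 1):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(request, attempt)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            continue  # timer expired: give up on this attempt and retry
        finally:
            pool.shutdown(wait=False)  # do not block on a hung attempt
    raise TimeoutError(f"no reply after {retries + 1} attempts")


def flaky_server(attempt: int) -> str:
    """Simulated server: the first request stalls, later ones answer promptly."""
    if attempt == 0:
        time.sleep(0.5)  # assumption: models an unresponsive, recovering host
    return f"reply to attempt {attempt}"


if __name__ == "__main__":
    print(flip_bit(0b1000, 3))
    print(call_with_timer(flaky_server, timeout_s=0.1, retries=2))
```

Retrying like this is only safe when requests are idempotent or deduplicated on the server; that is an assumption of the sketch rather than something the abstract states.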