Title: Rx: Treating bugs as allergies - a safe method to survive software failures
Authors: Feng Qin, Joseph A. Tucek, Yuanyuan Zhou, Jagadeesan Sundaresan
Journal: ACM Transactions on Computer Systems
Published: 2007-08-01
DOI: https://doi.org/10.1145/1275517.1275519
Citations: 95
Abstract
Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time.
This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and nondeterministic. Our idea, inspired by allergy treatment in real life, is to roll back the program to a recent checkpoint upon a software failure, and then to reexecute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the “allergen” from the environment. Rx requires few or no modifications to applications and provides programmers with additional feedback for bug diagnosis.
We have implemented Rx on Linux. Our experiments with five server applications that contain seven bugs of various types show that Rx can survive six out of seven software failures and provide transparent fast recovery within 0.017–0.16 seconds, 21–53 times faster than the whole-program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole-program restart approach and a simple rollback and reexecution without environmental changes, cannot successfully recover the four servers (Squid, Apache, CVS, and ypserv) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a nondeterministic concurrency bug. Additionally, Rx's checkpointing system is lightweight, imposing small time and space overheads.
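The rollback-and-reexecute idea described in the abstract can be illustrated with a small toy sketch. This is not the Rx implementation, which checkpoints real server processes on Linux. The request handler, the `SoftwareFailure` exception, the `alloc_padding` setting, and the `ENV_CHANGES` list are all hypothetical names chosen for illustration, and the abstract does not enumerate the specific environment changes Rx applies. The sketch only shows the control flow: checkpoint, detect a failure, roll back, and retry under progressively different environments.

```python
import copy


class SoftwareFailure(Exception):
    """Stands in for a failure caught at run time (e.g. a crash or assertion)."""


def handle_request(state, request, env):
    """Toy handler with a deterministic bug: requests longer than the buffer fail.

    The hypothetical "alloc_padding" setting enlarges the buffer, the kind of
    environmental change that can mask an overflow.
    """
    state["started"] += 1  # partial update that a rollback must undo
    capacity = 8 + env.get("alloc_padding", 0)
    if len(request) > capacity:
        raise SoftwareFailure(f"{len(request)} bytes into a {capacity}-byte buffer")
    state["handled"].append(request)


# Hypothetical environment changes, tried in order after a failure.
# The first entry is a plain reexecution with no change at all.
ENV_CHANGES = [{}, {"alloc_padding": 8}, {"alloc_padding": 64}]


def run_with_recovery(state, requests):
    """Process requests; on failure, roll back and reexecute in a changed environment.

    Returns (final_state, report). The report maps each failed request to the
    environment change that let it succeed, or to None if none did.
    """
    report = {}
    for request in requests:
        checkpoint = copy.deepcopy(state)  # checkpoint before each request
        try:
            handle_request(state, request, env={})
            continue  # normal execution succeeded
        except SoftwareFailure:
            pass
        report[request] = None
        for env in ENV_CHANGES:
            state = copy.deepcopy(checkpoint)  # roll back to the checkpoint
            try:
                handle_request(state, request, env)
                report[request] = env  # feedback for later bug diagnosis
                break
            except SoftwareFailure:
                continue
        else:
            state = copy.deepcopy(checkpoint)  # nothing worked: skip the request
    return state, report


if __name__ == "__main__":
    final_state, report = run_with_recovery(
        {"started": 0, "handled": []},
        ["short", "twelve_bytes", "x" * 100],
    )
    print(final_state)
    print(report)
```

In this toy, a plain reexecution of the 12-byte request fails again because the bug is deterministic, while the retry with padded allocation succeeds. The recorded environment change hints at the kind of bug involved, loosely mirroring the diagnostic feedback the abstract mentions. The 100-byte request fails under every listed change and is skipped after a final rollback.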
About the journal:
ACM Transactions on Computer Systems (TOCS) presents research and development results on the design, implementation, analysis, evaluation, and use of computer systems and systems software. The term "computer systems" is interpreted broadly and includes operating systems, systems architecture and hardware, distributed systems, optimizing compilers, and the interaction between systems and computer networks. Articles appearing in TOCS will tend either to present new techniques and concepts, or to report on experiences and experiments with actual systems. Insights useful to system designers, builders, and users will be emphasized.
TOCS publishes research and technical papers, both short and long. It includes technical correspondence to permit commentary on technical topics and on previously published papers.