An on-line algorithm for checkpoint placement
A. Ziv, Jehoshua Bruck
Proceedings of ISSRE '96: 7th International Symposium on Software Reliability Engineering, 1996-10-30
DOI: 10.1109/ISSRE.1996.558869 (https://doi.org/10.1109/ISSRE.1996.558869)
Citations: 107
Abstract
Checkpointing is a common technique for reducing the time needed to recover from faults in computer systems. By saving intermediate program states to a reliable storage device, checkpointing reduces the processing time lost to faults. The length of the intervals between checkpoints affects program execution time: long intervals lead to long re-processing times after a fault, while too-frequent checkpointing leads to high checkpointing overhead. In this paper, we present an online algorithm for the placement of checkpoints. When deciding whether to place a checkpoint, the algorithm uses online knowledge of the current cost of a checkpoint. We show how the execution time of a program using this algorithm can be analyzed. The total execution-time overhead under the proposed algorithm is smaller than the overhead under fixed checkpoint intervals. Although the proposed algorithm uses only online knowledge of the checkpointing cost, its behavior is close to that of the optimal offline algorithm, which has complete knowledge of the checkpointing cost.
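The trade-off between re-processing time and checkpointing overhead can be illustrated with a small sketch. It is not the algorithm from the paper, whose decision rule the abstract does not state. It assumes Poisson faults with rate `fault_rate`, zero recovery delay, and rollback to the most recent checkpoint; under those assumptions the expected time to complete an interval of length x is (e^(λx) - 1)/λ. The function names are hypothetical, and `should_checkpoint` is a stand-in rule built on Young's first-order approximation sqrt(2c/λ), evaluated with the checkpoint cost observed at decision time.

```python
import math


def expected_interval_time(work, cost, fault_rate):
    """Expected time to finish `work` units plus a checkpoint of length `cost`.

    Assumption (not from the paper): Poisson faults, and a fault anywhere in
    the interval restarts it from the previous checkpoint with no extra delay.
    """
    return math.expm1(fault_rate * (work + cost)) / fault_rate


def overhead_ratio(interval, cost, fault_rate):
    """Expected extra time per unit of useful work for a fixed interval."""
    return expected_interval_time(interval, cost, fault_rate) / interval - 1.0


def best_fixed_interval(cost, fault_rate, candidates):
    """Pick the candidate interval with the lowest overhead ratio."""
    return min(candidates, key=lambda t: overhead_ratio(t, cost, fault_rate))


def should_checkpoint(work_since_last, current_cost, fault_rate):
    """Hypothetical online rule, NOT the paper's: checkpoint once the work
    accumulated since the last checkpoint reaches Young's approximation
    sqrt(2 * c / lambda) for the cost observed right now."""
    return work_since_last >= math.sqrt(2.0 * current_cost / fault_rate)


# Illustrative values only: one fault per 1000 time units on average.
FAULT_RATE = 0.001

# Very short and very long intervals both cost more than a moderate one.
for interval in (1, 62, 2000):
    print(interval, round(overhead_ratio(interval, 2.0, FAULT_RATE), 3))

# After 50 units of work, a cheap checkpoint is taken, an expensive one is deferred.
print(should_checkpoint(50.0, 0.5, FAULT_RATE))  # True
print(should_checkpoint(50.0, 8.0, FAULT_RATE))  # False
```

The threshold rule shows only how a decision can depend on the current checkpoint cost: when the cost is momentarily low, the threshold drops and a checkpoint is placed sooner. Whether such a rule approaches the offline optimum is exactly what the paper analyzes, and this sketch makes no claim about it.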