{"title":"Application-transparent process-level error recovery for multicomputers","authors":"Y. Tamir, T. Frazier","doi":"10.1109/HICSS.1989.47170","DOIUrl":null,"url":null,"abstract":"An application-transparent, process-level, distributed error recovery scheme for multicomputers is proposed. Checkpointing is initiated by timers at intervals determined by the needs of the application. Checkpointing and recovery involve only as much of the system as is necessary: a set of interacting processes. Processes that are not part of the interacting set do not participate in checkpointing or recovery and continue to do useful work. Several checkpoint and/or recovery session may be active simultaneously. The scheme does not require significant overhead during normal operation, since it is not necessary to make message transmission atomic, acknowledge each message, or transmit checkbits with each packet. Variations of the technique using packet-switching or virtual circuits are discussed, and the scheme is compared to previously published techniques.<<ETX>>","PeriodicalId":300182,"journal":{"name":"[1989] Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences. Volume 1: Architecture Track","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"[1989] Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences. Volume 1: Architecture Track","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HICSS.1989.47170","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 18
Abstract
An application-transparent, process-level, distributed error recovery scheme for multicomputers is proposed. Checkpointing is initiated by timers at intervals determined by the needs of the application. Checkpointing and recovery involve only as much of the system as is necessary: a set of interacting processes. Processes that are not part of the interacting set do not participate in checkpointing or recovery and continue to do useful work. Several checkpoint and/or recovery session may be active simultaneously. The scheme does not require significant overhead during normal operation, since it is not necessary to make message transmission atomic, acknowledge each message, or transmit checkbits with each packet. Variations of the technique using packet-switching or virtual circuits are discussed, and the scheme is compared to previously published techniques.<>