Application-transparent process-level error recovery for multicomputers

Y. Tamir, T. Frazier
{"title":"Application-transparent process-level error recovery for multicomputers","authors":"Y. Tamir, T. Frazier","doi":"10.1109/HICSS.1989.47170","DOIUrl":null,"url":null,"abstract":"An application-transparent, process-level, distributed error recovery scheme for multicomputers is proposed. Checkpointing is initiated by timers at intervals determined by the needs of the application. Checkpointing and recovery involve only as much of the system as is necessary: a set of interacting processes. Processes that are not part of the interacting set do not participate in checkpointing or recovery and continue to do useful work. Several checkpoint and/or recovery session may be active simultaneously. The scheme does not require significant overhead during normal operation, since it is not necessary to make message transmission atomic, acknowledge each message, or transmit checkbits with each packet. Variations of the technique using packet-switching or virtual circuits are discussed, and the scheme is compared to previously published techniques.<<ETX>>","PeriodicalId":300182,"journal":{"name":"[1989] Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences. Volume 1: Architecture Track","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"[1989] Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences. Volume 1: Architecture Track","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HICSS.1989.47170","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 18

Abstract

An application-transparent, process-level, distributed error recovery scheme for multicomputers is proposed. Checkpointing is initiated by timers at intervals determined by the needs of the application. Checkpointing and recovery involve only as much of the system as is necessary: a set of interacting processes. Processes that are not part of the interacting set do not participate in checkpointing or recovery and continue to do useful work. Several checkpoint and/or recovery session may be active simultaneously. The scheme does not require significant overhead during normal operation, since it is not necessary to make message transmission atomic, acknowledge each message, or transmit checkbits with each packet. Variations of the technique using packet-switching or virtual circuits are discussed, and the scheme is compared to previously published techniques.<>
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
多台计算机的应用程序透明进程级错误恢复
提出了一种应用透明、进程级、分布式的多机错误恢复方案。检查点由计时器启动,时间间隔由应用程序的需要决定。检查点和恢复只涉及系统中必要的部分:一组交互过程。不属于交互集的进程不参与检查点或恢复,并继续执行有用的工作。多个检查点和/或恢复会话可能同时处于活动状态。该方案在正常操作期间不需要大量开销,因为不需要使消息传输原子化、确认每条消息或与每个数据包一起传输校验位。讨论了使用分组交换或虚拟电路的技术变体,并将该方案与先前发表的技术进行了比较。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Instruction set architecture of an efficient pipelined dataflow architecture An integrated CAD system for algorithm-specific IC design A massive memory supercomputer Extended ASLM-a reconfigurable database machine NS32532: case study in general-purpose microprocessor design tradeoffs
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1