Replaying distributed programs without message logging

Robert H. B. Netzer, Yikang Xu
{"title":"Replaying distributed programs without message logging","authors":"Robert H. B. Netzer, Yikang Xu","doi":"10.1109/HPDC.1997.622370","DOIUrl":null,"url":null,"abstract":"Debugging long program runs can be difficult because of the delays required to repeatedly re-run the execution. Even a moderately long run of five minutes can incur aggravating delays. To address this problem, techniques exist that allow re-executing a distributed program from intermediate points by using combinations of checkpointing and message logging. In this paper we explore another idea: how to support replay without logging the contents of any message. When no messages are logged, the set of global states from which replay is possible is constrained, and it has been unknown how to compute this set without exhaustively searching the space of all global states, whose size is exponential in the number of processes. We present a simple and efficient hybrid on-the-fly/post-mortem algorithm for detecting the necessary and sufficient conditions under which parts of the execution can be replayed without message logs. A small amount of trace (two vectors) is recorded at each checkpoint and a fast post-mortem algorithm computes global states from which replay can begin. This algorithm is independent of the checkpointing technique used.","PeriodicalId":243171,"journal":{"name":"Proceedings. The Sixth IEEE International Symposium on High Performance Distributed Computing (Cat. No.97TB100183)","volume":"57 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1997-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. The Sixth IEEE International Symposium on High Performance Distributed Computing (Cat. No.97TB100183)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPDC.1997.622370","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

Abstract

Debugging long program runs can be difficult because of the delays required to repeatedly re-run the execution. Even a moderately long run of five minutes can incur aggravating delays. To address this problem, techniques exist that allow re-executing a distributed program from intermediate points by using combinations of checkpointing and message logging. In this paper we explore another idea: how to support replay without logging the contents of any message. When no messages are logged, the set of global states from which replay is possible is constrained, and it has been unknown how to compute this set without exhaustively searching the space of all global states, whose size is exponential in the number of processes. We present a simple and efficient hybrid on-the-fly/post-mortem algorithm for detecting the necessary and sufficient conditions under which parts of the execution can be replayed without message logs. A small amount of trace (two vectors) is recorded at each checkpoint and a fast post-mortem algorithm computes global states from which replay can begin. This algorithm is independent of the checkpointing technique used.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
在没有消息记录的情况下重播分布式程序
由于反复重新运行执行所需的延迟,调试长时间的程序运行可能很困难。即使是中等长度的5分钟也会导致严重的延误。为了解决这个问题,现有的技术允许使用检查点和消息日志的组合从中间点重新执行分布式程序。在本文中,我们探索了另一个想法:如何在不记录任何消息内容的情况下支持重播。当没有记录任何消息时,可能重播的全局状态集受到约束,并且不知道如何在不彻底搜索所有全局状态空间(其大小与进程数量呈指数关系)的情况下计算该集合。我们提出了一种简单有效的实时/事后混合算法,用于检测必要和充分的条件,在这些条件下,可以在没有消息日志的情况下重播部分执行。在每个检查点记录少量的跟踪(两个向量),快速的事后分析算法计算重播可以开始的全局状态。该算法独立于所使用的检查点技术。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Design patterns for parallel computing using a network of processors Forecasting network performance to support dynamic scheduling using the network weather service Performance aspects of switched SCI systems Utilizing heterogeneous networks in distributed parallel computing systems Cut-through delivery in Trapeze: An exercise in low-latency messaging
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1