Replaying distributed programs without message logging

Proceedings. The Sixth IEEE International Symposium on High Performance Distributed Computing (Cat. No.97TB100183)
Pub Date: 1997-08-05
DOI: 10.1109/HPDC.1997.622370

Robert H. B. Netzer, Yikang Xu


Citations: 20

Abstract

Debugging long program runs can be difficult because of the delays required to repeatedly re-run the execution. Even a moderately long run of five minutes can incur aggravating delays. To address this problem, techniques exist that allow re-executing a distributed program from intermediate points by using combinations of checkpointing and message logging. In this paper we explore another idea: how to support replay without logging the contents of any message. When no messages are logged, the set of global states from which replay is possible is constrained, and it has been unknown how to compute this set without exhaustively searching the space of all global states, whose size is exponential in the number of processes. We present a simple and efficient hybrid on-the-fly/post-mortem algorithm for detecting the necessary and sufficient conditions under which parts of the execution can be replayed without message logs. A small amount of trace (two vectors) is recorded at each checkpoint and a fast post-mortem algorithm computes global states from which replay can begin. This algorithm is independent of the checkpointing technique used.
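
The abstract does not say what the two vectors recorded at each checkpoint contain, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes FIFO channels and a hypothetical trace format in which each checkpoint stores a `sent` vector and a `recv` vector of per-peer message counts. Under those assumptions, one natural condition for replaying from a global state without message logs is that no message crosses the cut, which holds when the counts match on every channel. The function names `replayable_without_logs` and `brute_force_cuts` are illustrative. The second function is the exhaustive search that the abstract describes as exponential in the number of processes, shown only as a baseline for comparison.

```python
from itertools import product

# Hypothetical trace format (an assumption, not taken from the paper):
# each checkpoint is a pair (sent, recv) of vectors indexed by process id.
#   sent[j] = number of messages this process had sent to process j
#   recv[j] = number of messages this process had received from process j
# Channels are assumed to be FIFO, so equal counts imply an empty channel.


def replayable_without_logs(cut):
    """Return True if no message crosses the cut.

    cut[i] is the (sent, recv) pair of the checkpoint chosen for process i.
    A mismatch on channel i -> j means a message was either in transit
    (sent before the cut, received after) or orphaned (received before the
    cut, sent after), so replay would need a logged message.
    """
    n = len(cut)
    for i in range(n):
        for j in range(n):
            if i != j and cut[i][0][j] != cut[j][1][i]:
                return False
    return True


def brute_force_cuts(checkpoints):
    """Check every combination of one checkpoint per process.

    checkpoints[i] is the list of (sent, recv) pairs taken by process i.
    The number of combinations is the product of the list lengths, which
    grows exponentially with the number of processes.
    """
    index_ranges = [range(len(c)) for c in checkpoints]
    return [
        idx
        for idx in product(*index_ranges)
        if replayable_without_logs(
            [checkpoints[i][k] for i, k in enumerate(idx)]
        )
    ]
```

For example, with two processes where process 0 sends one message to process 1 between its first and second checkpoints, and process 1 receives it between its own first and second checkpoints, only the cuts that pair the two "before" checkpoints or the two "after" checkpoints pass the check.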

