SReplay: Deterministic Sub-Group Replay for One-Sided Communication

Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926264

Xuehai Qian, Koushik Sen, Paul H. Hargrove, Costin Iancu


Citations: 5

Abstract

HPC debuggers and resilience mechanisms require replay of parallel execution. To date, there is no deterministic replay solution for one-sided communication. The essential problem is that readers of updated data have no information about which remote threads produced the updates, and conventional happens-before-based order-tracking techniques are difficult to make work at scale. This paper presents SReplay, the first software tool for sub-group deterministic record and replay for one-sided communication. SReplay allows the user to specify and record the execution of a set of threads of interest (a sub-group), and then deterministically replays the execution of the sub-group on a local machine without starting the remaining threads. SReplay ensures sub-group determinism using a hybrid data- and order-replay technique. SReplay maintains scalability through a combination of local logging and approximate event-order tracking within the sub-group. Our evaluation on deterministic and nondeterministic UPC programs shows that SReplay introduces an overhead ranging from 1.3x to 29x when running on 1,024 cores and tracking up to 16 threads.
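The abstract does not describe SReplay's internals, so the following is only a toy sketch of the general idea behind sub-group record and replay, not the tool's implementation. It assumes a hypothetical runtime with `put` and `get` operations on a shared dictionary. In record mode, every value read by a tracked thread is appended to that thread's local log (data replay), and the sequence of tracked-thread events is logged as well (order replay). In replay mode, only the tracked threads run and their reads are served from the log. All class and function names here (`Recorder`, `Replayer`, `tracked_thread`) are illustrative assumptions.

```python
from collections import defaultdict, deque


class Recorder:
    """Record mode: all threads run, but only sub-group activity is logged."""

    def __init__(self, subgroup):
        self.subgroup = set(subgroup)
        self.memory = {}                    # stand-in for the shared address space
        self.read_log = defaultdict(list)   # per-thread local log of values read
        self.order_log = []                 # order of events issued by tracked threads

    def put(self, tid, addr, value):
        self.memory[addr] = value
        if tid in self.subgroup:
            self.order_log.append(tid)

    def get(self, tid, addr):
        value = self.memory.get(addr, 0)
        if tid in self.subgroup:
            # The reader cannot tell who wrote the value, so it logs the value itself.
            self.read_log[tid].append(value)
            self.order_log.append(tid)
        return value


class Replayer:
    """Replay mode: only tracked threads run; reads are served from the log."""

    def __init__(self, read_log, order_log):
        self.reads = {tid: deque(vals) for tid, vals in read_log.items()}
        self.order = deque(order_log)

    def _check_turn(self, tid):
        expected = self.order.popleft()
        if expected != tid:
            raise RuntimeError(f"replay diverged: expected thread {expected}, got {tid}")

    def put(self, tid, addr, value):
        # Writes need no logged data; only their position in the order is checked.
        self._check_turn(tid)

    def get(self, tid, addr):
        self._check_turn(tid)
        return self.reads[tid].popleft()


def tracked_thread(runtime, tid=0):
    """Hypothetical thread of interest: reads x, writes y = x + 1."""
    x = runtime.get(tid, "x")
    runtime.put(tid, "y", x + 1)
    return x + 1


# Record: thread 7 is outside the sub-group and produces the value thread 0 reads.
rec = Recorder(subgroup=[0])
rec.put(7, "x", 41)
recorded = tracked_thread(rec)

# Replay: thread 7 is never started, yet thread 0 observes the same value.
rep = Replayer(rec.read_log, rec.order_log)
replayed = tracked_thread(rep)
```

Logging every read value keeps this sketch simple but would be expensive in practice. A hybrid scheme such as the one the abstract names would presumably log less, for example by relying on recorded order rather than recorded data where possible; the abstract does not say how that split is made.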
