Rebound: Scalable checkpointing for coherent shared memory

2011 38th Annual International Symposium on Computer Architecture (ISCA). Pub Date: 2011-06-04. DOI: 10.1145/2000064.2000083

Rishi Agarwal, P. Garg, J. Torrellas

Citations: 34

Abstract

As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rolling back sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
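To make the idea of "dynamic groups of communicating processors" concrete, the following is a minimal software sketch, not the hardware mechanism described in the paper. It assumes that every communication between two processors is reported to a tracker (in Rebound this information would come from directory-protocol transactions), and it simplifies dependences to be symmetric. The class `DependenceTracker` and its methods are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


class DependenceTracker:
    """Toy model of inter-processor dependences since the last checkpoint.

    Assumption: dependences are treated as symmetric, so a group is a
    connected component of the communication graph.
    """

    def __init__(self):
        # neighbors[p] = processors that p has communicated with
        # since p's most recent checkpoint.
        self.neighbors = defaultdict(set)

    def record_communication(self, producer, consumer):
        """Record that `consumer` read data written by `producer`."""
        if producer != consumer:
            self.neighbors[producer].add(consumer)
            self.neighbors[consumer].add(producer)

    def interaction_set(self, initiator):
        """Return every processor transitively linked to `initiator`."""
        seen = {initiator}
        stack = [initiator]
        while stack:
            p = stack.pop()
            for q in self.neighbors.get(p, ()):
                if q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    def checkpoint(self, initiator):
        """Checkpoint only the initiator's group and clear its dependences."""
        group = self.interaction_set(initiator)
        for p in group:
            # The group is closed under the neighbor relation, so no
            # processor outside it refers to any member.
            self.neighbors.pop(p, None)
        return group


tracker = DependenceTracker()
tracker.record_communication(0, 1)
tracker.record_communication(1, 2)
tracker.record_communication(5, 6)
print(sorted(tracker.checkpoint(1)))  # [0, 1, 2]
```

In this toy model, processors 5 and 6 are unaffected when processor 1 initiates a checkpoint, which is the property that distinguishes a local scheme from a global one where all processors would take part.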
