Efficient Checkpointing of Loop-Based Codes for Non-volatile Main Memory

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2017-09-01 DOI:10.1109/PACT.2017.58

Hussein Elnawawy, Mohammad A. Alshboul, James Tuck, Yan Solihin

{"title":"Efficient Checkpointing of Loop-Based Codes for Non-volatile Main Memory","authors":"Hussein Elnawawy, Mohammad A. Alshboul, James Tuck, Yan Solihin","doi":"10.1109/PACT.2017.58","DOIUrl":null,"url":null,"abstract":"Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance.In this paper, we propose a novel recompute-based failure safety approach, and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"37","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2017.58","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 37

Abstract

Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance.In this paper, we propose a novel recompute-based failure safety approach, and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

非易失性主存中基于循环代码的有效检查点

未来的主存储器可能包括非易失性存储器。非易失性主存储器(NVMM)提供了重新思考为应用程序提供故障安全的检查点策略的机会。虽然文献中有许多检查点和日志记录方案，但必须重新考虑它们的使用，因为它们会导致高执行时间开销以及对NVMM的大量额外写入，这可能会显著影响写入持久性。在本文中，我们提出了一种新的基于重新计算的失效安全方法，并证明了它对基于循环的代码的适用性。而不是保持完全一致的日志记录状态，我们只记录足够的状态以启用重新计算。在发生故障时，我们的方法通过确定计算的哪些部分没有完成并重新计算它们来恢复到一致状态。实际上，我们的方法消除了保留检查点或日志的需要，从而减少了执行时间开销并提高了NVMM写入持久性，但代价是更复杂的恢复。我们在一个基于gem5构建并支持Intel PMEM指令扩展的计算机系统模型上，比较了我们的新方法在五种科学工作负载(包括平铺矩阵乘法)上的日志记录和检查点。对于平铺矩阵乘法，我们的重计算方法的执行时间开销仅为5%，而日志记录的开销为8%，检查点的开销为207%。此外，recompute只增加了7%的额外NVMM写操作，而日志记录和检查点分别增加了111%和330%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)

自引率

0.00%

发文量