Efficient Checkpointing of Loop-Based Codes for Non-volatile Main Memory

Hussein Elnawawy, Mohammad A. Alshboul, James Tuck, Yan Solihin
Published in: 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017
DOI: 10.1109/PACT.2017.58 (https://doi.org/10.1109/PACT.2017.58)
Citations: 37

Abstract

Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited because they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance.

In this paper, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we log only enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute adds only 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing.
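
The recovery idea in the abstract (record just enough state to know what finished, then recompute the rest) can be illustrated with a toy model. The sketch below is not the authors' implementation. It assumes a hypothetical per-tile completion flag as the only persistent bookkeeping, stands in for NVMM with an ordinary Python dictionary, and injects a failure through a hypothetical `crash_after` parameter. The names `make_state`, `compute_tile`, `run`, and `Crash` are all illustrative.

```python
"""Toy model of recompute-based recovery for a tiled matrix multiply."""

N, T = 4, 2  # matrix size and tile size (illustrative values)


class Crash(Exception):
    """Simulated power failure."""


def make_state():
    # Stand-in for persistent state: the output matrix plus one
    # completion flag per output tile (an assumed bookkeeping scheme).
    tiles = [(i, j) for i in range(0, N, T) for j in range(0, N, T)]
    return {"C": [[0] * N for _ in range(N)],
            "done": {tile: False for tile in tiles}}


def compute_tile(A, B, C, i0, j0):
    # Overwrite the output tile instead of accumulating into it, so that
    # recomputing a partially written tile is idempotent.
    for i in range(i0, i0 + T):
        for j in range(j0, j0 + T):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(N))


def run(A, B, state, crash_after=None):
    """Compute every tile not yet marked done; return how many were computed.

    If crash_after is set, a failure is simulated once that many tiles
    have completed in this call.
    """
    finished = 0
    for i0, j0 in list(state["done"]):
        if state["done"][(i0, j0)]:
            continue  # tile already completed before the failure, skip it
        if crash_after is not None and finished == crash_after:
            state["C"][i0][j0] = -1  # simulate a torn, partial tile write
            raise Crash()
        compute_tile(A, B, state["C"], i0, j0)
        # On real NVMM the tile data would have to be made durable
        # (flush and fence) before this flag is set.
        state["done"][(i0, j0)] = True
        finished += 1
    return finished
```

Calling `run` again on the same state after a `Crash` plays the role of recovery: tiles whose flag is set are skipped, and every other tile, including one left half-written, is recomputed from the read-only inputs. No copy of the old output is ever saved, which is the intuition behind trading extra writes during normal execution for extra work during recovery.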