ITALC: Interactive Tool for Application-Level Checkpointing

R. Arora, Trung Nguyen Ba
{"title":"ITALC: Interactive Tool for Application-Level Checkpointing","authors":"R. Arora, Trung Nguyen Ba","doi":"10.1145/3152493.3152558","DOIUrl":null,"url":null,"abstract":"The computational resources at open-science supercomputing centers are shared among multiple users at a given time, and hence are governed by policies that ensure their fair and optimal usage. Such policies can impose upper-limits on (1) the number of compute-nodes, and (2) the wall-clock time that can be requested per computational job. Given these limits on computational jobs, several applications may not run to completion in a single session. Therefore, as a workaround, the users are advised to take advantage of the checkpoint-and-restart technique and spread their computations across multiple interdependent computational jobs. The checkpoint-and-restart technique helps in saving the execution state of the applications periodically. A saved state is known as a checkpoint. When their computational jobs time-out after running for the maximum wall-clock time, while leaving their computations incomplete, the users can submit new jobs to resume their computations using the checkpoints saved during their previous job runs. The checkpoint-and-restart technique can also be useful for making the applications tolerant to certain types of faults, viz., network and compute-node failures. When this technique is built within an application itself, it is called Application-Level Checkpointing (ALC). We are developing an interactive tool to assist the users in semi-automatically inserting the ALC mechanism into their existing applications without doing any manual reengineering. As compared to other approaches for checkpointing, the checkpoints written with our tool have smaller memory footprint, and thus, incur a smaller I/O overhead.","PeriodicalId":258031,"journal":{"name":"Proceedings of the Fourth International Workshop on HPC User Support Tools","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Fourth International Workshop on HPC User Support Tools","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3152493.3152558","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

The computational resources at open-science supercomputing centers are shared among multiple users at a given time, and hence are governed by policies that ensure their fair and optimal usage. Such policies can impose upper-limits on (1) the number of compute-nodes, and (2) the wall-clock time that can be requested per computational job. Given these limits on computational jobs, several applications may not run to completion in a single session. Therefore, as a workaround, the users are advised to take advantage of the checkpoint-and-restart technique and spread their computations across multiple interdependent computational jobs. The checkpoint-and-restart technique helps in saving the execution state of the applications periodically. A saved state is known as a checkpoint. When their computational jobs time-out after running for the maximum wall-clock time, while leaving their computations incomplete, the users can submit new jobs to resume their computations using the checkpoints saved during their previous job runs. The checkpoint-and-restart technique can also be useful for making the applications tolerant to certain types of faults, viz., network and compute-node failures. When this technique is built within an application itself, it is called Application-Level Checkpointing (ALC). We are developing an interactive tool to assist the users in semi-automatically inserting the ALC mechanism into their existing applications without doing any manual reengineering. As compared to other approaches for checkpointing, the checkpoints written with our tool have smaller memory footprint, and thus, incur a smaller I/O overhead.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
用于应用程序级检查点的交互式工具
开放科学超级计算中心的计算资源在给定时间内由多个用户共享,因此由确保公平和最佳使用的策略进行管理。这些策略可以对(1)计算节点的数量和(2)每个计算作业可以请求的挂钟时间施加上限。考虑到计算作业的这些限制,多个应用程序可能无法在一个会话中运行完成。因此,作为一种解决方案,建议用户利用检查点-重新启动技术,并将其计算分散到多个相互依赖的计算作业中。检查点和重启技术有助于定期保存应用程序的执行状态。保存的状态称为检查点。当他们的计算作业在运行了最长的时钟时间后超时,而计算仍未完成时,用户可以提交新作业,使用在以前的作业运行期间保存的检查点恢复计算。检查点和重新启动技术对于使应用程序能够容忍某些类型的故障(即网络和计算节点故障)也很有用。当这种技术在应用程序本身内构建时,它被称为应用程序级检查点(ALC)。我们正在开发一个交互式工具,以帮助用户半自动地将ALC机制插入到他们现有的应用程序中,而无需进行任何手动重新设计。与其他检查点方法相比,使用我们的工具编写的检查点占用更小的内存,因此产生更小的I/O开销。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs Nix as HPC package management system Testpilot: A Flexible Framework for User-centric Testing of HPC Clusters An Edge Service for Managing HPC Workflows ITALC: Interactive Tool for Application-Level Checkpointing
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1