ITALC: Interactive Tool for Application-Level Checkpointing

Proceedings of the Fourth International Workshop on HPC User Support Tools. Pub Date: 2017-11-12. DOI: 10.1145/3152493.3152558

R. Arora, Trung Nguyen Ba


Citations: 11

Abstract

The computational resources at open-science supercomputing centers are shared among multiple users at any given time, and hence are governed by policies that ensure their fair and optimal usage. Such policies can impose upper limits on (1) the number of compute nodes, and (2) the wall-clock time that can be requested per computational job. Given these limits, some applications may not run to completion in a single session. As a workaround, users are advised to take advantage of the checkpoint-and-restart technique and spread their computations across multiple interdependent computational jobs. The checkpoint-and-restart technique saves the execution state of an application periodically. A saved state is known as a checkpoint. When a computational job times out after running for the maximum wall-clock time, leaving the computation incomplete, the user can submit a new job that resumes the computation from the checkpoints saved during the previous job runs. The checkpoint-and-restart technique can also be useful for making applications tolerant to certain types of faults, viz., network and compute-node failures. When this technique is built into an application itself, it is called Application-Level Checkpointing (ALC). We are developing an interactive tool to assist users in semi-automatically inserting the ALC mechanism into their existing applications without doing any manual reengineering. Compared to other approaches for checkpointing, the checkpoints written with our tool have a smaller memory footprint, and thus incur a smaller I/O overhead.
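To make the checkpoint-and-restart idea concrete, here is a minimal sketch of application-level checkpointing written by hand in Python. It is not the code that ITALC generates, and every name in it (`save_checkpoint`, `load_checkpoint`, `run_job`, the JSON file layout, the step budget standing in for a wall-clock limit) is an assumption made for illustration. Only the loop counter and the partial result are saved, which is the sense in which an application-level checkpoint can be small.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem, so a crash never leaves a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run_job(path, total_steps, steps_per_job, checkpoint_every):
    """Run one 'job' of a long computation (sum of squares), resuming from a checkpoint.

    steps_per_job is a hypothetical stand-in for the wall-clock limit; it should be
    at least checkpoint_every, otherwise no job would ever save new progress.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    budget_end = min(total_steps, state["step"] + steps_per_job)
    while state["step"] < budget_end:
        state["total"] += state["step"] ** 2  # the actual work of one step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    # Work done after the last checkpoint is lost if the job ends here; the next job redoes it.
    finished = state["step"] >= total_steps
    return finished, state["total"]
```

A driver would call `run_job` repeatedly with the same checkpoint path, one call per submitted job, until it reports that the computation has finished. The final result matches an uninterrupted run because each job restarts from the last saved state rather than from whatever was in memory when the previous job timed out.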

