ITALC: Interactive Tool for Application-Level Checkpointing
R. Arora, Trung Nguyen Ba
Proceedings of the Fourth International Workshop on HPC User Support Tools, 2017-11-12
DOI: 10.1145/3152493.3152558 (https://doi.org/10.1145/3152493.3152558)
Citations: 11
Abstract
The computational resources at open-science supercomputing centers are shared among multiple users at any given time, and are therefore governed by policies that ensure their fair and optimal usage. Such policies can impose upper limits on (1) the number of compute nodes and (2) the wall-clock time that can be requested per computational job. Given these limits, some applications may not run to completion in a single session. As a workaround, users are advised to use the checkpoint-and-restart technique and spread their computations across multiple interdependent jobs. The technique periodically saves the execution state of an application; a saved state is known as a checkpoint. When a job times out after running for the maximum wall-clock time and leaves its computation incomplete, the user can submit a new job that resumes the computation from the checkpoints saved during the previous job runs. Checkpoint-and-restart can also make applications tolerant to certain types of faults, namely network and compute-node failures. When the technique is built into the application itself, it is called Application-Level Checkpointing (ALC). We are developing an interactive tool that assists users in semi-automatically inserting the ALC mechanism into their existing applications without any manual reengineering. Compared with other checkpointing approaches, the checkpoints written with our tool have a smaller memory footprint and thus incur a smaller I/O overhead.
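To make the checkpoint-and-restart idea concrete, the following is a minimal sketch of application-level checkpointing written by hand in Python. It is not code generated by ITALC, and it does not reflect the tool's interface or file format. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file layout, and the `max_steps_this_job` parameter (a stand-in for the wall-clock limit) are all assumptions made for illustration. The sketch saves only the variables needed to resume (a loop index and a partial sum), which is the general reason an application-level checkpoint can be smaller than a full process image.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old
    checkpoint, so an interrupted write never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(total_steps, checkpoint_every, path, max_steps_this_job=None):
    """Sum the squares of 0..total_steps-1, resuming from a checkpoint.

    max_steps_this_job is a hypothetical stand-in for the job's
    wall-clock limit. Returns (state, finished).
    """
    # Resume from the previous job's checkpoint if there is one.
    state = load_checkpoint(path) or {"next_step": 0, "partial_sum": 0}
    steps_done = 0
    while state["next_step"] < total_steps:
        if max_steps_this_job is not None and steps_done >= max_steps_this_job:
            # The simulated time limit was reached: save and stop.
            save_checkpoint(path, state)
            return state, False
        i = state["next_step"]
        state["partial_sum"] += i * i
        state["next_step"] = i + 1
        steps_done += 1
        # Periodic checkpoint: only the loop index and the partial sum.
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    # The computation is complete, so the checkpoint is no longer needed.
    if os.path.exists(path):
        os.remove(path)
    return state, True
```

Calling `run(10, 3, path, max_steps_this_job=4)` stops early and leaves a checkpoint behind; a second call `run(10, 3, path)` with the same `path` picks up at the saved step and finishes the sum, mirroring how a follow-up job resumes the work of one that timed out.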