Design and Study of Elastic Recovery in HPC Applications

2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC). Pub Date: 2020-12-01. DOI: 10.1109/HiPC50609.2020.00040

Kai Keller, K. Parasyris, L. Bautista-Gomez


Citations: 0


Design and Study of Elastic Recovery in HPC Applications

The efficient utilization of current supercomputing systems with deep storage hierarchies demands scientific applications that are capable of leveraging such heterogeneous hardware. Fault tolerance, and checkpointing in particular, is one of the most time-consuming aspects if not handled correctly. High checkpoint performance can be achieved using optimized multilevel checkpoint and restart libraries. Unfortunately, those libraries do not allow for restarts with a modified number of processes or scientific post-processing of the checkpointed data. This is because they typically use an N-N checkpointing scheme and opaque file-formats. In this article, we present a novel mechanism to asynchronously store checkpoints into a self-descriptive file format and load the data upon recovery with a different number of processes. We provide an API that defines the process-local data as part of a globally shared dataset. Our measurements demonstrate a low overhead between 0.6% and 2.5% for a 2.25 TB checkpoint with 6K processes.
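The idea of treating process-local data as a slice of one globally shared dataset can be illustrated with a small sketch. This is not the paper's API: the function names `block_range`, `checkpoint`, and `recover` are hypothetical, the dataset is a plain in-memory list standing in for a self-descriptive file, and a simple 1-D block decomposition is assumed. It only shows why storing global offsets, rather than one opaque file per process (N-N), makes a restart with a different process count possible.

```python
# Illustrative sketch only: all helpers here are hypothetical, not the paper's API.

def block_range(global_size, nprocs, rank):
    """Return the half-open [start, stop) range of a 1-D global dataset owned by `rank`."""
    base, extra = divmod(global_size, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def checkpoint(local_chunks, global_size):
    """Place each process-local chunk at its global offset in one shared dataset."""
    nprocs = len(local_chunks)
    dataset = [None] * global_size
    for rank, chunk in enumerate(local_chunks):
        start, stop = block_range(global_size, nprocs, rank)
        assert len(chunk) == stop - start
        dataset[start:stop] = chunk
    return dataset


def recover(dataset, nprocs):
    """Re-partition the shared dataset for a possibly different number of processes."""
    size = len(dataset)
    return [dataset[slice(*block_range(size, nprocs, r))] for r in range(nprocs)]


global_size = 10
data = list(range(global_size))

# Checkpoint written by 4 processes ...
chunks_4 = [data[slice(*block_range(global_size, 4, r))] for r in range(4)]
stored = checkpoint(chunks_4, global_size)

# ... and recovered by 3 processes.
chunks_3 = recover(stored, 3)
print(chunks_3)
```

Because every chunk is addressed by its position in the global dataset, the number of readers is independent of the number of writers; a real implementation would express the same mapping through the file format's own dataset and selection mechanisms.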
