Convergence-aware optimal checkpointing for exploratory deep learning training jobs

Future Generation Computer Systems: The International Journal of eScience · IF 6.2 · CAS Tier 2 (Computer Science) · Q1, Computer Science, Theory & Methods · Pub date: 2024-11-08 · DOI: 10.1016/j.future.2024.107597
Hongliang Li , Zichen Wang , Hairui Zhao , Meng Zhang , Xiang Li , Haixiao Xu

Abstract

Training Deep Learning (DL) models is becoming more time-consuming, so interruptions to the training process are inevitable. For an HPC (High Performance Computing) job, an optimal checkpointing interval that minimizes fault tolerance overhead can be derived under the precondition that job progress is proportional to execution time. Unfortunately, this does not hold for DL model training, where a training job yields diminishing returns across its lifetime. Meanwhile, training DL models is inherently exploratory, and early termination frequently occurs during model training and development. This makes the early progress of a DL training job more valuable than the later progress. Evenly placed checkpoints would therefore either increase risk in the early stages or waste resources overprotecting the later stages. Moreover, in data parallelism, state-of-the-art quality-driven scheduling strategies allocate more resources to the early stages of a job than to the later ones to accelerate training, which further amplifies the issue. In summary, the early stages are more important than the later ones, and allocating more fault-tolerance resources to them benefits model exploration. Based on this observation, we present COCI, an approach that computes an optimal checkpointing configuration for an exploratory DL training job, minimizing the fault tolerance overhead, including checkpoint cost and recovery cost. We implement COCI on top of a state-of-the-art iteration-level checkpointing mechanism, as a pluggable module compatible with PyTorch that requires no extra user input. Experimental results show that COCI reduces fault tolerance overhead by up to 40.18% compared to existing state-of-the-art DL fault tolerance methods in the serial scenario, and by up to 60.64% in the data parallel scenario.
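To make the contrast concrete, the classic fixed interval the abstract alludes to is Young's first-order approximation, sqrt(2 · C · MTBF), which is optimal only when progress is linear in time. The sketch below contrasts it with a convergence-aware placement that spends checkpoints at equal increments of modeled progress rather than equal increments of time. This is purely illustrative, not the COCI algorithm: the saturating progress model p(t) = 1 − exp(−gamma · t) and the parameter `gamma` are assumptions for demonstration, whereas COCI derives its configuration from the job's actual convergence behavior.

```python
import math

def young_interval(ckpt_cost: float, mtbf: float) -> float:
    """Classic first-order optimal checkpoint interval: sqrt(2 * C * MTBF).

    Valid when job progress is proportional to execution time, which is
    the precondition that DL training violates.
    """
    return math.sqrt(2.0 * ckpt_cost * mtbf)

def convergence_aware_checkpoints(total_iters: int, n_ckpts: int,
                                  gamma: float = 0.01) -> list[int]:
    """Place n_ckpts checkpoints at equal increments of *modeled progress*.

    Progress is modeled as p(t) = 1 - exp(-gamma * t), a diminishing-returns
    curve: inverting it yields checkpoints that cluster in the early,
    high-value iterations and thin out later.
    """
    placements = []
    for k in range(1, n_ckpts + 1):
        target = k / (n_ckpts + 1)            # equal progress increments
        t = -math.log(1.0 - target) / gamma   # invert p(t) to get a time
        placements.append(min(total_iters, round(t)))
    return placements

if __name__ == "__main__":
    # Fixed interval: with a 60 s checkpoint cost and a one-day MTBF,
    # Young's rule checkpoints roughly every 54 minutes, uniformly.
    print(f"Young interval: {young_interval(60.0, 86400.0):.0f} s")
    # Convergence-aware: 4 checkpoints over 1000 iterations land early,
    # with the gaps between consecutive checkpoints widening over time.
    print(convergence_aware_checkpoints(1000, 4))
```

Running the sketch shows the intended effect: the convergence-aware placements fall in the first fifth of the job, with strictly widening gaps, mirroring the abstract's argument that early progress deserves denser protection.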
Source journal
CiteScore: 19.90
Self-citation rate: 2.70%
Articles per year: 376
Review time: 10.6 months
Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.