Reliability of Large Scale GPU Clusters for Deep Learning Workloads

Companion Proceedings of the Web Conference 2021. Pub Date: 2021-04-19. DOI: 10.1145/3442442.3452056

Junjie Qian, Taeyoon Kim, Myeongjae Jeon


Citations: 2

Abstract

Recent advances in deep learning have made GPU clusters popular as training platforms. In this paper, we study reliability issues, focusing on training job failures, by analyzing logs collected from deep learning workloads running on a large-scale production GPU cluster. We group these failures into two broad categories by source, infrastructure and user, and find diverse causes behind them. Drawing on insights from the failure analysis, we suggest several ways to improve the stability of shared GPU clusters designed for DL training and to improve user experience by reducing failure occurrences.
