Xonar: Profiling-based Job Orderer for Distributed Deep Learning

IEEE Cloud Computing (Q1, Computer Science) · Pub Date: 2022-07-01 · DOI: 10.1109/CLOUD55607.2022.00030

Changyong Shin, Gyeongsik Yang, Yeonho Yoo, J. Lee, C. Yoo

Pages: 112-114 · Publication type: Journal Article · Source: Semantic Scholar · https://doi.org/10.1109/CLOUD55607.2022.00030

Citations: 3

Abstract

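Deep learning models span a wide range of GPU execution times and memory sizes. When distributed training jobs are run, however, neither their GPU execution time nor their memory size is taken into account, which leads to high variance in job completion time (JCT). Moreover, jobs often run into the GPU out-of-memory (OoM) problem, forcing the unlucky job to restart from scratch. To address these problems, we propose Xonar, which profiles deep learning jobs and orders them in the queue. Experiments show that Xonar with TensorFlow v1.6 reduces the tail JCT by 44% while eliminating the OoM problem.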
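The abstract does not spell out how Xonar orders the queue, so the following is only a minimal sketch of the general idea of profile-driven ordering, not the authors' algorithm. It assumes each job has already been profiled for GPU execution time and peak GPU memory; the `JobProfile` fields, the shortest-time-first policy, the memory admission check, and all job names and numbers are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobProfile:
    """Profiled characteristics of one training job (hypothetical fields)."""
    name: str
    gpu_time_s: float   # profiled GPU execution time, in seconds
    peak_mem_gb: float  # profiled peak GPU memory, in GB


def order_jobs(profiles):
    """Assumed policy: shortest profiled GPU time first, ties by smaller memory."""
    return sorted(profiles, key=lambda p: (p.gpu_time_s, p.peak_mem_gb))


def admit(queue, free_mem_gb):
    """Admit jobs in queue order only if their profiled peak memory fits.

    Jobs that do not fit stay queued instead of starting and hitting OoM.
    """
    admitted, waiting = [], []
    for job in queue:
        if job.peak_mem_gb <= free_mem_gb:
            admitted.append(job)
            free_mem_gb -= job.peak_mem_gb
        else:
            waiting.append(job)
    return admitted, waiting


# Illustrative numbers only; they are not measurements from the paper.
jobs = [
    JobProfile("job_a", 120.0, 9.0),
    JobProfile("job_b", 45.0, 3.0),
    JobProfile("job_c", 200.0, 11.0),
    JobProfile("job_d", 10.0, 1.0),
]
queue = order_jobs(jobs)
admitted, waiting = admit(queue, free_mem_gb=12.0)
print([j.name for j in queue])
print([j.name for j in admitted], [j.name for j in waiting])
```

Under these assumptions, short jobs are no longer stuck behind long ones, and a job whose profiled memory exceeds the free GPU memory waits in the queue rather than failing and restarting.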

Source journal

IEEE Cloud Computing Computer Science-Computer Networks and Communications

CiteScore

11.20

Self-citation rate

0.00%

Journal description: Cessation. IEEE Cloud Computing is committed to the timely publication of peer-reviewed articles that provide innovative research ideas, application results, and case studies in all areas of cloud computing. Topics relating to novel theory, algorithms, performance analyses, and applications of techniques are covered. More specifically: Cloud software, Cloud security, Trade-offs between privacy and utility of cloud, Cloud in the business environment, Cloud economics, Cloud governance, Migrating to the cloud, Cloud standards, Development tools, Backup and recovery, Interoperability, Applications management, Data analytics, Communications protocols, Mobile cloud, Private clouds, Liability issues for data loss on clouds, Data integration, Big data, Cloud education, Cloud skill sets, Cloud energy consumption, The architecture of cloud computing, Applications in commerce, education, and industry, Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), Business Process as a Service (BPaaS).