How to Rent GPUs on a Budget

Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter
{"title":"How to Rent GPUs on a Budget","authors":"Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter","doi":"arxiv-2406.15560","DOIUrl":null,"url":null,"abstract":"The explosion in Machine Learning (ML) over the past ten years has led to a\ndramatic increase in demand for GPUs to train ML models. Because it is\nprohibitively expensive for most users to build and maintain a large GPU\ncluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have\nseen explosive growth in demand for renting cloud-based GPUs. In this\ncloud-computing paradigm, a user must specify their demand for GPUs at every\nmoment in time, and will pay for every GPU-hour they use. ML training jobs are\nknown to be parallelizable to different degrees. Given a stream of ML training\njobs, a user typically wants to minimize the mean response time across all\njobs. Here, the response time of a job denotes the time from when a job arrives\nuntil it is complete. Additionally, the user is constrained by some operating\nbudget. Specifically, in this paper the user is constrained to use no more than\n$b$ GPUs per hour, over a long-run time average. The question is how to\nminimize mean response time while meeting the budget constraint. Because\ntraining jobs receive a diminishing marginal benefit from running on additional\nGPUs, allocating too many GPUs to a single training job can dramatically\nincrease the overall cost paid by the user. Hence, an optimal rental policy\nmust balance a tradeoff between training cost and mean response time. This\npaper derives the optimal rental policy for a stream of training jobs where the\njobs have different levels of parallelizability (specified by a speedup\nfunction) and different job sizes (amounts of inherent work). We make almost no\nassumptions about the arrival process and about the job size distribution. Our\noptimal policy specifies how many GPUs to rent at every moment in time and how\nto allocate these GPUs.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"56 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.15560","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The explosion in Machine Learning (ML) over the past ten years has led to a dramatic increase in demand for GPUs to train ML models. Because it is prohibitively expensive for most users to build and maintain a large GPU cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have seen explosive growth in demand for renting cloud-based GPUs. In this cloud-computing paradigm, a user must specify their demand for GPUs at every moment in time, and will pay for every GPU-hour they use. ML training jobs are known to be parallelizable to different degrees. Given a stream of ML training jobs, a user typically wants to minimize the mean response time across all jobs. Here, the response time of a job denotes the time from when a job arrives until it is complete. Additionally, the user is constrained by some operating budget. Specifically, in this paper the user is constrained to use no more than $b$ GPUs per hour, over a long-run time average. The question is how to minimize mean response time while meeting the budget constraint. Because training jobs receive a diminishing marginal benefit from running on additional GPUs, allocating too many GPUs to a single training job can dramatically increase the overall cost paid by the user. Hence, an optimal rental policy must balance a tradeoff between training cost and mean response time. This paper derives the optimal rental policy for a stream of training jobs where the jobs have different levels of parallelizability (specified by a speedup function) and different job sizes (amounts of inherent work). We make almost no assumptions about the arrival process and about the job size distribution. Our optimal policy specifies how many GPUs to rent at every moment in time and how to allocate these GPUs.
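To make the tradeoff in the abstract concrete, the following is a minimal illustrative sketch (not taken from the paper): it assumes a hypothetical sublinear speedup function s(k) = k^0.5 and a hypothetical job size of 8 GPU-hours of inherent work, and shows how adding GPUs shrinks a job's response time while inflating the total GPU-hours billed, which is what the budget of b GPU-hours per hour constrains.

```python
# Illustrative sketch only; the speedup exponent and job size below are
# hypothetical values, not the paper's model or policy.

def speedup(k: int, p: float = 0.5) -> float:
    """Hypothetical sublinear speedup function s(k) = k**p with p < 1."""
    return k ** p

def runtime(size: float, k: int) -> float:
    """Completion time of a job with `size` units of inherent work on k GPUs."""
    return size / speedup(k)

def gpu_hours(size: float, k: int) -> float:
    """Total GPU-hours billed: k GPUs held for the job's entire runtime."""
    return k * runtime(size, k)

if __name__ == "__main__":
    size = 8.0  # inherent work, in GPU-hours on a single GPU (hypothetical)
    print(f"{'GPUs':>4}  {'runtime (h)':>12}  {'GPU-hours':>10}")
    for k in (1, 2, 4, 8, 16):
        print(f"{k:>4}  {runtime(size, k):>12.2f}  {gpu_hours(size, k):>10.2f}")
```

Under these assumptions, 16 GPUs cut the runtime from 8 hours to 2 hours but quadruple the GPU-hours consumed from 8 to 32, which is the diminishing-marginal-benefit effect that forces an optimal rental policy to balance training cost against mean response time.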