How to Rent GPUs on a Budget

Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter
{"title":"How to Rent GPUs on a Budget","authors":"Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter","doi":"arxiv-2406.15560","DOIUrl":null,"url":null,"abstract":"The explosion in Machine Learning (ML) over the past ten years has led to a\ndramatic increase in demand for GPUs to train ML models. Because it is\nprohibitively expensive for most users to build and maintain a large GPU\ncluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have\nseen explosive growth in demand for renting cloud-based GPUs. In this\ncloud-computing paradigm, a user must specify their demand for GPUs at every\nmoment in time, and will pay for every GPU-hour they use. ML training jobs are\nknown to be parallelizable to different degrees. Given a stream of ML training\njobs, a user typically wants to minimize the mean response time across all\njobs. Here, the response time of a job denotes the time from when a job arrives\nuntil it is complete. Additionally, the user is constrained by some operating\nbudget. Specifically, in this paper the user is constrained to use no more than\n$b$ GPUs per hour, over a long-run time average. The question is how to\nminimize mean response time while meeting the budget constraint. Because\ntraining jobs receive a diminishing marginal benefit from running on additional\nGPUs, allocating too many GPUs to a single training job can dramatically\nincrease the overall cost paid by the user. Hence, an optimal rental policy\nmust balance a tradeoff between training cost and mean response time. This\npaper derives the optimal rental policy for a stream of training jobs where the\njobs have different levels of parallelizability (specified by a speedup\nfunction) and different job sizes (amounts of inherent work). We make almost no\nassumptions about the arrival process and about the job size distribution. Our\noptimal policy specifies how many GPUs to rent at every moment in time and how\nto allocate these GPUs.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"56 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.15560","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The explosion in Machine Learning (ML) over the past ten years has led to a dramatic increase in demand for GPUs to train ML models. Because it is prohibitively expensive for most users to build and maintain a large GPU cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have seen explosive growth in demand for renting cloud-based GPUs. In this cloud-computing paradigm, a user must specify their demand for GPUs at every moment in time, and will pay for every GPU-hour they use. ML training jobs are known to be parallelizable to different degrees. Given a stream of ML training jobs, a user typically wants to minimize the mean response time across all jobs. Here, the response time of a job denotes the time from when a job arrives until it is complete. Additionally, the user is constrained by some operating budget. Specifically, in this paper the user is constrained to use no more than $b$ GPUs per hour, over a long-run time average. The question is how to minimize mean response time while meeting the budget constraint. Because training jobs receive a diminishing marginal benefit from running on additional GPUs, allocating too many GPUs to a single training job can dramatically increase the overall cost paid by the user. Hence, an optimal rental policy must balance a tradeoff between training cost and mean response time. This paper derives the optimal rental policy for a stream of training jobs where the jobs have different levels of parallelizability (specified by a speedup function) and different job sizes (amounts of inherent work). We make almost no assumptions about the arrival process and about the job size distribution. Our optimal policy specifies how many GPUs to rent at every moment in time and how to allocate these GPUs.
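To make the tradeoff in the abstract concrete, the following is a minimal illustrative sketch (not taken from the paper): it assumes a hypothetical sublinear speedup function s(k) = k^0.5 and a hypothetical job size of 8 GPU-hours of inherent work, and shows how adding GPUs shrinks a job's response time while inflating the total GPU-hours billed, which is what the budget of b GPU-hours per hour constrains.

```python
# Illustrative sketch only; the speedup exponent and job size below are
# hypothetical values, not the paper's model or policy.

def speedup(k: int, p: float = 0.5) -> float:
    """Hypothetical sublinear speedup function s(k) = k**p with p < 1."""
    return k ** p

def runtime(size: float, k: int) -> float:
    """Completion time of a job with `size` units of inherent work on k GPUs."""
    return size / speedup(k)

def gpu_hours(size: float, k: int) -> float:
    """Total GPU-hours billed: k GPUs held for the job's entire runtime."""
    return k * runtime(size, k)

if __name__ == "__main__":
    size = 8.0  # inherent work, in GPU-hours on a single GPU (hypothetical)
    print(f"{'GPUs':>4}  {'runtime (h)':>12}  {'GPU-hours':>10}")
    for k in (1, 2, 4, 8, 16):
        print(f"{k:>4}  {runtime(size, k):>12.2f}  {gpu_hours(size, k):>10.2f}")
```

Under these assumptions, 16 GPUs cut the runtime from 8 hours to 2 hours but quadruple the GPU-hours consumed from 8 to 32, which is the diminishing-marginal-benefit effect that forces an optimal rental policy to balance training cost against mean response time.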