Reducing the Cost of GPU Cold Starts in Serverless Deep Learning Inference Serving

Justin San Juan, B. Wong
{"title":"Reducing the Cost of GPU Cold Starts in Serverless Deep Learning Inference Serving","authors":"Justin San Juan, B. Wong","doi":"10.1109/PerComWorkshops56833.2023.10150381","DOIUrl":null,"url":null,"abstract":"The rapid growth of Deep Learning (DL) has led to increasing demand for DL-as-a-Service. In this paradigm, DL inferences are served on-demand through a serverless cloud provider, which manages the scaling of hardware resources to satisfy dynamic workloads. This is enticing to businesses due to lower infrastructure management costs compared to dedicated on-site hosting. However, current serverless systems suffer from long cold starts where requests are queued until a server can be initialized with the DL model, which is especially problematic due to large DL model sizes. In addition, low-latency demands such as in real-time fraud detection and algorithmic trading cause long inferences in CPU-only systems to violate deadlines. To tackle this, current systems rely on over-provisioning expensive GPU resources to meet low-latency requirements, thus increasing the total cost of ownership for cloud service providers. In this work, we characterize the cold start problem in GPU-accelerated serverless systems. We then design and evaluate novel solutions based on two main techniques. Namely, we propose remote memory pooling and hierarchical sourcing with locality-aware autoscaling where we exploit underutilized memory and network resources to store and prioritize sourcing the DL model from existing host machines over remote host memory then cloud storage. We demonstrate through simulations that these techniques can perform up to 19.3× and 1.4× speedup in 99th percentile and median end-to-end latencies respectively compared to a baseline. Such speedups enable serverless systems to meet low-latency requirements despite dynamic workloads.","PeriodicalId":307197,"journal":{"name":"2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PerComWorkshops56833.2023.10150381","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The rapid growth of Deep Learning (DL) has led to increasing demand for DL-as-a-Service. In this paradigm, DL inferences are served on demand through a serverless cloud provider, which manages the scaling of hardware resources to satisfy dynamic workloads. This is enticing to businesses because infrastructure management costs are lower than with dedicated on-site hosting. However, current serverless systems suffer from long cold starts, where requests are queued until a server can be initialized with the DL model; this is especially problematic because DL models are large. In addition, low-latency applications such as real-time fraud detection and algorithmic trading cannot tolerate the long inference times of CPU-only systems, which violate their deadlines. To meet low-latency requirements, current systems rely on over-provisioning expensive GPU resources, increasing the total cost of ownership for cloud service providers. In this work, we characterize the cold start problem in GPU-accelerated serverless systems. We then design and evaluate novel solutions based on two main techniques: remote memory pooling and hierarchical sourcing with locality-aware autoscaling, in which we exploit underutilized memory and network resources to store the DL model and prioritize sourcing it from existing host machines first, then remote host memory, and finally cloud storage. We demonstrate through simulations that these techniques achieve up to 19.3× and 1.4× speedups in 99th-percentile and median end-to-end latencies, respectively, compared to a baseline. Such speedups enable serverless systems to meet low-latency requirements despite dynamic workloads.
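To make the sourcing hierarchy concrete, the sketch below illustrates the lookup order the abstract describes: a cold-started worker first checks models already resident on its own host, then the memory pooled from peer hosts, and only then falls back to cloud storage. This is a minimal illustration under assumed names and structure, not the authors' implementation; the `MemoryPool` class, its fields, and the caching behavior are hypothetical.

```python
# Illustrative sketch (hypothetical, not the paper's code): hierarchical sourcing of a
# DL model during a cold start. The only idea taken from the abstract is the lookup
# order: local host memory -> pooled remote host memory -> cloud storage.

from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MemoryPool:
    """Models the tiers a worker can source a serialized DL model from."""
    local: Dict[str, bytes] = field(default_factory=dict)   # models already resident on this host
    remote: Dict[str, bytes] = field(default_factory=dict)  # underutilized memory pooled from peer hosts
    cloud: Dict[str, bytes] = field(default_factory=dict)   # durable but slow object storage

    def fetch(self, model_id: str) -> Optional[bytes]:
        # 1. Prefer the local host: no network transfer, fastest cold start.
        if model_id in self.local:
            return self.local[model_id]
        # 2. Fall back to remote host memory over the datacenter network.
        if model_id in self.remote:
            blob = self.remote[model_id]
            self.local[model_id] = blob  # cache locally for subsequent requests
            return blob
        # 3. Last resort: pull from cloud storage (slowest path) and cache locally.
        if model_id in self.cloud:
            blob = self.cloud[model_id]
            self.local[model_id] = blob
            return blob
        return None


if __name__ == "__main__":
    pool = MemoryPool(cloud={"resnet50": b"...serialized weights..."})
    # First request: cold start, the model only exists in cloud storage.
    assert pool.fetch("resnet50") is not None
    # Second request: served from local memory, avoiding the slow cloud fetch.
    assert "resnet50" in pool.local
```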