Reducing the Cost of GPU Cold Starts in Serverless Deep Learning Inference Serving

Justin San Juan, B. Wong
{"title":"Reducing the Cost of GPU Cold Starts in Serverless Deep Learning Inference Serving","authors":"Justin San Juan, B. Wong","doi":"10.1109/PerComWorkshops56833.2023.10150381","DOIUrl":null,"url":null,"abstract":"The rapid growth of Deep Learning (DL) has led to increasing demand for DL-as-a-Service. In this paradigm, DL inferences are served on-demand through a serverless cloud provider, which manages the scaling of hardware resources to satisfy dynamic workloads. This is enticing to businesses due to lower infrastructure management costs compared to dedicated on-site hosting. However, current serverless systems suffer from long cold starts where requests are queued until a server can be initialized with the DL model, which is especially problematic due to large DL model sizes. In addition, low-latency demands such as in real-time fraud detection and algorithmic trading cause long inferences in CPU-only systems to violate deadlines. To tackle this, current systems rely on over-provisioning expensive GPU resources to meet low-latency requirements, thus increasing the total cost of ownership for cloud service providers. In this work, we characterize the cold start problem in GPU-accelerated serverless systems. We then design and evaluate novel solutions based on two main techniques. Namely, we propose remote memory pooling and hierarchical sourcing with locality-aware autoscaling where we exploit underutilized memory and network resources to store and prioritize sourcing the DL model from existing host machines over remote host memory then cloud storage. We demonstrate through simulations that these techniques can perform up to 19.3× and 1.4× speedup in 99th percentile and median end-to-end latencies respectively compared to a baseline. Such speedups enable serverless systems to meet low-latency requirements despite dynamic workloads.","PeriodicalId":307197,"journal":{"name":"2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PerComWorkshops56833.2023.10150381","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The rapid growth of Deep Learning (DL) has led to increasing demand for DL-as-a-Service. In this paradigm, DL inferences are served on demand through a serverless cloud provider, which manages the scaling of hardware resources to satisfy dynamic workloads. This is enticing to businesses because infrastructure management costs are lower than with dedicated on-site hosting. However, current serverless systems suffer from long cold starts, where requests are queued until a server can be initialized with the DL model; this is especially problematic because DL models are large. In addition, low-latency applications such as real-time fraud detection and algorithmic trading cannot tolerate the long inference times of CPU-only systems, which violate their deadlines. To meet low-latency requirements, current systems rely on over-provisioning expensive GPU resources, increasing the total cost of ownership for cloud service providers. In this work, we characterize the cold start problem in GPU-accelerated serverless systems. We then design and evaluate novel solutions based on two main techniques: remote memory pooling and hierarchical sourcing with locality-aware autoscaling, in which we exploit underutilized memory and network resources to store the DL model and prioritize sourcing it from existing host machines first, then remote host memory, and finally cloud storage. We demonstrate through simulations that these techniques achieve up to 19.3× and 1.4× speedups in 99th-percentile and median end-to-end latencies, respectively, compared to a baseline. Such speedups enable serverless systems to meet low-latency requirements despite dynamic workloads.
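To make the sourcing hierarchy concrete, the sketch below illustrates the lookup order the abstract describes: a cold-started worker first checks models already resident on its own host, then the memory pooled from peer hosts, and only then falls back to cloud storage. This is a minimal illustration under assumed names and structure, not the authors' implementation; the `MemoryPool` class, its fields, and the caching behavior are hypothetical.

```python
# Illustrative sketch (hypothetical, not the paper's code): hierarchical sourcing of a
# DL model during a cold start. The only idea taken from the abstract is the lookup
# order: local host memory -> pooled remote host memory -> cloud storage.

from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MemoryPool:
    """Models the tiers a worker can source a serialized DL model from."""
    local: Dict[str, bytes] = field(default_factory=dict)   # models already resident on this host
    remote: Dict[str, bytes] = field(default_factory=dict)  # underutilized memory pooled from peer hosts
    cloud: Dict[str, bytes] = field(default_factory=dict)   # durable but slow object storage

    def fetch(self, model_id: str) -> Optional[bytes]:
        # 1. Prefer the local host: no network transfer, fastest cold start.
        if model_id in self.local:
            return self.local[model_id]
        # 2. Fall back to remote host memory over the datacenter network.
        if model_id in self.remote:
            blob = self.remote[model_id]
            self.local[model_id] = blob  # cache locally for subsequent requests
            return blob
        # 3. Last resort: pull from cloud storage (slowest path) and cache locally.
        if model_id in self.cloud:
            blob = self.cloud[model_id]
            self.local[model_id] = blob
            return blob
        return None


if __name__ == "__main__":
    pool = MemoryPool(cloud={"resnet50": b"...serialized weights..."})
    # First request: cold start, the model only exists in cloud storage.
    assert pool.fetch("resnet50") is not None
    # Second request: served from local memory, avoiding the slow cloud fetch.
    assert "resnet50" in pool.local
```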