{"title":"Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers","authors":"Yaqiong Peng;Weiguo Gao;Haocheng Peng","doi":"10.1109/TSC.2024.3463429","DOIUrl":null,"url":null,"abstract":"<italic>Deep Neural Networks</i>\n (DNNs) are commonly deployed as online inference services. To meet interactive latency requirements of requests, DNN services require the use of \n<italic>Graphics Processing Unit</i>\n (GPU) to improve their responsiveness. The unique characteristics of inference workloads pose new challenges to manage GPU resources. First, the GPU scheduler needs to carefully manage requests to meet their latency targets. Second, a single inference task often underutilizes GPU resources. Third, the fluctuating patterns of inference workloads pose difficulties in determining the resources allocated to each DNN model. Therefore, it is critical for the GPU scheduler to maximize GPU utilization by collocating multiple DNN models without violating the latency \n<italic>Service-Level Objectives</i>\n (SLOs) of requests. However, we find that existing works are not adequate for achieving this goal among latency-sensitive inference tasks. Hence, we propose FineST, a scheduling framework for serving DNNs with fine-grained spatio-temporal sharing of GPU inference servers. To maximize GPU utilization, FineST allocates intra-GPU computing resources from both spatial and temporal dimensions across DNNs in a cost-effective way, while predicting interference overheads under diverse consolidated executions for controlling SLO violation rates. Compared to a state-of-the-art work, FineST improves the peak throughput of serving heterogeneous DNNsby up to 64.7% under SLO constraints.","PeriodicalId":13255,"journal":{"name":"IEEE Transactions on Services Computing","volume":"17 6","pages":"4310-4323"},"PeriodicalIF":5.5000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Services Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10684028/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Deep Neural Networks (DNNs) are commonly deployed as online inference services. To meet the interactive latency requirements of requests, DNN services require Graphics Processing Units (GPUs) to improve their responsiveness. The unique characteristics of inference workloads pose new challenges for managing GPU resources. First, the GPU scheduler needs to carefully manage requests to meet their latency targets. Second, a single inference task often underutilizes GPU resources. Third, the fluctuating patterns of inference workloads make it difficult to determine the resources allocated to each DNN model. Therefore, it is critical for the GPU scheduler to maximize GPU utilization by collocating multiple DNN models without violating the latency Service-Level Objectives (SLOs) of requests. However, we find that existing works are not adequate for achieving this goal among latency-sensitive inference tasks. Hence, we propose FineST, a scheduling framework for serving DNNs with fine-grained spatio-temporal sharing of GPU inference servers. To maximize GPU utilization, FineST allocates intra-GPU computing resources along both spatial and temporal dimensions across DNNs in a cost-effective way, while predicting interference overheads under diverse consolidated executions to control SLO violation rates. Compared to a state-of-the-art work, FineST improves the peak throughput of serving heterogeneous DNNs by up to 64.7% under SLO constraints.
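To make the idea of interference-aware spatio-temporal colocation concrete, the following is a minimal illustrative sketch, not FineST's actual algorithm. It assumes a hypothetical interference model (predict_slowdown) and hypothetical per-model fields (isolated latency, SLO, spatial compute share); a real scheduler would derive the interference prediction from profiled consolidated executions. Models are paired on one GPU only if their predicted colocated latencies still meet their SLOs.

```python
# Illustrative sketch of SLO-aware colocation on a single GPU.
# All names here (Model, predict_slowdown, plan_colocation) are hypothetical
# and do not correspond to FineST's actual implementation.
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Model:
    name: str
    isolated_latency_ms: float   # latency when the model has the whole GPU
    slo_ms: float                # per-request latency target
    compute_share: float         # fraction of GPU compute assigned spatially, in (0, 1]


def predict_slowdown(a: Model, b: Model) -> float:
    """Toy interference model: slowdown grows with combined compute demand.

    A real system would learn this from profiling consolidated executions.
    """
    pressure = a.compute_share + b.compute_share
    return 1.0 + max(0.0, pressure - 1.0) * 0.5 + 0.1  # 0.1 = baseline temporal-sharing cost


def colocated_latency(m: Model, slowdown: float) -> float:
    # Latency scales inversely with the spatial share, times the interference slowdown.
    return m.isolated_latency_ms / m.compute_share * slowdown


def plan_colocation(models: List[Model]) -> List[Tuple[str, str]]:
    """Greedily pair models whose predicted latencies still meet both SLOs."""
    pairs, used = [], set()
    for a, b in combinations(models, 2):
        if a.name in used or b.name in used:
            continue
        s = predict_slowdown(a, b)
        if colocated_latency(a, s) <= a.slo_ms and colocated_latency(b, s) <= b.slo_ms:
            pairs.append((a.name, b.name))
            used.update({a.name, b.name})
    return pairs


if __name__ == "__main__":
    workload = [
        Model("resnet50", isolated_latency_ms=8.0, slo_ms=25.0, compute_share=0.5),
        Model("bert-base", isolated_latency_ms=12.0, slo_ms=40.0, compute_share=0.5),
        Model("vgg16", isolated_latency_ms=10.0, slo_ms=15.0, compute_share=0.6),
    ]
    print(plan_colocation(workload))  # e.g. [('resnet50', 'bert-base')]
```

The sketch captures only the decision structure described in the abstract: spatial partitioning (compute_share), temporal sharing (the baseline slowdown term), and an interference prediction gating colocation against SLOs.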
Journal Introduction:
IEEE Transactions on Services Computing encompasses the computing and software aspects of the science and technology of services innovation research and development. It places emphasis on algorithmic, mathematical, statistical, and computational methods central to services computing. Topics covered include Service Oriented Architecture, Web Services, Business Process Integration, Solution Performance Management, and Services Operations and Management. The transactions address mathematical foundations, security, privacy, agreement, contract, discovery, negotiation, collaboration, and quality of service for web services. It also covers areas like composite web service creation, business and scientific applications, standards, utility models, business process modeling, integration, collaboration, and more in the realm of Services Computing.