{"title":"Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers","authors":"Yaqiong Peng;Weiguo Gao;Haocheng Peng","doi":"10.1109/TSC.2024.3463429","DOIUrl":null,"url":null,"abstract":"<italic>Deep Neural Networks</i>\n (DNNs) are commonly deployed as online inference services. To meet interactive latency requirements of requests, DNN services require the use of \n<italic>Graphics Processing Unit</i>\n (GPU) to improve their responsiveness. The unique characteristics of inference workloads pose new challenges to manage GPU resources. First, the GPU scheduler needs to carefully manage requests to meet their latency targets. Second, a single inference task often underutilizes GPU resources. Third, the fluctuating patterns of inference workloads pose difficulties in determining the resources allocated to each DNN model. Therefore, it is critical for the GPU scheduler to maximize GPU utilization by collocating multiple DNN models without violating the latency \n<italic>Service-Level Objectives</i>\n (SLOs) of requests. However, we find that existing works are not adequate for achieving this goal among latency-sensitive inference tasks. Hence, we propose FineST, a scheduling framework for serving DNNs with fine-grained spatio-temporal sharing of GPU inference servers. To maximize GPU utilization, FineST allocates intra-GPU computing resources from both spatial and temporal dimensions across DNNs in a cost-effective way, while predicting interference overheads under diverse consolidated executions for controlling SLO violation rates. Compared to a state-of-the-art work, FineST improves the peak throughput of serving heterogeneous DNNsby up to 64.7% under SLO constraints.","PeriodicalId":13255,"journal":{"name":"IEEE Transactions on Services Computing","volume":"17 6","pages":"4310-4323"},"PeriodicalIF":5.5000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Services Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10684028/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Deep Neural Networks (DNNs) are commonly deployed as online inference services. To meet the interactive latency requirements of requests, DNN services require Graphics Processing Units (GPUs) to improve their responsiveness. The unique characteristics of inference workloads pose new challenges for managing GPU resources. First, the GPU scheduler needs to carefully manage requests to meet their latency targets. Second, a single inference task often underutilizes GPU resources. Third, the fluctuating patterns of inference workloads make it difficult to determine the resources allocated to each DNN model. Therefore, it is critical for the GPU scheduler to maximize GPU utilization by collocating multiple DNN models without violating the latency Service-Level Objectives (SLOs) of requests. However, we find that existing works are not adequate for achieving this goal among latency-sensitive inference tasks. Hence, we propose FineST, a scheduling framework for serving DNNs with fine-grained spatio-temporal sharing of GPU inference servers. To maximize GPU utilization, FineST allocates intra-GPU computing resources along both spatial and temporal dimensions across DNNs in a cost-effective way, while predicting interference overheads under diverse consolidated executions to control SLO violation rates. Compared to a state-of-the-art work, FineST improves the peak throughput of serving heterogeneous DNNs by up to 64.7% under SLO constraints.
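To make the idea of interference-aware spatio-temporal colocation concrete, the following is a minimal illustrative sketch, not FineST's actual algorithm. It assumes a hypothetical interference model (predict_slowdown) and hypothetical per-model fields (isolated latency, SLO, spatial compute share); a real scheduler would derive the interference prediction from profiled consolidated executions. Models are paired on one GPU only if their predicted colocated latencies still meet their SLOs.

```python
# Illustrative sketch of SLO-aware colocation on a single GPU.
# All names here (Model, predict_slowdown, plan_colocation) are hypothetical
# and do not correspond to FineST's actual implementation.
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Model:
    name: str
    isolated_latency_ms: float   # latency when the model has the whole GPU
    slo_ms: float                # per-request latency target
    compute_share: float         # fraction of GPU compute assigned spatially, in (0, 1]


def predict_slowdown(a: Model, b: Model) -> float:
    """Toy interference model: slowdown grows with combined compute demand.

    A real system would learn this from profiling consolidated executions.
    """
    pressure = a.compute_share + b.compute_share
    return 1.0 + max(0.0, pressure - 1.0) * 0.5 + 0.1  # 0.1 = baseline temporal-sharing cost


def colocated_latency(m: Model, slowdown: float) -> float:
    # Latency scales inversely with the spatial share, times the interference slowdown.
    return m.isolated_latency_ms / m.compute_share * slowdown


def plan_colocation(models: List[Model]) -> List[Tuple[str, str]]:
    """Greedily pair models whose predicted latencies still meet both SLOs."""
    pairs, used = [], set()
    for a, b in combinations(models, 2):
        if a.name in used or b.name in used:
            continue
        s = predict_slowdown(a, b)
        if colocated_latency(a, s) <= a.slo_ms and colocated_latency(b, s) <= b.slo_ms:
            pairs.append((a.name, b.name))
            used.update({a.name, b.name})
    return pairs


if __name__ == "__main__":
    workload = [
        Model("resnet50", isolated_latency_ms=8.0, slo_ms=25.0, compute_share=0.5),
        Model("bert-base", isolated_latency_ms=12.0, slo_ms=40.0, compute_share=0.5),
        Model("vgg16", isolated_latency_ms=10.0, slo_ms=15.0, compute_share=0.6),
    ]
    print(plan_colocation(workload))  # e.g. [('resnet50', 'bert-base')]
```

The sketch captures only the decision structure described in the abstract: spatial partitioning (compute_share), temporal sharing (the baseline slowdown term), and an interference prediction gating colocation against SLOs.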
Journal Introduction:
IEEE Transactions on Services Computing encompasses the computing and software aspects of the science and technology of services innovation research and development. It places emphasis on algorithmic, mathematical, statistical, and computational methods central to services computing. Topics covered include Service Oriented Architecture, Web Services, Business Process Integration, Solution Performance Management, and Services Operations and Management. The transactions address mathematical foundations, security, privacy, agreement, contract, discovery, negotiation, collaboration, and quality of service for web services. It also covers areas like composite web service creation, business and scientific applications, standards, utility models, business process modeling, integration, collaboration, and more in the realm of Services Computing.