Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers

IF 5.5 · CAS Region 2 (Computer Science) · JCR Q1 (Computer Science, Information Systems) · IEEE Transactions on Services Computing, Vol. 17, No. 6, pp. 4310-4323 · Pub Date: 2024-09-18 · DOI: 10.1109/TSC.2024.3463429
Yaqiong Peng, Weiguo Gao, Haocheng Peng
{"title":"Serving DNN Inference With Fine-Grained Spatio-Temporal Sharing of GPU Servers","authors":"Yaqiong Peng;Weiguo Gao;Haocheng Peng","doi":"10.1109/TSC.2024.3463429","DOIUrl":null,"url":null,"abstract":"<italic>Deep Neural Networks</i>\n (DNNs) are commonly deployed as online inference services. To meet interactive latency requirements of requests, DNN services require the use of \n<italic>Graphics Processing Unit</i>\n (GPU) to improve their responsiveness. The unique characteristics of inference workloads pose new challenges to manage GPU resources. First, the GPU scheduler needs to carefully manage requests to meet their latency targets. Second, a single inference task often underutilizes GPU resources. Third, the fluctuating patterns of inference workloads pose difficulties in determining the resources allocated to each DNN model. Therefore, it is critical for the GPU scheduler to maximize GPU utilization by collocating multiple DNN models without violating the latency \n<italic>Service-Level Objectives</i>\n (SLOs) of requests. However, we find that existing works are not adequate for achieving this goal among latency-sensitive inference tasks. Hence, we propose FineST, a scheduling framework for serving DNNs with fine-grained spatio-temporal sharing of GPU inference servers. To maximize GPU utilization, FineST allocates intra-GPU computing resources from both spatial and temporal dimensions across DNNs in a cost-effective way, while predicting interference overheads under diverse consolidated executions for controlling SLO violation rates. Compared to a state-of-the-art work, FineST improves the peak throughput of serving heterogeneous DNNsby up to 64.7% under SLO constraints.","PeriodicalId":13255,"journal":{"name":"IEEE Transactions on Services Computing","volume":"17 6","pages":"4310-4323"},"PeriodicalIF":5.5000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Services Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10684028/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Deep Neural Networks (DNNs) are commonly deployed as online inference services. To meet the interactive latency requirements of requests, DNN services rely on Graphics Processing Units (GPUs) to improve their responsiveness. The unique characteristics of inference workloads pose new challenges for managing GPU resources. First, the GPU scheduler needs to carefully manage requests to meet their latency targets. Second, a single inference task often underutilizes GPU resources. Third, the fluctuating patterns of inference workloads make it difficult to determine the resources allocated to each DNN model. Therefore, it is critical for the GPU scheduler to maximize GPU utilization by co-locating multiple DNN models without violating the latency Service-Level Objectives (SLOs) of requests. However, we find that existing works are not adequate for achieving this goal among latency-sensitive inference tasks. Hence, we propose FineST, a scheduling framework for serving DNNs with fine-grained spatio-temporal sharing of GPU inference servers. To maximize GPU utilization, FineST allocates intra-GPU computing resources along both spatial and temporal dimensions across DNNs in a cost-effective way, while predicting interference overheads under diverse consolidated executions to control SLO violation rates. Compared to a state-of-the-art work, FineST improves the peak throughput of serving heterogeneous DNNs by up to 64.7% under SLO constraints.
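The abstract describes FineST only at a high level. As a rough illustration of the admission decision such a scheduler must make, the sketch below gives each co-located model a spatial share (a fraction of GPU compute) and a temporal share (a fraction of scheduling time slices), inflates its latency by a predicted interference slowdown, and admits a placement only if every model still meets its SLO. All names and the linear interference model are hypothetical assumptions for illustration, not FineST's actual design.

```python
# A minimal sketch (not FineST's implementation) of an SLO-aware admission
# check for spatio-temporal GPU sharing. Model, predict_slowdown, and the
# linear contention formula are all hypothetical.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    solo_latency_ms: float   # latency when running alone on the full GPU
    slo_ms: float            # latency Service-Level Objective
    spatial_share: float     # fraction of GPU compute resources, in (0, 1]
    temporal_share: float    # fraction of scheduling time slices, in (0, 1]

def predict_slowdown(model: Model, colocated: list) -> float:
    """Toy interference predictor: slowdown grows with the total spatial
    demand of the other co-located models (a real system would profile
    or learn this)."""
    contention = sum(m.spatial_share for m in colocated if m is not model)
    return 1.0 + 0.5 * contention

def predicted_latency_ms(model: Model, colocated: list) -> float:
    # Smaller spatial/temporal shares stretch latency proportionally,
    # further inflated by the predicted interference slowdown.
    base = model.solo_latency_ms / (model.spatial_share * model.temporal_share)
    return base * predict_slowdown(model, colocated)

def admissible(placement: list) -> bool:
    """True iff every model in this consolidated execution meets its SLO."""
    return all(predicted_latency_ms(m, placement) <= m.slo_ms for m in placement)

if __name__ == "__main__":
    plan = [
        Model("resnet50", solo_latency_ms=5.0, slo_ms=25.0,
              spatial_share=0.5, temporal_share=0.8),
        Model("bert-base", solo_latency_ms=8.0, slo_ms=50.0,
              spatial_share=0.5, temporal_share=0.8),
    ]
    print("admit placement:", admissible(plan))
```

A production scheduler would replace predict_slowdown with a profiled or learned interference model and search over candidate spatial/temporal allocations rather than checking a single fixed plan.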
Source journal
IEEE Transactions on Services Computing
Subject categories: Computer Science, Information Systems; Computer Science, Software Engineering
CiteScore: 11.50
Self-citation rate: 6.20%
Articles per year: 278
Review turnaround: >12 weeks
Journal description: IEEE Transactions on Services Computing encompasses the computing and software aspects of the science and technology of services innovation research and development. It places emphasis on algorithmic, mathematical, statistical, and computational methods central to services computing. Topics covered include Service Oriented Architecture, Web Services, Business Process Integration, Solution Performance Management, and Services Operations and Management. The transactions address mathematical foundations, security, privacy, agreement, contract, discovery, negotiation, collaboration, and quality of service for web services. It also covers areas such as composite web service creation, business and scientific applications, standards, utility models, business process modeling, integration, and collaboration in the realm of Services Computing.
Latest articles in this journal:
Intelligent Transaction Generation Control for Permissioned Blockchain-based Services
Large-Scale Service Mesh Orchestration with Probabilistic Routing in Cloud Data Centers
Federated Contrastive Learning for Cross-Domain Recommendation
LogNotion: Highlighting Massive Logs to Assist Human Reading and Decision Making
A Hybrid Optimization Framework for Age of Information Minimization in UAV-assisted MCS