Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference

Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li
arXiv - CS - Performance · Journal Article · Published 2024-07-19 · doi: arxiv-2407.13996
Citations: 0

Abstract

Colocating high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services reduces the total cost of ownership (TCO) of GPU clusters. Limited by bottlenecks such as VRAM channel conflicts and PCIe bus contentions, existing GPU sharing solutions are unable to avoid resource conflicts among concurrently executing tasks, failing to achieve both low latency for LS tasks and high throughput for BE tasks. To bridge this gap, this paper presents Missile, a general GPU sharing solution for multi-tenant DNN inference on NVIDIA GPUs. Missile approximates fine-grained GPU hardware resource isolation between multiple LS and BE DNN tasks at software level. Through comprehensive reverse engineering, Missile first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs and eliminates VRAM channel conflicts using software-level cache coloring. It also isolates the PCIe bus and fairly allocates PCIe bandwidth using a completely fair scheduler. We evaluate 12 mainstream DNNs with synthetic and real-world workloads on four GPUs. The results show that compared to the state-of-the-art GPU sharing solutions, Missile reduces tail latency for LS services by up to ~50%, achieves up to 6.1x BE job throughput, and allocates PCIe bus bandwidth to tenants on-demand for optimal performance.
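The cache-coloring idea the abstract describes can be sketched roughly as follows. This is an illustrative toy, not Missile's implementation: the channel hash (an XOR of two address bits), the page size, and the channel count are invented stand-ins for the reverse-engineered NVIDIA VRAM channel mapping.

```python
PAGE_SIZE = 4096     # hypothetical allocation granularity
NUM_CHANNELS = 4     # hypothetical number of VRAM channels

def channel_of(addr: int) -> int:
    """Toy channel hash: fold a few physical-address bits into a channel index.
    The real mapping on NVIDIA GPUs is what Missile reverse engineers."""
    return ((addr >> 12) ^ (addr >> 14)) % NUM_CHANNELS

def color_pages(heap_base: int, heap_pages: int, allowed: set[int]) -> list[int]:
    """From a shared VRAM pool, return the pages whose channel falls in the
    tenant's color set, so each tenant only touches its own channels."""
    return [heap_base + i * PAGE_SIZE
            for i in range(heap_pages)
            if channel_of(heap_base + i * PAGE_SIZE) in allowed]

# Two tenants get disjoint channel sets, so their pages never share a channel.
ls_pages = color_pages(0x0, 64, {0, 1})   # latency-sensitive tenant
be_pages = color_pages(0x0, 64, {2, 3})   # best-effort tenant
assert not set(ls_pages) & set(be_pages)
```

Because every page hashes to exactly one channel, coloring partitions the pool: a page granted to the LS tenant can never cause a channel conflict with the BE tenant's traffic.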
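The completely-fair-scheduler-style PCIe arbitration mentioned above can be sketched as a weighted virtual-runtime queue, in the spirit of Linux CFS. The class name, weights, and chunk sizes below are illustrative assumptions, not the paper's actual arbiter.

```python
import heapq

class PCIeFairScheduler:
    """CFS-style arbiter: always grant the bus to the tenant with the smallest
    virtual runtime, charging each transfer as bytes / weight."""

    def __init__(self):
        self._queue = []      # min-heap of (vruntime, seq, tenant)
        self._seq = 0         # tie-breaker so heap entries stay comparable
        self._weights = {}
        self._vruntime = {}

    def add_tenant(self, tenant: str, weight: float) -> None:
        self._weights[tenant] = weight
        self._vruntime[tenant] = 0.0
        heapq.heappush(self._queue, (0.0, self._seq, tenant))
        self._seq += 1

    def next_transfer(self, chunk_bytes: int) -> str:
        """Pick the tenant with the least virtual runtime; higher-weight
        tenants accumulate vruntime slower, so they get more bandwidth."""
        _, _, tenant = heapq.heappop(self._queue)
        self._vruntime[tenant] += chunk_bytes / self._weights[tenant]
        heapq.heappush(self._queue, (self._vruntime[tenant], self._seq, tenant))
        self._seq += 1
        return tenant

sched = PCIeFairScheduler()
sched.add_tenant("LS", weight=3.0)   # latency-sensitive tenant gets a 3x share
sched.add_tenant("BE", weight=1.0)
grants = [sched.next_transfer(3 << 20) for _ in range(40)]
# LS receives roughly three 3 MiB chunks for every one granted to BE.
```

The weights realize the "on-demand" bandwidth split: doubling a tenant's weight halves the vruntime cost of each of its transfers, doubling its long-run share of the bus.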