Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference

Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li
arXiv - CS - Performance · Journal Article · Published 2024-07-19 · doi: arxiv-2407.13996
Citations: 0

Abstract

Colocating high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services reduces the total cost of ownership (TCO) of GPU clusters. Limited by bottlenecks such as VRAM channel conflicts and PCIe bus contentions, existing GPU sharing solutions are unable to avoid resource conflicts among concurrently executing tasks, failing to achieve both low latency for LS tasks and high throughput for BE tasks. To bridge this gap, this paper presents Missile, a general GPU sharing solution for multi-tenant DNN inference on NVIDIA GPUs. Missile approximates fine-grained GPU hardware resource isolation between multiple LS and BE DNN tasks at software level. Through comprehensive reverse engineering, Missile first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs and eliminates VRAM channel conflicts using software-level cache coloring. It also isolates the PCIe bus and fairly allocates PCIe bandwidth using a completely fair scheduler. We evaluate 12 mainstream DNNs with synthetic and real-world workloads on four GPUs. The results show that compared to the state-of-the-art GPU sharing solutions, Missile reduces tail latency for LS services by up to ~50%, achieves up to 6.1x BE job throughput, and allocates PCIe bus bandwidth to tenants on-demand for optimal performance.
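The cache-coloring idea the abstract describes can be sketched roughly as follows. This is an illustrative toy, not Missile's implementation: the channel hash (an XOR of two address bits), the page size, and the channel count are invented stand-ins for the reverse-engineered NVIDIA VRAM channel mapping.

```python
PAGE_SIZE = 4096     # hypothetical allocation granularity
NUM_CHANNELS = 4     # hypothetical number of VRAM channels

def channel_of(addr: int) -> int:
    """Toy channel hash: fold a few physical-address bits into a channel index.
    The real mapping on NVIDIA GPUs is what Missile reverse engineers."""
    return ((addr >> 12) ^ (addr >> 14)) % NUM_CHANNELS

def color_pages(heap_base: int, heap_pages: int, allowed: set[int]) -> list[int]:
    """From a shared VRAM pool, return the pages whose channel falls in the
    tenant's color set, so each tenant only touches its own channels."""
    return [heap_base + i * PAGE_SIZE
            for i in range(heap_pages)
            if channel_of(heap_base + i * PAGE_SIZE) in allowed]

# Two tenants get disjoint channel sets, so their pages never share a channel.
ls_pages = color_pages(0x0, 64, {0, 1})   # latency-sensitive tenant
be_pages = color_pages(0x0, 64, {2, 3})   # best-effort tenant
assert not set(ls_pages) & set(be_pages)
```

Because every page hashes to exactly one channel, coloring partitions the pool: a page granted to the LS tenant can never cause a channel conflict with the BE tenant's traffic.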
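The completely-fair-scheduler-style PCIe arbitration mentioned above can be sketched as a weighted virtual-runtime queue, in the spirit of Linux CFS. The class name, weights, and chunk sizes below are illustrative assumptions, not the paper's actual arbiter.

```python
import heapq

class PCIeFairScheduler:
    """CFS-style arbiter: always grant the bus to the tenant with the smallest
    virtual runtime, charging each transfer as bytes / weight."""

    def __init__(self):
        self._queue = []      # min-heap of (vruntime, seq, tenant)
        self._seq = 0         # tie-breaker so heap entries stay comparable
        self._weights = {}
        self._vruntime = {}

    def add_tenant(self, tenant: str, weight: float) -> None:
        self._weights[tenant] = weight
        self._vruntime[tenant] = 0.0
        heapq.heappush(self._queue, (0.0, self._seq, tenant))
        self._seq += 1

    def next_transfer(self, chunk_bytes: int) -> str:
        """Pick the tenant with the least virtual runtime; higher-weight
        tenants accumulate vruntime slower, so they get more bandwidth."""
        _, _, tenant = heapq.heappop(self._queue)
        self._vruntime[tenant] += chunk_bytes / self._weights[tenant]
        heapq.heappush(self._queue, (self._vruntime[tenant], self._seq, tenant))
        self._seq += 1
        return tenant

sched = PCIeFairScheduler()
sched.add_tenant("LS", weight=3.0)   # latency-sensitive tenant gets a 3x share
sched.add_tenant("BE", weight=1.0)
grants = [sched.next_transfer(3 << 20) for _ in range(40)]
# LS receives roughly three 3 MiB chunks for every one granted to BE.
```

The weights realize the "on-demand" bandwidth split: doubling a tenant's weight halves the vruntime cost of each of its transfers, doubling its long-run share of the bus.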