Jiechen Zhao, Ran Shu, Katie Lim, Zewen Fan, Thomas Anderson, Mingyu Gao, Natalie Enright Jerger
Accelerator-as-a-Service in Public Clouds: An Intra-Host Traffic Management View for Performance Isolation in the Wild
arXiv:2407.10098 · arXiv - CS - Performance · Published 2024-07-14
Citations: 0
Abstract
I/O devices in public clouds increasingly integrate hardware accelerators, e.g., AWS Nitro, Azure FPGA, and Nvidia BlueField. However, such specialized compute (1) is not explicitly accessible to cloud users with performance guarantees and (2) cannot be leveraged simultaneously by both providers and users, unlike general-purpose compute (e.g., CPUs). Through ten observations, we show that the fundamental difficulty in democratizing accelerators is insufficient performance isolation support. The key obstacles to enforcing accelerator isolation are (1) too many unknown traffic patterns in public clouds and (2) too many possible contention sources in the datapath. In this work, instead of scheduling such complex traffic on the fly and augmenting isolation support in each system component, we propose to model traffic as network flows and proactively reshape it to avoid unpredictable contention. We discuss the implications of our findings for the design of future I/O management stacks and device interfaces.
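The "proactively reshape traffic" idea can be illustrated with a classic token-bucket shaper. This is a standard rate-limiting technique, not the paper's specific mechanism; the class name, rates, and usage below are illustrative assumptions. The point is that bounding each flow's rate and burst up front makes the load seen by shared datapath components predictable, rather than relying on reactive scheduling under contention.

```python
import time

class TokenBucket:
    """Classic token-bucket shaper (illustrative, not the paper's design):
    admit traffic at a bounded average rate and burst size so that
    downstream shared resources see a predictable load."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s     # sustained rate (tokens refill at this rate)
        self.capacity = burst_bytes      # maximum burst the bucket can hold
        self.tokens = burst_bytes        # start with a full bucket
        self.last = time.monotonic()

    def try_send(self, size_bytes):
        """Return True if a request of size_bytes conforms to the shaped
        rate right now; otherwise the caller must pace (delay) it."""
        now = time.monotonic()
        # Refill tokens proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size_bytes:
            self.tokens -= size_bytes
            return True
        return False

# Hypothetical usage: shape one tenant's traffic toward an accelerator
# to 1 GB/s sustained with at most 64 KB of burst.
shaper = TokenBucket(rate_bytes_per_s=1_000_000_000, burst_bytes=65536)
admitted = shaper.try_send(4096)   # small request fits within the burst
rejected = shaper.try_send(10**9)  # oversized burst exceeds the bucket
```

A real intra-host shaper would sit at the device interface and operate on DMA descriptors or PCIe transactions rather than Python objects, but the admission logic is the same.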