{"title":"QoS-Aware Scheduling of Heterogeneous Servers for Inference in Deep Neural Networks","authors":"Zhou Fang, Tong Yu, O. Mengshoel, Rajesh K. Gupta","doi":"10.1145/3132847.3133045","DOIUrl":null,"url":null,"abstract":"Deep neural networks (DNNs) are popular in diverse fields such as computer vision and natural language processing. DNN inference tasks are emerging as a service provided by cloud computing environments. However, cloud-hosted DNN inference faces new challenges in workload scheduling for the best Quality of Service (QoS), due to dependence on batch size, model complexity and resource allocation. This paper represents the QoS metric as a utility function of response delay and inference accuracy. We first propose a simple and effective heuristic approach that keeps low response delay and satisfies the requirement on processing throughput. Then we describe an advanced deep reinforcement learning (RL) approach that learns to schedule from experience. The RL scheduler is trained to maximize QoS, using a set of system statuses as the input to the RL policy model. Our approach performs scheduling actions only when there are free GPUs, thus reduces scheduling overhead over common RL schedulers that run at every continuous time step. We evaluate the schedulers on a simulation platform and demonstrate the advantages of RL over heuristics.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"2 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3132847.3133045","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 30
Abstract
Deep neural networks (DNNs) are popular in diverse fields such as computer vision and natural language processing. DNN inference tasks are emerging as a service provided by cloud computing environments. However, cloud-hosted DNN inference faces new challenges in workload scheduling for the best Quality of Service (QoS), due to dependence on batch size, model complexity and resource allocation. This paper represents the QoS metric as a utility function of response delay and inference accuracy. We first propose a simple and effective heuristic approach that keeps low response delay and satisfies the requirement on processing throughput. Then we describe an advanced deep reinforcement learning (RL) approach that learns to schedule from experience. The RL scheduler is trained to maximize QoS, using a set of system statuses as the input to the RL policy model. Our approach performs scheduling actions only when there are free GPUs, thus reduces scheduling overhead over common RL schedulers that run at every continuous time step. We evaluate the schedulers on a simulation platform and demonstrate the advantages of RL over heuristics.