A DNN Inference Latency-aware GPU Power Management Scheme
Junyeol Yu, Jongseok Kim, Euiseong Seo
2021 IEEE 3rd Eurasia Conference on IOT, Communication and Engineering (ECICE), published 2021-10-29
DOI: 10.1109/ECICE52819.2021.9645654
Graphics Processing Units (GPUs) are widely used for both deep learning training and inference due to their high processing speed and programmability. Modern GPUs dynamically adjust their clock frequency according to a built-in power management scheme. Under the default scheme, however, the clock frequency is determined solely by the utilization rate, with no awareness of the target latency SLO, which leads to unnecessarily high clock frequencies and excessive power consumption. In this paper, we propose a method that increases the energy efficiency of a GPU through performance scaling while still satisfying the latency SLO. It dynamically monitors the queue length of the inference engine to determine the lowest clock frequency that can satisfy the latency SLO. We implemented an efficient inference service by applying GPU DVFS to an existing inference engine. In experiments running inference over image classification models on three types of GPUs, the 99th-percentile latency under our method satisfied the latency SLO in every case while exhibiting better power efficiency. In particular, when serving the VGG19 model on a Titan RTX at the same request rates, GPU energy consumption is reduced by up to 49.5% compared to the default clock management.
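The core policy described above — monitor the inference engine's queue length and pick the lowest GPU clock that still meets the latency SLO — can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the clock levels, per-request service times, and the linear latency model are hypothetical placeholders (in practice, service times would be profiled per model per GPU, and the chosen clock would be applied through a driver interface such as NVML's locked-clocks API).

```python
# Hypothetical offline profile: GPU core clock (MHz) -> per-request
# inference time (ms) for a given model on a given GPU. Real values
# would come from profiling, as the paper's approach implies.
SERVICE_TIME_MS = {
    600: 40.0,
    900: 27.0,
    1200: 20.0,
    1500: 16.0,
    1800: 14.0,
}

def select_clock(queue_length: int, slo_ms: float) -> int:
    """Pick the lowest clock whose projected latency meets the SLO.

    Simple model: a newly arriving request waits for every queued
    request plus its own execution, so its projected latency is
    (queue_length + 1) * service_time at the candidate clock.
    """
    for clock in sorted(SERVICE_TIME_MS):
        projected = (queue_length + 1) * SERVICE_TIME_MS[clock]
        if projected <= slo_ms:
            return clock
    # SLO unreachable even at the fastest clock: run flat out.
    return max(SERVICE_TIME_MS)
```

With a 50 ms SLO, an empty queue lets the GPU idle down to the slowest clock (1 x 40 ms = 40 ms), while a backlog of two requests forces 1500 MHz (3 x 16 ms = 48 ms); a deep backlog saturates at the top clock. A real controller would re-run this selection periodically as the queue length changes.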