Ziliang Wang;Shiyi Zhu;Jianguo Li;Wei Jiang;K. K. Ramakrishnan;Meng Yan;Xiaohong Zhang;Alex X. Liu
Journal: IEEE/ACM Transactions on Networking, vol. 32, no. 5, pp. 3961-3976
Publication date: 2024-03-31
Publication type: Journal Article
DOI: 10.1109/TNET.2024.3400953
URL: https://ieeexplore.ieee.org/document/10542703/
Impact factor: 3.6 (JCR Q2, Computer Science, Hardware & Architecture)
Citations: 0
DeepScaling: Autoscaling Microservices With Stable CPU Utilization for Large Scale Production Cloud Systems
Abstract
Cloud service providers often over-provision resources to meet their Service Level Objectives (SLOs) by setting low CPU utilization targets. This wastes resources and noticeably increases power consumption in large-scale cloud deployments. To address this issue, this paper presents DeepScaling, a solution that minimizes resource cost while ensuring SLO requirements are met in a dynamic, large-scale, production microservice-based system. DeepScaling introduces three components that adaptively refine the target CPU utilization of servers in the data center and hold it at a stable value, meeting SLO constraints while using the minimum amount of system resources. First, DeepScaling forecasts the workload of each service using a Spatio-temporal Graph Neural Network. Second, it estimates CPU utilization with a Deep Neural Network, considering factors such as periodic tasks and traffic. Finally, it uses a modified Deep Q-Network (DQN) to generate an autoscaling policy that controls service resources to maximize service stability while meeting SLOs. Evaluation in Ant Group’s large-scale cloud environment shows that DeepScaling outperforms state-of-the-art autoscaling approaches in both performance stability and resource savings. Deployed in a real-world environment of more than 1,900 microservices, DeepScaling saves the provisioning of over 100,000 CPU cores per day on average.
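The forecast, estimate, and scale loop described in the abstract can be illustrated with a deliberately simplified sketch. This is not the authors' implementation: a fixed forecast value stands in for the Spatio-temporal Graph Neural Network, a hypothetical linear function stands in for the Deep Neural Network CPU estimator, and a plain linear search stands in for the modified DQN policy. Every function name and constant below is an assumption made for illustration only.

```python
def estimate_cpu_utilization(qps_per_instance: float,
                             cpu_per_request: float = 0.002,
                             baseline: float = 0.05) -> float:
    """Hypothetical linear CPU model (assumption, not the paper's DNN).

    Returns the estimated CPU utilization of one instance as a
    fraction in [0, 1] for a given per-instance request rate.
    """
    return min(1.0, baseline + cpu_per_request * qps_per_instance)


def instances_for_target(forecast_qps: float,
                         target_cpu: float,
                         max_instances: int = 10_000) -> int:
    """Smallest instance count whose estimated CPU stays at or below the target.

    A linear search stands in for the learned autoscaling policy.
    Load is assumed to be spread evenly across instances.
    """
    for n in range(1, max_instances + 1):
        if estimate_cpu_utilization(forecast_qps / n) <= target_cpu:
            return n
    return max_instances  # cap reached; the target cannot be met


# A forecast of 10,000 requests/s with a 50% CPU target.
needed = instances_for_target(forecast_qps=10_000, target_cpu=0.5)
print(needed)  # 45
```

Under this toy model, raising the target CPU utilization lowers the number of instances required, which is the resource-saving lever the abstract refers to; a lower target provisions more instances for the same forecast workload.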
About the journal:
The IEEE/ACM Transactions on Networking’s high-level objective is to publish high-quality, original research results derived from theoretical or experimental exploration of the area of communication/computer networking, covering all sorts of information transport networks over all sorts of physical layer technologies, both wireline (all kinds of guided media: e.g., copper, optical) and wireless (e.g., radio-frequency, acoustic (e.g., underwater), infra-red), or hybrids of these. The journal welcomes applied contributions reporting on novel experiences and experiments with actual systems.