DeepScaling: Autoscaling Microservices With Stable CPU Utilization for Large Scale Production Cloud Systems

Impact factor: 3.6 · Zone 3 (Computer Science) · JCR Q2 (Computer Science, Hardware & Architecture) · IEEE/ACM Transactions on Networking · Pub Date: 2024-03-31 · DOI: 10.1109/TNET.2024.3400953

Ziliang Wang;Shiyi Zhu;Jianguo Li;Wei Jiang;K. K. Ramakrishnan;Meng Yan;Xiaohong Zhang;Alex X. Liu

IEEE/ACM Transactions on Networking, vol. 32, no. 5, pp. 3961-3976. Full text: https://ieeexplore.ieee.org/document/10542703/

Abstract

Cloud service providers often over-provision resources to meet their Service Level Objectives (SLOs) by setting low CPU utilization targets. This wastes resources and noticeably increases power consumption in large-scale cloud deployments. To address this issue, this paper presents DeepScaling, a solution for minimizing resource cost while ensuring that SLO requirements are met in a dynamic, large-scale, production microservice-based system. DeepScaling introduces three components that adaptively refine the target CPU utilization of servers in the data center and maintain it at a stable value, so that SLO constraints are met with the minimum amount of system resources. First, DeepScaling forecasts the workload of each service using a Spatio-temporal Graph Neural Network. Second, it estimates CPU utilization with a Deep Neural Network, taking into account factors such as periodic tasks and traffic. Finally, it uses a modified Deep Q-Network (DQN) to generate an autoscaling policy that controls service resources to maximize service stability while meeting SLOs. An evaluation of DeepScaling in Ant Group’s large-scale cloud environment shows that it outperforms state-of-the-art autoscaling approaches in both performance stability and resource savings. Deployed in a real-world environment of more than 1900 microservices, DeepScaling saves the provisioning of over 100,000 CPU cores per day on average.
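The three stages named in the abstract (forecast the workload, estimate CPU utilization, choose a scaling action) can be illustrated with a deliberately simplified sketch. Everything below is an assumption made for illustration and is not DeepScaling's method: a seasonal-naive average stands in for the Spatio-temporal Graph Neural Network, a linear model stands in for the Deep Neural Network, and a closed-form solve for the instance count stands in for the modified DQN policy. The function names and the parameters `cost_per_request` and `baseline` are hypothetical.

```python
import math
from typing import Sequence


def forecast_workload(history: Sequence[float], period: int) -> float:
    """Seasonal-naive stand-in for the workload forecaster (assumption).

    Predicts the next value as the mean of past observations that fall
    at the same phase of the period as the next time step.
    """
    if period <= 0 or len(history) < period:
        raise ValueError("need at least one full period of history")
    same_phase = history[-period::-period]
    return sum(same_phase) / len(same_phase)


def estimate_cpu_utilization(
    qps_per_instance: float, cost_per_request: float, baseline: float
) -> float:
    """Linear stand-in for the utilization estimator (assumption).

    `baseline` models load that does not depend on traffic, such as
    periodic tasks; `cost_per_request` is the utilization added per
    unit of traffic on one instance.
    """
    return baseline + cost_per_request * qps_per_instance


def instances_for_target(
    total_qps: float,
    target_util: float,
    cost_per_request: float,
    baseline: float,
    min_instances: int = 1,
) -> int:
    """Closed-form stand-in for the scaling policy (assumption).

    Returns the smallest instance count whose estimated utilization,
    under the linear model above, does not exceed `target_util`.
    """
    if target_util <= baseline:
        raise ValueError("target utilization must exceed the baseline load")
    qps_capacity = (target_util - baseline) / cost_per_request
    return max(min_instances, math.ceil(total_qps / qps_capacity))


if __name__ == "__main__":
    # Hypothetical traffic history with a period of 3 time steps.
    history = [1200, 800, 600, 1400, 900, 700]
    predicted = forecast_workload(history, period=3)
    count = instances_for_target(
        predicted, target_util=0.5, cost_per_request=0.001, baseline=0.1
    )
    print(predicted, count)
```

The point of the sketch is the data flow: a predicted workload feeds a utilization model, and the scaling decision is whatever keeps the modeled utilization at the chosen target. A higher target yields fewer instances for the same traffic, which is the source of the resource savings the abstract describes.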

Source journal

IEEE/ACM Transactions on Networking (Engineering and Technology: Telecommunications)

CiteScore: 8.20

Self-citation rate: 5.40%

Articles published: 246

Review time: 4-8 weeks

About the journal: The high-level objective of IEEE/ACM Transactions on Networking is to publish high-quality, original research results derived from theoretical or experimental exploration of communication and computer networking. It covers all kinds of information transport networks over all kinds of physical layer technologies, both wireline (all kinds of guided media, e.g., copper, optical) and wireless (e.g., radio-frequency, acoustic (e.g., underwater), infra-red), or hybrids of these. The journal welcomes applied contributions reporting on novel experiences and experiments with actual systems.
