HephaestusForge: Optimal microservice deployment across the Compute Continuum via Reinforcement Learning

IF 6.2 | Zone 2, Computer Science | Q1, COMPUTER SCIENCE, THEORY & METHODS | Future Generation Computer Systems-The International Journal of Escience | Pub Date: 2025-01-01 | DOI: 10.1016/j.future.2024.107680

José Santos, Mattia Zaccarini, Filippo Poltronieri, Mauro Tortonesi, Cesare Stefanelli, Nicola Di Cicco, Filip De Turck

Future Generation Computer Systems-The International Journal of Escience, volume 166, Article 107680. https://www.sciencedirect.com/science/article/pii/S0167739X24006447


Abstract

With the advent of containerization technologies, microservices have revolutionized application deployment by converting old monolithic software into a group of loosely coupled containers, aiming to offer greater flexibility and improve operational efficiency. This transition has made applications more complex, with each consisting of tens to hundreds of microservices. Designing effective orchestration mechanisms remains a crucial challenge, especially for emerging distributed cloud paradigms such as the Compute Continuum (CC). Orchestration across multiple clusters is still not extensively explored in the literature, since most works consider single-cluster scenarios. In the CC scenario, the orchestrator must decide the optimal location for each microservice and whether its instances are deployed all together or spread across different clusters, which significantly increases orchestration complexity. This paper addresses orchestration in a containerized CC environment by studying a Reinforcement Learning (RL) approach for efficient microservice deployment in clusters managed by Kubernetes (K8s), a widely adopted container orchestration platform. This work demonstrates the effectiveness of RL in achieving near-optimal deployment schemes under dynamic conditions, where network latency and resource capacity fluctuate. We extensively evaluate a multi-objective reward function that aims to minimize overall latency, reduce deployment costs, and promote fair distribution of microservice instances, and we compare it against typical heuristic-based approaches. The results from an implemented OpenAI Gym framework, named HephaestusForge, show that RL algorithms achieve minimal rejection rates (as low as 0.002%, 90x lower than the baseline Karmada scheduler). Cost-aware strategies result in lower deployment costs (2.5 units), and latency-aware functions achieve lower latency (268–290 ms), improving by 1.5x and 1.3x, respectively, over the best-performing baselines.
HephaestusForge is available in a public open-source repository, allowing researchers to validate their own placement algorithms. This study also highlights the adaptability of the DeepSets (DS) neural network in optimizing microservice placement across diverse multi-cluster setups without retraining. The DS neural network can handle inputs and outputs as arbitrarily sized sets, enabling the RL algorithm to learn a policy not bound to a fixed number of clusters.
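
To illustrate why a set-based network is not tied to a fixed number of clusters, the following is a minimal sketch of a DeepSets-style scorer in plain Python. It is not the authors' implementation: the four cluster features, the layer sizes, the random untrained weights, and the names make_mlp, deepsets_scores, and placement_probs are all assumptions made for illustration. Each cluster is embedded by a shared network, the embeddings are sum-pooled into an order-independent summary, and each cluster is then scored from its own embedding plus that summary.

```python
import math
import random


def make_mlp(sizes, rng):
    """Build random dense layers (hypothetical, untrained weights)."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def mlp(layers, x):
    """Apply the layers with tanh on every layer except the last."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def deepsets_scores(phi, rho, clusters):
    """Return one score per cluster, for any number of clusters."""
    # The same network phi embeds every cluster independently.
    embeddings = [mlp(phi, features) for features in clusters]
    # Sum pooling gives a summary that ignores cluster order
    # (math.fsum keeps the result independent of summation order).
    pooled = [math.fsum(column) for column in zip(*embeddings)]
    # Each cluster is scored from its own embedding plus the shared summary.
    return [mlp(rho, emb + pooled)[0] for emb in embeddings]


def placement_probs(phi, rho, clusters):
    """Softmax over cluster scores: a distribution over placement targets."""
    scores = deepsets_scores(phi, rho, clusters)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
phi = make_mlp([4, 8, 8], rng)    # 4 assumed features per cluster
rho = make_mlp([16, 8, 1], rng)   # input: 8 (own embedding) + 8 (pooled)

# Assumed features: [free CPU, free memory, latency, cost], all normalized.
three_clusters = [[0.5, 0.3, 0.1, 0.2],
                  [0.9, 0.8, 0.4, 0.6],
                  [0.1, 0.2, 0.9, 0.1]]
print(placement_probs(phi, rho, three_clusters))
```

The same phi and rho can be applied to a list of five or fifty clusters without changing any weight shapes, and reordering the input clusters only reorders the output probabilities. In an RL setting the weights would be trained by the chosen algorithm rather than drawn at random as here.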


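Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore: 19.90
Self-citation rate: 2.70%
Articles published: 376
Review time: 10.6 months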

About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.