HephaestusForge: Optimal microservice deployment across the Compute Continuum via Reinforcement Learning

IF 6.2 · Q1, Computer Science, Theory & Methods · Future Generation Computer Systems: The International Journal of eScience · Pub Date: 2025-01-01 · DOI: 10.1016/j.future.2024.107680
José Santos, Mattia Zaccarini, Filippo Poltronieri, Mauro Tortonesi, Cesare Stefanelli, Nicola Di Cicco, Filip De Turck
Citations: 0

Abstract

With the advent of containerization technologies, microservices have revolutionized application deployment by converting legacy monolithic software into a group of loosely coupled containers, aiming to offer greater flexibility and improve operational efficiency. This transition has made applications more complex, with a single application consisting of tens to hundreds of microservices. Designing effective orchestration mechanisms remains a crucial challenge, especially for emerging distributed cloud paradigms such as the Compute Continuum (CC). Orchestration across multiple clusters is still not extensively explored in the literature, since most works consider single-cluster scenarios. In the CC scenario, the orchestrator must decide the optimal location for each microservice, choosing whether instances are deployed together in a single cluster or spread across different clusters, which significantly increases orchestration complexity. This paper addresses orchestration in a containerized CC environment by studying a Reinforcement Learning (RL) approach for efficient microservice deployment in Kubernetes (K8s) clusters, a widely adopted container orchestration platform. This work demonstrates the effectiveness of RL in achieving near-optimal deployment schemes under dynamic conditions, where network latency and resource capacity fluctuate. We extensively evaluate a multi-objective reward function that aims to minimize overall latency, reduce deployment costs, and promote fair distribution of microservice instances, and we compare it against typical heuristic-based approaches. The results from an implemented OpenAI Gym framework, named HephaestusForge, show that RL algorithms achieve minimal rejection rates (as low as 0.002%, a 90x reduction over the baseline Karmada scheduler). Cost-aware strategies result in lower deployment costs (2.5 units), and latency-aware functions achieve lower latency (268–290 ms), improving by 1.5x and 1.3x, respectively, over the best-performing baselines.
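To make the multi-objective reward concrete, the following is a minimal sketch of how the three objectives named above (latency, deployment cost, and fair instance distribution) could be combined into a single scalar reward. The weights, normalization bounds, and the use of Jain's fairness index are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Hypothetical multi-objective reward: weighted sum of normalized
# latency, cost, and fairness terms. All constants are assumptions.
def reward(latency_ms, cost_units, instances_per_cluster,
           w_latency=0.5, w_cost=0.3, w_fair=0.2,
           max_latency=1000.0, max_cost=10.0):
    # Normalize latency and cost to [0, 1] and invert so lower is better.
    r_latency = 1.0 - min(latency_ms / max_latency, 1.0)
    r_cost = 1.0 - min(cost_units / max_cost, 1.0)
    # Jain's fairness index: 1.0 when instances are spread evenly.
    x = np.asarray(instances_per_cluster, dtype=float)
    r_fair = (x.sum() ** 2) / (len(x) * (x ** 2).sum()) if x.sum() > 0 else 0.0
    return w_latency * r_latency + w_cost * r_cost + w_fair * r_fair

# Example: 290 ms latency, 2.5 cost units, perfectly even spread.
print(round(reward(290, 2.5, [2, 2, 2]), 3))  # → 0.78
```

In an actual Gym environment this scalar would be returned from `step()` after each placement action, so the RL agent trades the three objectives off according to the chosen weights.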
HephaestusForge is available in a public open-source repository, allowing researchers to validate their own placement algorithms. This study also highlights the adaptability of the DeepSets (DS) neural network in optimizing microservice placement across diverse multi-cluster setups without retraining. The DS neural network can handle inputs and outputs as arbitrarily sized sets, enabling the RL algorithm to learn a policy not bound to a fixed number of clusters.
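The DeepSets property described above (a policy not bound to a fixed number of clusters) can be sketched as a shared per-cluster encoder, a permutation-invariant sum pooling, and a decoder that scores each cluster given the pooled set context. The layer sizes, random weights, and feature shapes below are illustrative assumptions; in the paper these parameters would be learned by the RL algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID = 4, 8                    # per-cluster features, hidden width (assumed)
W_phi = rng.standard_normal((D_IN, D_HID))
W_rho = rng.standard_normal((2 * D_HID,))

def score_clusters(cluster_features):
    """cluster_features: (n_clusters, D_IN); works for any n_clusters."""
    h = np.tanh(cluster_features @ W_phi)          # phi: shared per-cluster encoder
    pooled = h.sum(axis=0)                         # permutation-invariant pooling
    # rho: score each cluster from its own embedding plus the set context.
    ctx = np.concatenate([h, np.tile(pooled, (len(h), 1))], axis=1)
    return ctx @ W_rho                             # one logit per cluster

# The same weights handle 3 or 10 clusters without retraining.
print(score_clusters(rng.standard_normal((3, D_IN))).shape)   # (3,)
print(score_clusters(rng.standard_normal((10, D_IN))).shape)  # (10,)
```

Because pooling is a sum, reordering the input clusters merely reorders the output scores, which is exactly the property that lets one trained policy generalize across multi-cluster setups of different sizes.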
Source journal: Future Generation Computer Systems. CiteScore: 19.90 · Self-citation rate: 2.70% · Articles per year: 376 · Review time: 10.6 months.
About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.