Understanding Exploration and Exploitation of Q-Learning Agents in B5G Network Management

Sayantini Majumdar, R. Trivisonno, G. Carle
{"title":"Understanding Exploration and Exploitation of Q-Learning Agents in B5G Network Management","authors":"Sayantini Majumdar, R. Trivisonno, G. Carle","doi":"10.1109/GCWkshps52748.2021.9682129","DOIUrl":null,"url":null,"abstract":"Auto-scaling is a lifecycle management approach that automatically scales resources (CPU, memory etc.) based on incoming load to optimize resource utilization. Centralized orchestration, although optimal, comes at the cost of high signaling overhead. Alternatively, decentralized RL-based approaches such as Q-Learning (QL) are envisaged to be more suitable for the strict latency and overhead requirements of B5G/6G use cases, while also minimizing the number of resource allocation conflicts encountered in a distributed setting. Before QL agents can take optimal auto-scaling decisions, they need to explore or evaluate their actions based on the feedback they receive from the environment. The faster they learn, the sooner they could begin to exploit their knowledge. However, it is not clear when these agents have explored long enough to start taking management actions. This paper focuses on understanding when the exploration should end such that agents may start exploiting built knowledge. In our approach, we posit that the knowledge accrued by the agents in their Q-tables should indicate whether to explore or exploit. Hence, we conceive Knowledge Indicators (KIs) derived from their Q-tables. These KIs enable agents to learn autonomously, thereby enabling adjustment of the exploration parameter epsilon in the epsilon-greedy approach. Convergence results and corresponding impact on the system performance validate the proposed approach. This work has the potential to speed up the convergence of QL agents, thereby providing critical hints to operators targeting live deployments of B5G/6G decentralized network management.","PeriodicalId":6802,"journal":{"name":"2021 IEEE Globecom Workshops (GC Wkshps)","volume":"25 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Globecom Workshops (GC Wkshps)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/GCWkshps52748.2021.9682129","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Auto-scaling is a lifecycle management approach that automatically scales resources (CPU, memory, etc.) based on incoming load to optimize resource utilization. Centralized orchestration, although optimal, comes at the cost of high signaling overhead. Alternatively, decentralized RL-based approaches such as Q-Learning (QL) are envisaged to be more suitable for the strict latency and overhead requirements of B5G/6G use cases, while also minimizing the number of resource allocation conflicts encountered in a distributed setting. Before QL agents can take optimal auto-scaling decisions, they need to explore, i.e., evaluate their actions based on the feedback they receive from the environment. The faster they learn, the sooner they can begin to exploit their knowledge. However, it is not clear when these agents have explored long enough to start taking management actions. This paper focuses on understanding when exploration should end so that agents may start exploiting the knowledge they have built. In our approach, we posit that the knowledge accrued by the agents in their Q-tables should indicate whether to explore or exploit. Hence, we conceive Knowledge Indicators (KIs) derived from their Q-tables. These KIs enable agents to learn autonomously, thereby enabling adjustment of the exploration parameter epsilon in the epsilon-greedy approach. Convergence results and the corresponding impact on system performance validate the proposed approach. This work has the potential to speed up the convergence of QL agents, thereby providing critical hints to operators targeting live deployments of B5G/6G decentralized network management.
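
For illustration, the following is a minimal Python sketch (not from the paper) of how a Knowledge Indicator derived from a Q-table could drive the exploration parameter epsilon in the epsilon-greedy approach. The specific KI used here (the fraction of state-action pairs already updated) and the names QLAgent, knowledge_indicator, etc. are assumptions made for illustration only; the paper's actual KIs and update rules are defined in the full text.

```python
import numpy as np

class QLAgent:
    """Toy Q-Learning agent whose epsilon is tied to a Q-table-derived KI."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, eps_min=0.05):
        self.q = np.zeros((n_states, n_actions))          # Q-table
        self.visited = np.zeros((n_states, n_actions), dtype=bool)
        self.alpha, self.gamma = alpha, gamma
        self.eps_min = eps_min

    def knowledge_indicator(self):
        # Hypothetical KI: share of (state, action) pairs evaluated at least once.
        return self.visited.mean()

    def epsilon(self):
        # More accrued knowledge -> less exploration, floored at eps_min.
        return max(self.eps_min, 1.0 - self.knowledge_indicator())

    def act(self, state, rng=np.random):
        if rng.random() < self.epsilon():
            return int(rng.randint(self.q.shape[1]))      # explore
        return int(np.argmax(self.q[state]))              # exploit

    def update(self, s, a, r, s_next):
        # Standard Q-Learning update; mark the pair as visited for the KI.
        target = r + self.gamma * np.max(self.q[s_next])
        self.q[s, a] += self.alpha * (target - self.q[s, a])
        self.visited[s, a] = True

# Example: a toy agent with 4 load states and 2 scaling actions (scale up / down).
agent = QLAgent(n_states=4, n_actions=2)
a = agent.act(state=0)
agent.update(0, a, r=1.0, s_next=1)
```

Under this sketch, epsilon decays autonomously as the agent's Q-table fills with evaluated actions, which mirrors the paper's idea that accrued knowledge, rather than a fixed schedule, should decide when exploration ends.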