V-Learning—A Simple, Efficient, Decentralized Algorithm for Multiagent Reinforcement Learning

IF 1.4 | CAS Tier 3 (Mathematics) | Q2 MATHEMATICS, APPLIED | Mathematics of Operations Research | Pub Date: 2023-11-17 | DOI: 10.1287/moor.2021.0317
Chi Jin, Qinghua Liu, Yuanhao Wang, Tiancheng Yu
{"title":"v -Learning——一个简单、高效、分散的多智能体强化学习算法","authors":"Chi Jin, Qinghua Liu, Yuanhao Wang, Tiancheng Yu","doi":"10.1287/moor.2021.0317","DOIUrl":null,"url":null,"abstract":"A major challenge of multiagent reinforcement learning (MARL) is the curse of multiagents, where the size of the joint action space scales exponentially with the number of agents. This remains to be a bottleneck for designing efficient MARL algorithms, even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithms—V-learning, which provably learns Nash equilibria (in the two-player zero-sum setting), correlated equilibria, and coarse correlated equilibria (in the multiplayer general-sum setting) in a number of samples that only scales with [Formula: see text], where A<jats:sub>i</jats:sub> is the number of actions for the ith player. This is in sharp contrast to the size of the joint action space, which is [Formula: see text]. V-learning (in its basic form) is a new class of single-agent reinforcement learning (RL) algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into an RL algorithm. Similar to the classical Q-learning algorithm, it performs incremental updates to the value functions. Different from Q-learning, it only maintains the estimates of V-values instead of Q-values. This key difference allows V-learning to achieve the claimed guarantees in the MARL setting by simply letting all agents run V-learning independently.Funding: This work was partially supported by Office of Naval Research Grant N00014-22-1-2253.","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"35 1","pages":""},"PeriodicalIF":1.4000,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"V-Learning—A Simple, Efficient, Decentralized Algorithm for Multiagent Reinforcement Learning\",\"authors\":\"Chi Jin, Qinghua Liu, Yuanhao Wang, Tiancheng Yu\",\"doi\":\"10.1287/moor.2021.0317\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A major challenge of multiagent reinforcement learning (MARL) is the curse of multiagents, where the size of the joint action space scales exponentially with the number of agents. This remains to be a bottleneck for designing efficient MARL algorithms, even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithms—V-learning, which provably learns Nash equilibria (in the two-player zero-sum setting), correlated equilibria, and coarse correlated equilibria (in the multiplayer general-sum setting) in a number of samples that only scales with [Formula: see text], where A<jats:sub>i</jats:sub> is the number of actions for the ith player. This is in sharp contrast to the size of the joint action space, which is [Formula: see text]. V-learning (in its basic form) is a new class of single-agent reinforcement learning (RL) algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into an RL algorithm. Similar to the classical Q-learning algorithm, it performs incremental updates to the value functions. Different from Q-learning, it only maintains the estimates of V-values instead of Q-values. 
This key difference allows V-learning to achieve the claimed guarantees in the MARL setting by simply letting all agents run V-learning independently.Funding: This work was partially supported by Office of Naval Research Grant N00014-22-1-2253.\",\"PeriodicalId\":49852,\"journal\":{\"name\":\"Mathematics of Operations Research\",\"volume\":\"35 1\",\"pages\":\"\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2023-11-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Mathematics of Operations Research\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1287/moor.2021.0317\",\"RegionNum\":3,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICS, APPLIED\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Mathematics of Operations Research","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1287/moor.2021.0317","RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, APPLIED","Score":null,"Total":0}
Citations: 0

Abstract

A major challenge of multiagent reinforcement learning (MARL) is the curse of multiagents, where the size of the joint action space scales exponentially with the number of agents. This remains a bottleneck for designing efficient MARL algorithms, even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithms, V-learning, which provably learns Nash equilibria (in the two-player zero-sum setting), correlated equilibria, and coarse correlated equilibria (in the multiplayer general-sum setting) in a number of samples that scales only with max_i A_i, where A_i is the number of actions for the ith player. This is in sharp contrast to the size of the joint action space, which is ∏_i A_i. V-learning (in its basic form) is a new class of single-agent reinforcement learning (RL) algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into an RL algorithm. Similar to the classical Q-learning algorithm, it performs incremental updates to the value functions. Different from Q-learning, it maintains only estimates of V-values instead of Q-values. This key difference allows V-learning to achieve the claimed guarantees in the MARL setting simply by letting all agents run V-learning independently.

Funding: This work was partially supported by Office of Naval Research Grant N00014-22-1-2253.
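Since the abstract only describes the update at a high level, the following is a minimal, illustrative single-agent sketch of the idea: the agent keeps a V-value table per step and state, updates it incrementally (as in Q-learning, but without Q-values), and delegates action selection at each (h, s) to an adversarial bandit over its own actions. The class name, the EXP3-style weight update, and the specific step-size and bonus constants are assumptions for illustration, not the authors' exact construction.

```python
import numpy as np

# Minimal sketch of a V-learning-style update loop for one agent.
# Illustrative only: the bandit subroutine (an EXP3-style exponential-weights
# update), the bonus form, and the constants below are assumptions.

class VLearningAgent:
    def __init__(self, n_states, n_actions, horizon, eta=0.1, bonus_c=1.0):
        self.H = horizon
        self.A = n_actions
        # V-value estimates, one table per step h (V[H] stays 0).
        self.V = np.zeros((horizon + 1, n_states))
        # Visit counts per (h, s), used for the step-size schedule.
        self.counts = np.zeros((horizon, n_states), dtype=int)
        # Adversarial-bandit weights per (h, s) over this agent's own actions.
        self.weights = np.ones((horizon, n_states, n_actions))
        self.eta = eta          # bandit learning rate (assumed constant)
        self.bonus_c = bonus_c  # optimism bonus scale (assumed)

    def policy(self, h, s):
        w = self.weights[h, s]
        return w / w.sum()

    def act(self, h, s, rng):
        return rng.choice(self.A, p=self.policy(h, s))

    def update(self, h, s, a, reward, s_next):
        # Incremental V-value update, in the spirit of Q-learning but on V.
        self.counts[h, s] += 1
        t = self.counts[h, s]
        alpha = (self.H + 1) / (self.H + t)          # H-weighted step size
        bonus = self.bonus_c * np.sqrt(self.H / t)   # optimism bonus (assumed form)
        target = reward + self.V[h + 1, s_next] + bonus
        self.V[h, s] = (1 - alpha) * self.V[h, s] + alpha * min(target, self.H)
        # Feed the observed loss to the bandit for (h, s), importance-weighted
        # by the probability of the chosen action, as in EXP3.
        loss = (self.H - reward - self.V[h + 1, s_next]) / self.H
        p = self.policy(h, s)[a]
        self.weights[h, s, a] *= np.exp(-self.eta * loss / max(p, 1e-8))
        self.weights[h, s] /= self.weights[h, s].sum()  # renormalize for stability
```

In the multiplayer setting, each of the m players would run such an instance independently over its own action set, which is what keeps the per-agent sample cost scaling with max_i A_i rather than with the joint action space ∏_i A_i.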
Source Journal: Mathematics of Operations Research (Management Science, Applied Mathematics)
CiteScore: 3.40
Self-citation rate: 5.90%
Articles published: 178
Review time: 15.0 months
Journal Description: Mathematics of Operations Research is an international journal of the Institute for Operations Research and the Management Sciences (INFORMS). The journal invites articles concerned with the mathematical and computational foundations in the areas of continuous, discrete, and stochastic optimization; mathematical programming; dynamic programming; stochastic processes; stochastic models; simulation methodology; control and adaptation; networks; game theory; and decision theory. Also sought are contributions to learning theory and machine learning that have special relevance to decision making, operations research, and management science. The emphasis is on originality, quality, and importance; correctness alone is not sufficient. Significant developments in operations research and management science not having substantial mathematical interest should be directed to other journals such as Management Science or Operations Research.