平均场多代理强化学习：分散网络方法

IF 1.4 3区数学 Q2 MATHEMATICS, APPLIED Mathematics of Operations Research Pub Date : 2024-03-13 DOI:10.1287/moor.2022.0055

Haotian Gu, Xin Guo, Xiaoli Wei, Renyuan Xu

{"title":"平均场多代理强化学习：分散网络方法","authors":"Haotian Gu, Xin Guo, Xiaoli Wei, Renyuan Xu","doi":"10.1287/moor.2022.0055","DOIUrl":null,"url":null,"abstract":"One of the challenges for multiagent reinforcement learning (MARL) is designing efficient learning algorithms for a large system in which each agent has only limited or partial information of the entire system. Whereas exciting progress has been made to analyze decentralized MARL with the network of agents for social networks and team video games, little is known theoretically for decentralized MARL with the network of states for modeling self-driving vehicles, ride-sharing, and data and traffic routing. This paper proposes a framework of localized training and decentralized execution to study MARL with the network of states. Localized training means that agents only need to collect local information in their neighboring states during the training phase; decentralized execution implies that agents can execute afterward the learned decentralized policies, which depend only on agents’ current states. The theoretical analysis consists of three key components: the first is the reformulation of the MARL system as a networked Markov decision process with teams of agents, enabling updating the associated team Q-function in a localized fashion; the second is the Bellman equation for the value function and the appropriate Q-function on the probability measure space; and the third is the exponential decay property of the team Q-function, facilitating its approximation with efficient sample efficiency and controllable error. The theoretical analysis paves the way for a new algorithm LTDE-Neural-AC, in which the actor–critic approach with overparameterized neural networks is proposed. The convergence and sample complexity are established and shown to be scalable with respect to the sizes of both agents and states. To the best of our knowledge, this is the first neural network–based MARL algorithm with network structure and provable convergence guarantee.Funding: X. Wei is partially supported by NSFC no. 12201343. R. Xu is partially supported by the NSF CAREER award DMS-2339240.","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"30 1","pages":""},"PeriodicalIF":1.4000,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Mean-Field Multiagent Reinforcement Learning: A Decentralized Network Approach\",\"authors\":\"Haotian Gu, Xin Guo, Xiaoli Wei, Renyuan Xu\",\"doi\":\"10.1287/moor.2022.0055\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"One of the challenges for multiagent reinforcement learning (MARL) is designing efficient learning algorithms for a large system in which each agent has only limited or partial information of the entire system. Whereas exciting progress has been made to analyze decentralized MARL with the network of agents for social networks and team video games, little is known theoretically for decentralized MARL with the network of states for modeling self-driving vehicles, ride-sharing, and data and traffic routing. This paper proposes a framework of localized training and decentralized execution to study MARL with the network of states. Localized training means that agents only need to collect local information in their neighboring states during the training phase; decentralized execution implies that agents can execute afterward the learned decentralized policies, which depend only on agents’ current states. The theoretical analysis consists of three key components: the first is the reformulation of the MARL system as a networked Markov decision process with teams of agents, enabling updating the associated team Q-function in a localized fashion; the second is the Bellman equation for the value function and the appropriate Q-function on the probability measure space; and the third is the exponential decay property of the team Q-function, facilitating its approximation with efficient sample efficiency and controllable error. The theoretical analysis paves the way for a new algorithm LTDE-Neural-AC, in which the actor–critic approach with overparameterized neural networks is proposed. The convergence and sample complexity are established and shown to be scalable with respect to the sizes of both agents and states. To the best of our knowledge, this is the first neural network–based MARL algorithm with network structure and provable convergence guarantee.Funding: X. Wei is partially supported by NSFC no. 12201343. R. Xu is partially supported by the NSF CAREER award DMS-2339240.\",\"PeriodicalId\":49852,\"journal\":{\"name\":\"Mathematics of Operations Research\",\"volume\":\"30 1\",\"pages\":\"\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2024-03-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Mathematics of Operations Research\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1287/moor.2022.0055\",\"RegionNum\":3,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICS, APPLIED\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Mathematics of Operations Research","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1287/moor.2022.0055","RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, APPLIED","Score":null,"Total":0}

引用次数: 0

摘要

多代理强化学习（MARL）面临的挑战之一，是为一个大型系统设计高效的学习算法，而在这个系统中，每个代理只掌握整个系统的有限或部分信息。虽然在分析社交网络和团队视频游戏中的代理网络分散强化学习方面已经取得了令人振奋的进展，但对于自动驾驶汽车建模、共享乘车、数据和交通路由的状态网络分散强化学习，理论界却知之甚少。本文提出了一个本地化训练和分散执行的框架，以研究具有状态网络的 MARL。本地化训练指的是，代理在训练阶段只需收集其相邻状态的本地信息；分散执行指的是，代理可以在事后执行所学到的分散策略，这些策略只取决于代理的当前状态。理论分析由三个关键部分组成：第一部分是将 MARL 系统重新表述为一个具有代理团队的网络化马尔可夫决策过程，从而能够以本地化的方式更新相关的团队 Q 函数；第二部分是概率度量空间上的价值函数和适当 Q 函数的贝尔曼方程；第三部分是团队 Q 函数的指数衰减特性，有助于以高效的样本效率和可控的误差对其进行近似。理论分析为新算法 LTDE-Neural-AC 的提出铺平了道路。该算法的收敛性和采样复杂度均已确定，并证明可根据代理和状态的大小进行扩展。据我们所知，这是第一个基于神经网络的 MARL 算法，具有网络结构和可证明的收敛性保证：国家自然科学基金委员会编号：12201343。12201343.R. Xu 部分获得国家自然科学基金 CAREER 奖 DMS-2339240 的资助。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Mean-Field Multiagent Reinforcement Learning: A Decentralized Network Approach

One of the challenges for multiagent reinforcement learning (MARL) is designing efficient learning algorithms for a large system in which each agent has only limited or partial information of the entire system. Whereas exciting progress has been made to analyze decentralized MARL with the network of agents for social networks and team video games, little is known theoretically for decentralized MARL with the network of states for modeling self-driving vehicles, ride-sharing, and data and traffic routing. This paper proposes a framework of localized training and decentralized execution to study MARL with the network of states. Localized training means that agents only need to collect local information in their neighboring states during the training phase; decentralized execution implies that agents can execute afterward the learned decentralized policies, which depend only on agents’ current states. The theoretical analysis consists of three key components: the first is the reformulation of the MARL system as a networked Markov decision process with teams of agents, enabling updating the associated team Q-function in a localized fashion; the second is the Bellman equation for the value function and the appropriate Q-function on the probability measure space; and the third is the exponential decay property of the team Q-function, facilitating its approximation with efficient sample efficiency and controllable error. The theoretical analysis paves the way for a new algorithm LTDE-Neural-AC, in which the actor–critic approach with overparameterized neural networks is proposed. The convergence and sample complexity are established and shown to be scalable with respect to the sizes of both agents and states. To the best of our knowledge, this is the first neural network–based MARL algorithm with network structure and provable convergence guarantee.Funding: X. Wei is partially supported by NSFC no. 12201343. R. Xu is partially supported by the NSF CAREER award DMS-2339240.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Mathematics of Operations Research 管理科学-应用数学

CiteScore

3.40

自引率

5.90%

发文量

178

审稿时长

15.0 months

期刊介绍： Mathematics of Operations Research is an international journal of the Institute for Operations Research and the Management Sciences (INFORMS). The journal invites articles concerned with the mathematical and computational foundations in the areas of continuous, discrete, and stochastic optimization; mathematical programming; dynamic programming; stochastic processes; stochastic models; simulation methodology; control and adaptation; networks; game theory; and decision theory. Also sought are contributions to learning theory and machine learning that have special relevance to decision making, operations research, and management science. The emphasis is on originality, quality, and importance; correctness alone is not sufficient. Significant developments in operations research and management science not having substantial mathematical interest should be directed to other journals such as Management Science or Operations Research.

期刊最新文献

Dual Solutions in Convex Stochastic Optimization Exit Game with Private Information A Retrospective Approximation Approach for Smooth Stochastic Optimization The Minimax Property in Infinite Two-Person Win-Lose Games Envy-Free Division of Multilayered Cakes