MCMARL: Parameterizing Value Function via Mixture of Categorical Distributions for Multi-Agent Reinforcement Learning

Jian Zhao; Mingyu Yang; Youpeng Zhao; Xunhan Hu; Wengang Zhou; Houqiang Li

IEEE Transactions on Games, vol. 16, no. 3, pp. 556-565, 2023. DOI: 10.1109/TG.2023.3310150
Abstract
In cooperative multi-agent tasks, a team of agents jointly interacts with an environment by taking actions, receiving a team reward, and observing the next state. During these interactions, the uncertainty of the environment and reward inevitably induces stochasticity in the long-term returns, and this randomness can be exacerbated as the number of agents increases. However, such randomness is ignored by most existing value-based multi-agent reinforcement learning (MARL) methods, which model only the expectation of the $Q$-value for both the individual agents and the team. Rather than using the expectations of the long-term returns, it is preferable to model the stochasticity directly by estimating the returns through distributions. With this motivation, this article proposes a novel value-based MARL framework from a distributional perspective, i.e., parameterizing the value function via a Mixture of Categorical distributions for MARL (MCMARL). Specifically, we model both the individual and global $Q$-values with categorical distributions. To integrate categorical distributions, we define five basic operations on the distribution, which allow the generalization of expected value function factorization methods (e.g., value decomposition networks (VDN) and QMIX) to their MCMARL variants. We further prove that our MCMARL framework satisfies the Distributional-Individual-Global-Max principle with respect to the expectation of the distribution, which guarantees consistency between joint and individual greedy action selections on the global and individual $Q$-values. Empirically, we evaluate MCMARL on both a stochastic matrix game and the challenging set of StarCraft II micromanagement tasks, demonstrating the efficacy of our framework.
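To make the distributional modelling concrete, below is a minimal, self-contained sketch (not the authors' implementation): each action's return is represented as a categorical distribution over a fixed support, greedy actions are chosen by the distribution's expectation (the quantity the Distributional-Individual-Global-Max principle above refers to), and a VDN-style additive combination of two agents' distributions is approximated by convolving the categorical PMFs and projecting the result back onto the shared support. The support, the additive mixing rule, and all names are assumptions for illustration; the paper's five distribution operations are not reproduced here.

```python
# Hedged sketch of Q-values as categorical distributions.
# NOT the MCMARL implementation: the support (atoms), the additive mixing
# rule, and all names below are assumptions for illustration only.
import numpy as np

N_ATOMS = 51
V_MIN, V_MAX = -10.0, 10.0
ATOMS = np.linspace(V_MIN, V_MAX, N_ATOMS)   # fixed support z_1, ..., z_K
DELTA = ATOMS[1] - ATOMS[0]

def expectation(probs: np.ndarray) -> np.ndarray:
    """Mean of categorical return distributions; probs has shape (..., N_ATOMS)."""
    return (probs * ATOMS).sum(axis=-1)

def greedy_action(agent_probs: np.ndarray) -> int:
    """Choose the action whose return distribution has the largest expectation."""
    return int(np.argmax(expectation(agent_probs)))

def project_onto_atoms(support: np.ndarray, probs: np.ndarray) -> np.ndarray:
    """C51-style projection of an arbitrary discrete distribution onto ATOMS."""
    projected = np.zeros(N_ATOMS)
    b = (np.clip(support, V_MIN, V_MAX) - V_MIN) / DELTA   # fractional atom index
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
    np.add.at(projected, lower, probs * (upper - b))       # mass to the lower atom
    np.add.at(projected, upper, probs * (b - lower))       # mass to the upper atom
    exact = (lower == upper)                               # b landed exactly on an atom
    np.add.at(projected, lower[exact], probs[exact])
    return projected

def additive_mixture(dist_a: np.ndarray, dist_b: np.ndarray) -> np.ndarray:
    """Distribution of Z_a + Z_b for independent categorical Z_a, Z_b (a VDN-like
    additive combination, standing in for the paper's operations, which are not
    detailed in the abstract)."""
    pair_support = ATOMS[:, None] + ATOMS[None, :]         # all pairwise sums
    pair_probs = dist_a[:, None] * dist_b[None, :]         # independence assumption
    return project_onto_atoms(pair_support.ravel(), pair_probs.ravel())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two agents with three actions each; random categorical Q-distributions.
    q1 = rng.dirichlet(np.ones(N_ATOMS), size=3)
    q2 = rng.dirichlet(np.ones(N_ATOMS), size=3)
    a1, a2 = greedy_action(q1), greedy_action(q2)
    joint = additive_mixture(q1[a1], q2[a2])
    print("greedy actions:", a1, a2)
    print("joint expected return:", expectation(joint))
```

Because action selection in this sketch depends only on each distribution's expectation, individual greedy choices remain consistent with the greedy choice under the combined distribution, which is the intuition behind the expectation-based DIGM guarantee stated in the abstract.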