Constraining an Unconstrained Multi-agent Policy with offline data

IF 6.3; Zone 1 (Computer Science); Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Neural Networks; Pub Date: 2025-06-01; Epub Date: 2025-02-13; DOI: 10.1016/j.neunet.2025.107253

Cong Guan, Tao Jiang, Yi-Chen Li, Zongzhang Zhang, Lei Yuan, Yang Yu

Neural Networks, volume 186, Article 107253 (Journal Article). https://www.sciencedirect.com/science/article/pii/S0893608025001327

Abstract

Real-world multi-agent decision-making systems often must satisfy constraints, such as limits on harm or economic cost, which has spurred the emergence of Constrained Multi-Agent Reinforcement Learning (CMARL). Existing CMARL studies mainly focus on training a constrained policy online, that is, maximizing cumulative reward while not violating constraints. However, in practice, online learning may be infeasible due to safety restrictions or the lack of high-fidelity simulators. Moreover, while the learned policy is running, new constraints that were not taken into account during training may arise. To address these two issues, we propose Constraining an UnconsTrained Multi-Agent Policy with offline data (CUTMAP), a method that follows the popular centralized training with decentralized execution paradigm. Specifically, we formulate a scalable optimization objective for CMARL within the framework of multi-agent maximum entropy reinforcement learning. The approach estimates a decomposable Q-function by leveraging an unconstrained "prior policy" together with cost signals extracted from offline data. When a new constraint arises, CUTMAP can reuse the prior policy without retraining it. To tackle the distribution shift challenge in offline learning, we also incorporate a conservative loss term when updating the Q-function. As a result, the unconstrained prior policy can be made to satisfy cost constraints through CUTMAP without expensive interactions with the real environment, facilitating the practical application of MARL algorithms. Empirical results on several cooperative multi-agent benchmarks, including StarCraft games, particle games, food search games, and robot control, demonstrate the superior performance of our method.
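
The abstract names three ingredients without giving formulas: a prior policy reused under a new cost constraint, a decomposable Q-function, and a conservative loss term. The sketch below illustrates generic textbook versions of each and is not the authors' implementation. It assumes discrete actions, the standard KL-regularized (maximum entropy) reweighting of a prior by exp(-Q_cost / alpha), an additive (VDN-style) decomposition, and a CQL-style penalty; the function names, the temperature `alpha`, and the choice of decomposition are all hypothetical.

```python
import math
from typing import List, Sequence


def constrain_prior(prior: Sequence[float], cost_q: Sequence[float],
                    alpha: float) -> List[float]:
    """Reweight one agent's prior action probabilities by exp(-cost_q / alpha).

    Assumption: this is the generic KL-regularized form, not CUTMAP's exact update.
    At least one prior probability must be positive.
    """
    # Work in log space; actions with zero prior probability stay at zero.
    logits = [
        (math.log(p) if p > 0 else float("-inf")) - q / alpha
        for p, q in zip(prior, cost_q)
    ]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]


def joint_q(per_agent_q: Sequence[Sequence[float]],
            joint_action: Sequence[int]) -> float:
    """Additive decomposition (assumed): Q_tot = sum_i Q_i(o_i, a_i)."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))


def conservative_penalty(q_values: Sequence[float], data_action: int) -> float:
    """CQL-style gap: logsumexp over all actions minus Q of the logged action.

    Minimizing it lowers Q on actions absent from the offline data relative to
    the logged one. The sign convention for a cost Q-function may differ.
    """
    m = max(q_values)
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[data_action]


if __name__ == "__main__":
    prior = [0.5, 0.3, 0.2]      # hypothetical prior policy for one agent
    cost_q = [2.0, 0.0, 0.0]     # hypothetical cost estimates per action
    print(constrain_prior(prior, cost_q, alpha=1.0))
```

Under these assumptions the prior itself is never retrained: a new constraint only changes `cost_q`, and each agent can apply the reweighting locally from its own Q-values, which is compatible with decentralized execution.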

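Source journal

Neural Networks (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 13.90
Self-citation rate: 7.70%
Articles published: 425
Review time: 67 days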

Journal description: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.