Constraining an Unconstrained Multi-agent Policy with offline data

Neural Networks | IF 6.3 | Zone 1 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence) | Pub Date: 2025-06-01 | Epub Date: 2025-02-13 | DOI: 10.1016/j.neunet.2025.107253
Cong Guan, Tao Jiang, Yi-Chen Li, Zongzhang Zhang, Lei Yuan, Yang Yu
Neural Networks, volume 186, Article 107253. Journal article, not open access. https://www.sciencedirect.com/science/article/pii/S0893608025001327
Citations: 0

Abstract

Real-world multi-agent decision-making systems often have to satisfy constraints, such as limits on harm or economic cost, which has spurred the emergence of Constrained Multi-Agent Reinforcement Learning (CMARL). Existing CMARL studies mainly train a constrained policy online, that is, they maximize cumulative reward while avoiding constraint violations. In practice, however, online learning may be infeasible because of safety restrictions or the lack of a high-fidelity simulator. Moreover, new constraints that were not taken into account during training may arise once the learned policy is running. To address these two issues, we propose Constraining an UnconsTrained Multi-Agent Policy with offline data (CUTMAP), a method that follows the popular centralized training with decentralized execution paradigm. Specifically, we formulate a scalable optimization objective for CMARL within the framework of multi-agent maximum entropy reinforcement learning. The objective is designed to estimate a decomposable Q-function by combining an unconstrained "prior policy" with cost signals extracted from offline data. When a new constraint arises, CUTMAP can reuse the prior policy without retraining it. To tackle distribution shift in offline learning, we also add a conservative loss term when updating the Q-function. As a result, CUTMAP can adapt the unconstrained prior policy to satisfy cost constraints without expensive interaction with the real environment, which facilitates the practical application of MARL algorithms. Empirical results on several cooperative multi-agent benchmarks, including StarCraft games, particle games, food search games, and robot control, demonstrate the superior performance of our method.
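The abstract does not state the objective or the loss, so the following is only a rough sketch of the two ingredients it names, under assumptions that are not taken from the paper. It assumes a KL-regularized maximum entropy setting, in which the optimal policy is proportional to the prior times exp(Q / alpha), and it uses a CQL-style regularizer as a stand-in for the "conservative loss term". The function names, the temperature `alpha`, and the toy numbers are all hypothetical, and `q` is taken to be a per-agent value where higher is better (for example, a negative expected cost).

```python
import numpy as np

def constrain_prior(prior_probs, q, alpha=1.0):
    """Reweight one agent's prior action distribution by a per-agent Q-value.

    Assumed form (not from the paper): new(a) is proportional to
    prior(a) * exp(q(a) / alpha), the standard KL-regularized solution.
    """
    logits = np.log(prior_probs + 1e-12) + q / alpha
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def conservative_penalty(q, dataset_action):
    """CQL-style term (an assumed stand-in): logsumexp over all actions
    minus the Q-value of the action actually logged in the offline data."""
    m = q.max()
    logsumexp = m + np.log(np.exp(q - m).sum())
    return logsumexp - q[dataset_action]

# Hypothetical toy case: two agents, three discrete actions each.
priors = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]])
# Agent 0's first action is costly (low value); agent 1 sees no cost signal.
qs = np.array([[-5.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]])

# Decentralized execution: each agent reweights its own prior independently.
constrained = np.stack([constrain_prior(p, q) for p, q in zip(priors, qs)])
penalty = conservative_penalty(np.zeros(3), dataset_action=0)
```

In this toy case the prior network is never touched: agent 0 moves probability away from the costly action, while agent 1, whose Q-values are flat, keeps its prior unchanged. That mirrors the reuse-without-retraining property described above, but only within the assumed form.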
Source journal
Neural Networks (Engineering and Technology: Computer Science, Artificial Intelligence)
CiteScore: 13.90
Self-citation rate: 7.70%
Articles published: 425
Review time: 67 days
Journal description: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. The journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, it aims to encourage the development of biologically-inspired artificial intelligence.