Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques

Natalia Zhang, Xinqi Wang, Qiwen Cui, Runlong Zhou, Sham M. Kakade, Simon S. Du
{"title":"从人类反馈中进行多代理强化学习:数据覆盖和算法技术","authors":"Natalia Zhang, Xinqi Wang, Qiwen Cui, Runlong Zhou, Sham M. Kakade, Simon S. Du","doi":"arxiv-2409.00717","DOIUrl":null,"url":null,"abstract":"We initiate the study of Multi-Agent Reinforcement Learning from Human\nFeedback (MARLHF), exploring both theoretical foundations and empirical\nvalidations. We define the task as identifying Nash equilibrium from a\npreference-only offline dataset in general-sum games, a problem marked by the\nchallenge of sparse feedback signals. Our theory establishes the upper\ncomplexity bounds for Nash Equilibrium in effective MARLHF, demonstrating that\nsingle-policy coverage is inadequate and highlighting the importance of\nunilateral dataset coverage. These theoretical insights are verified through\ncomprehensive experiments. To enhance the practical performance, we further\nintroduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE)\nregularization along the time axis to achieve a more uniform reward\ndistribution and improve reward learning outcomes. (2) We utilize imitation\nlearning to approximate the reference policy, ensuring stability and\neffectiveness in training. Our findings underscore the multifaceted approach\nrequired for MARLHF, paving the way for effective preference-based multi-agent\nsystems.","PeriodicalId":501315,"journal":{"name":"arXiv - CS - Multiagent Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques\",\"authors\":\"Natalia Zhang, Xinqi Wang, Qiwen Cui, Runlong Zhou, Sham M. Kakade, Simon S. Du\",\"doi\":\"arxiv-2409.00717\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We initiate the study of Multi-Agent Reinforcement Learning from Human\\nFeedback (MARLHF), exploring both theoretical foundations and empirical\\nvalidations. We define the task as identifying Nash equilibrium from a\\npreference-only offline dataset in general-sum games, a problem marked by the\\nchallenge of sparse feedback signals. Our theory establishes the upper\\ncomplexity bounds for Nash Equilibrium in effective MARLHF, demonstrating that\\nsingle-policy coverage is inadequate and highlighting the importance of\\nunilateral dataset coverage. These theoretical insights are verified through\\ncomprehensive experiments. To enhance the practical performance, we further\\nintroduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE)\\nregularization along the time axis to achieve a more uniform reward\\ndistribution and improve reward learning outcomes. (2) We utilize imitation\\nlearning to approximate the reference policy, ensuring stability and\\neffectiveness in training. 
Our findings underscore the multifaceted approach\\nrequired for MARLHF, paving the way for effective preference-based multi-agent\\nsystems.\",\"PeriodicalId\":501315,\"journal\":{\"name\":\"arXiv - CS - Multiagent Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multiagent Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.00717\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multiagent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.00717","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

We initiate the study of Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations. We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash equilibrium in effective MARLHF, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We utilize imitation learning to approximate the reference policy, ensuring stability and effectiveness in training. Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
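
The two algorithmic techniques can be made concrete with short sketches. Below is a minimal, hypothetical rendering of technique (1), assuming a Bradley-Terry preference model over trajectory returns and a per-step reward model; the function names, the exact form of the time-axis penalty, and the weighting coefficient are illustrative assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def preference_loss_with_time_mse(reward_model, traj_a, traj_b, pref, lam=0.1):
    """Bradley-Terry preference loss plus an MSE regularizer along the time axis.

    Sketch only: the regularizer penalizes the deviation of each per-step reward
    from its trajectory mean, nudging the learned reward toward a more uniform
    distribution over time, as described in the abstract.
    """
    r_a = reward_model(traj_a)  # per-step rewards, shape (T,)
    r_b = reward_model(traj_b)

    # Preference probability modeled from the difference of trajectory returns.
    logit = r_a.sum() - r_b.sum()
    bt_loss = F.binary_cross_entropy_with_logits(logit, pref)  # pref: 1.0 if traj_a is preferred

    # MSE regularization along the time axis: keep per-step rewards close to
    # the trajectory mean instead of concentrating reward on a few steps.
    mse_reg = ((r_a - r_a.mean()) ** 2).mean() + ((r_b - r_b.mean()) ** 2).mean()

    return bt_loss + lam * mse_reg
```

Technique (2) approximates the reference policy by imitation learning on the offline dataset; a standard behavior-cloning update (again a sketch, assuming discrete actions and a logits-producing policy network) could look like:

```python
def behavior_cloning_step(policy, optimizer, obs_batch, act_batch):
    """One behavior-cloning update fitting a reference policy to dataset actions."""
    logits = policy(obs_batch)                 # (B, num_actions)
    loss = F.cross_entropy(logits, act_batch)  # negative log-likelihood of logged actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```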