Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards

Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe
{"title":"平衡法:LLM 设计的不安定强盗奖励的优先级策略","authors":"Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe","doi":"arxiv-2408.12112","DOIUrl":null,"url":null,"abstract":"LLMs are increasingly used to design reward functions based on human\npreferences in Reinforcement Learning (RL). We focus on LLM-designed rewards\nfor Restless Multi-Armed Bandits, a framework for allocating limited resources\namong agents. In applications such as public health, this approach empowers\ngrassroots health workers to tailor automated allocation decisions to community\nneeds. In the presence of multiple agents, altering the reward function based\non human preferences can impact subpopulations very differently, leading to\ncomplex tradeoffs and a multi-objective resource allocation problem. We are the\nfirst to present a principled method termed Social Choice Language Model for\ndealing with these tradeoffs for LLM-designed rewards for multiagent planners\nin general and restless bandits in particular. The novel part of our model is a\ntransparent and configurable selection component, called an adjudicator,\nexternal to the LLM that controls complex tradeoffs via a user-selected social\nwelfare function. Our experiments demonstrate that our model reliably selects\nmore effective, aligned, and balanced reward functions compared to purely\nLLM-based approaches.","PeriodicalId":501315,"journal":{"name":"arXiv - CS - Multiagent Systems","volume":"8 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards\",\"authors\":\"Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe\",\"doi\":\"arxiv-2408.12112\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"LLMs are increasingly used to design reward functions based on human\\npreferences in Reinforcement Learning (RL). We focus on LLM-designed rewards\\nfor Restless Multi-Armed Bandits, a framework for allocating limited resources\\namong agents. In applications such as public health, this approach empowers\\ngrassroots health workers to tailor automated allocation decisions to community\\nneeds. In the presence of multiple agents, altering the reward function based\\non human preferences can impact subpopulations very differently, leading to\\ncomplex tradeoffs and a multi-objective resource allocation problem. We are the\\nfirst to present a principled method termed Social Choice Language Model for\\ndealing with these tradeoffs for LLM-designed rewards for multiagent planners\\nin general and restless bandits in particular. The novel part of our model is a\\ntransparent and configurable selection component, called an adjudicator,\\nexternal to the LLM that controls complex tradeoffs via a user-selected social\\nwelfare function. 
Our experiments demonstrate that our model reliably selects\\nmore effective, aligned, and balanced reward functions compared to purely\\nLLM-based approaches.\",\"PeriodicalId\":501315,\"journal\":{\"name\":\"arXiv - CS - Multiagent Systems\",\"volume\":\"8 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multiagent Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.12112\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multiagent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.12112","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can impact subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We are the first to present a principled method, termed Social Choice Language Model, for dealing with these tradeoffs for LLM-designed rewards for multiagent planners in general and restless bandits in particular. The novel part of our model is a transparent and configurable selection component, called an adjudicator, external to the LLM, that controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions compared to purely LLM-based approaches.
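The abstract does not include an implementation, but the selection step the adjudicator performs can be illustrated with a minimal sketch. The candidate reward functions, the per-subpopulation outcome estimates, and the specific welfare functions below are hypothetical stand-ins for illustration, not the authors' code: the idea is simply to score each LLM-proposed reward function by the outcomes it induces across subpopulations and pick the one maximizing a user-selected social welfare function.

```python
import math
from typing import Callable, Dict, List

# Hypothetical per-subpopulation outcomes, e.g. estimated by simulating a
# restless-bandit planner under each LLM-proposed reward function.
# (All values are placeholders for illustration.)
candidate_outcomes: Dict[str, List[float]] = {
    "reward_fn_A": [0.95, 0.20, 0.95],  # one outcome per subpopulation
    "reward_fn_B": [0.60, 0.58, 0.62],
    "reward_fn_C": [0.85, 0.40, 0.70],
}

# User-selectable social welfare functions from standard social-choice theory.
welfare_functions: Dict[str, Callable[[List[float]], float]] = {
    "utilitarian": lambda xs: sum(xs),        # total welfare
    "egalitarian": lambda xs: min(xs),        # welfare of the worst-off subpopulation
    "nash": lambda xs: math.prod(xs),         # product of subpopulation welfares
}

def adjudicate(outcomes: Dict[str, List[float]], welfare: str) -> str:
    """Return the candidate reward function maximizing the chosen welfare function."""
    score = welfare_functions[welfare]
    return max(outcomes, key=lambda name: score(outcomes[name]))

if __name__ == "__main__":
    for welfare in welfare_functions:
        print(welfare, "->", adjudicate(candidate_outcomes, welfare))
```

In this toy example, the utilitarian, egalitarian, and Nash welfare functions select different reward functions, which is exactly the kind of tradeoff the paper places under transparent user control by keeping the adjudicator outside the LLM.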