Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards

Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe
{"title":"Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards","authors":"Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe","doi":"arxiv-2408.12112","DOIUrl":null,"url":null,"abstract":"LLMs are increasingly used to design reward functions based on human\npreferences in Reinforcement Learning (RL). We focus on LLM-designed rewards\nfor Restless Multi-Armed Bandits, a framework for allocating limited resources\namong agents. In applications such as public health, this approach empowers\ngrassroots health workers to tailor automated allocation decisions to community\nneeds. In the presence of multiple agents, altering the reward function based\non human preferences can impact subpopulations very differently, leading to\ncomplex tradeoffs and a multi-objective resource allocation problem. We are the\nfirst to present a principled method termed Social Choice Language Model for\ndealing with these tradeoffs for LLM-designed rewards for multiagent planners\nin general and restless bandits in particular. The novel part of our model is a\ntransparent and configurable selection component, called an adjudicator,\nexternal to the LLM that controls complex tradeoffs via a user-selected social\nwelfare function. Our experiments demonstrate that our model reliably selects\nmore effective, aligned, and balanced reward functions compared to purely\nLLM-based approaches.","PeriodicalId":501315,"journal":{"name":"arXiv - CS - Multiagent Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multiagent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.12112","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can impact subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We are the first to present a principled method, termed Social Choice Language Model, for dealing with these tradeoffs in LLM-designed rewards for multiagent planners in general and restless bandits in particular. The novel part of our model is a transparent and configurable selection component, called an adjudicator, external to the LLM, that controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions compared to purely LLM-based approaches.
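To make the adjudicator idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: the LLM proposes several candidate reward functions, each candidate is evaluated to yield per-subpopulation utilities, and the adjudicator picks the candidate that maximizes a user-selected social welfare function (e.g., utilitarian or egalitarian). All names (`adjudicate`, `evaluate`, the welfare functions) are hypothetical assumptions for illustration.

```python
# Hypothetical sketch of an adjudicator choosing among LLM-proposed reward
# functions via a user-selected social welfare function. Names and structure
# are illustrative assumptions, not the authors' implementation.
from typing import Callable, Dict, List

# A candidate reward function maps an agent's state features to a scalar reward.
RewardFn = Callable[[Dict[str, float]], float]


def utilitarian(utilities: List[float]) -> float:
    """Sum of subpopulation utilities (favors aggregate gain)."""
    return sum(utilities)


def egalitarian(utilities: List[float]) -> float:
    """Minimum subpopulation utility (favors the worst-off group)."""
    return min(utilities)


def adjudicate(
    candidates: List[RewardFn],
    evaluate: Callable[[RewardFn], List[float]],
    welfare: Callable[[List[float]], float],
) -> RewardFn:
    """Return the candidate whose per-subpopulation utilities score highest
    under the chosen social welfare function."""
    return max(candidates, key=lambda fn: welfare(evaluate(fn)))


# Usage sketch (hypothetical): `evaluate` would run the restless-bandit planner
# with each candidate reward and report the resulting utility per subpopulation.
# best = adjudicate(llm_generated_candidates, evaluate, welfare=egalitarian)
```

Swapping `welfare=utilitarian` for `welfare=egalitarian` is what makes the tradeoff configurable and transparent: the selection criterion lives outside the LLM, so users can inspect and change it directly.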