{"title":"平衡法:LLM 设计的不安定强盗奖励的优先级策略","authors":"Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe","doi":"arxiv-2408.12112","DOIUrl":null,"url":null,"abstract":"LLMs are increasingly used to design reward functions based on human\npreferences in Reinforcement Learning (RL). We focus on LLM-designed rewards\nfor Restless Multi-Armed Bandits, a framework for allocating limited resources\namong agents. In applications such as public health, this approach empowers\ngrassroots health workers to tailor automated allocation decisions to community\nneeds. In the presence of multiple agents, altering the reward function based\non human preferences can impact subpopulations very differently, leading to\ncomplex tradeoffs and a multi-objective resource allocation problem. We are the\nfirst to present a principled method termed Social Choice Language Model for\ndealing with these tradeoffs for LLM-designed rewards for multiagent planners\nin general and restless bandits in particular. The novel part of our model is a\ntransparent and configurable selection component, called an adjudicator,\nexternal to the LLM that controls complex tradeoffs via a user-selected social\nwelfare function. Our experiments demonstrate that our model reliably selects\nmore effective, aligned, and balanced reward functions compared to purely\nLLM-based approaches.","PeriodicalId":501315,"journal":{"name":"arXiv - CS - Multiagent Systems","volume":"8 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards\",\"authors\":\"Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe\",\"doi\":\"arxiv-2408.12112\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"LLMs are increasingly used to design reward functions based on human\\npreferences in Reinforcement Learning (RL). We focus on LLM-designed rewards\\nfor Restless Multi-Armed Bandits, a framework for allocating limited resources\\namong agents. In applications such as public health, this approach empowers\\ngrassroots health workers to tailor automated allocation decisions to community\\nneeds. In the presence of multiple agents, altering the reward function based\\non human preferences can impact subpopulations very differently, leading to\\ncomplex tradeoffs and a multi-objective resource allocation problem. We are the\\nfirst to present a principled method termed Social Choice Language Model for\\ndealing with these tradeoffs for LLM-designed rewards for multiagent planners\\nin general and restless bandits in particular. The novel part of our model is a\\ntransparent and configurable selection component, called an adjudicator,\\nexternal to the LLM that controls complex tradeoffs via a user-selected social\\nwelfare function. 
Our experiments demonstrate that our model reliably selects\\nmore effective, aligned, and balanced reward functions compared to purely\\nLLM-based approaches.\",\"PeriodicalId\":501315,\"journal\":{\"name\":\"arXiv - CS - Multiagent Systems\",\"volume\":\"8 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multiagent Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.12112\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multiagent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.12112","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards
LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can impact subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We are the first to present a principled method, termed the Social Choice Language Model, for dealing with these tradeoffs for LLM-designed rewards for multiagent planners in general and restless bandits in particular. The novel part of our model is a transparent, configurable selection component, called an adjudicator, external to the LLM, which controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions than purely LLM-based approaches.
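
To make the adjudicator concrete, here is a minimal sketch, not the authors' released code: the LLM proposes several candidate reward functions, each candidate is scored by simulating the resulting bandit policy to obtain per-subpopulation utilities, and a user-selected social welfare function picks the winner. All names below are illustrative assumptions, and the welfare menu (utilitarian, egalitarian, Nash) is a standard set from social choice theory; the paper's exact choices may differ.

```python
import math
from typing import Callable, Dict, List

# Assumed menu of social welfare functions over per-subpopulation utilities.
WELFARE_FUNCTIONS: Dict[str, Callable[[List[float]], float]] = {
    "utilitarian": sum,  # total utility across groups
    "egalitarian": min,  # utility of the worst-off group
    "nash": lambda u: sum(math.log(max(x, 1e-9)) for x in u),  # log of the Nash product
}

def adjudicate(
    candidates: List[Callable],                   # reward functions proposed by the LLM
    evaluate: Callable[[Callable], List[float]],  # simulator: reward fn -> per-group utilities
    welfare: str = "egalitarian",                 # user-selected social welfare function
) -> Callable:
    """Return the candidate reward that maximizes the chosen welfare function
    (an illustrative sketch of a selector external to the LLM, per the abstract)."""
    swf = WELFARE_FUNCTIONS[welfare]
    return max(candidates, key=lambda r: swf(evaluate(r)))

# Toy usage with a stubbed-out simulator (hypothetical rewards and utilities).
if __name__ == "__main__":
    candidates = [lambda s: s["need"], lambda s: s["need"] ** 2]
    fake_utilities = {0: [0.9, 0.2, 0.5], 1: [0.6, 0.5, 0.55]}  # stub simulation output
    evaluate = lambda r: fake_utilities[candidates.index(r)]
    best = adjudicate(candidates, evaluate, welfare="egalitarian")
    print(candidates.index(best))  # -> 1: this reward protects the worst-off subpopulation
```

Because the selection step sits outside the LLM, swapping "egalitarian" for "utilitarian" changes the tradeoff transparently and without re-prompting: a program that prioritizes the worst-off community keeps the egalitarian objective, while one maximizing total benefit switches to the utilitarian one.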