Multi-Objective Reinforcement Learning with Non-Linear Scalarization

Mridul Agarwal, V. Aggarwal, Tian Lan
{"title":"Multi-Objective Reinforcement Learning with Non-Linear Scalarization","authors":"Mridul Agarwal, V. Aggarwal, Tian Lan","doi":"10.5555/3535850.3535853","DOIUrl":null,"url":null,"abstract":"Multi-Objective Reinforcement Learning (MORL) setup naturally arises in many places where an agent optimizes multiple objectives. We consider the problem of MORL where multiple objectives are combined using a non-linear scalarization. We combine the vector objectives with a concave scalarization function and maximize this scalar objective. To work with the non-linear scalarization, in this paper, we propose a solution using steady-state occupancy measures and long-term average rewards. We show that when the scalarization function is element-wise increasing, the optimal policy for the scalarization is also Pareto optimal. To maximize the scalarized objective, we propose a model-based posterior sampling algorithm. Using a novel Bellman error analysis for infinite horizon MDPs based proof, we show that the proposed algorithm obtains a regret bound of ˜ O ( LKDS (cid:112) A / T ) for K objectives, and L -Lipschitz continous scalarization function for MDP with S states, A actions, and diameter D . Additionally, we propose policy-gradient and actor-critic algorithms for MORL. For the policy gradient actor, we obtain the gradient using chain rule, and we learn different critics for each of the K objectives. Finally, we implement our algorithms on multiple environments including deep-sea treasure, and network scheduling setups to demonstrate that the proposed algorithms can optimize non-linear scalarization of multiple objectives.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Adaptive Agents and Multi-Agent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5555/3535850.3535853","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

The Multi-Objective Reinforcement Learning (MORL) setting arises naturally whenever an agent optimizes multiple objectives. We consider the MORL problem in which multiple objectives are combined using a non-linear scalarization: we combine the vector of objectives with a concave scalarization function and maximize the resulting scalar objective. To handle the non-linear scalarization, we propose a solution based on steady-state occupancy measures and long-term average rewards. We show that when the scalarization function is element-wise increasing, the policy that is optimal for the scalarization is also Pareto optimal. To maximize the scalarized objective, we propose a model-based posterior sampling algorithm. Using a novel proof based on a Bellman error analysis for infinite-horizon MDPs, we show that the proposed algorithm achieves a regret bound of Õ(LKDS√(A/T)) for K objectives and an L-Lipschitz continuous scalarization function, for an MDP with S states, A actions, and diameter D. Additionally, we propose policy-gradient and actor-critic algorithms for MORL. For the policy-gradient actor, we obtain the gradient using the chain rule, learning a separate critic for each of the K objectives. Finally, we evaluate our algorithms on multiple environments, including deep-sea treasure and network-scheduling setups, demonstrating that they can optimize a non-linear scalarization of multiple objectives.
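To make the occupancy-measure formulation mentioned in the abstract concrete, the standard average-reward version of such a problem can be written as follows. This is a sketch in our own notation (r_k denotes the k-th reward function and P the transition kernel); the exact program in the paper may differ:

```latex
% Scalarized average-reward objective over steady-state occupancy
% measures \rho(s,a); standard LP-style form, notation ours.
\[
  \max_{\rho \ge 0}\;
  f\!\left( \sum_{s,a} \rho(s,a)\, r_1(s,a),\ \dots,\ \sum_{s,a} \rho(s,a)\, r_K(s,a) \right)
\]
subject to
\[
  \sum_{a} \rho(s',a) \;=\; \sum_{s,a} \rho(s,a)\, P(s' \mid s,a) \quad \forall s',
  \qquad
  \sum_{s,a} \rho(s,a) \;=\; 1 .
\]
```

Since f is concave and element-wise increasing while the occupancy-measure constraints are linear, the scalarized problem is a concave maximization over a polytope, which is what makes the average-reward formulation attractive for non-linear scalarization.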
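Likewise, the chain-rule policy gradient described for the actor can be sketched as below. This is a minimal illustration, not the authors' implementation: scalarization, scalarization_grad, and scalarized_policy_gradient are hypothetical names, and in practice the per-objective gradients would come from the K learned critics rather than being given directly.

```python
import numpy as np

# A minimal sketch of the chain-rule policy gradient for a non-linear
# scalarization f of K long-term average rewards J_1, ..., J_K.
# All names here are illustrative, not from the paper.

def scalarization(J):
    # Example concave, element-wise increasing scalarization:
    # proportional fairness, f(J) = sum_k log(J_k).
    return np.sum(np.log(J))

def scalarization_grad(J):
    # Gradient of f w.r.t. the vector of objective values:
    # df/dJ_k = 1 / J_k for the log example above.
    return 1.0 / J

def scalarized_policy_gradient(J, per_objective_grads):
    """Chain rule: grad_theta f(J) = sum_k (df/dJ_k) * grad_theta J_k.

    J                   -- shape (K,), current estimates of each objective
    per_objective_grads -- shape (K, P), gradient of each J_k w.r.t. the
                           P policy parameters (one critic per objective)
    """
    w = scalarization_grad(J)       # (K,) weights from the outer function
    return w @ per_objective_grads  # (P,) gradient of the scalar objective

# Toy usage: K = 2 objectives, P = 3 policy parameters.
J = np.array([0.5, 0.8])
grads = np.array([[0.1, -0.2, 0.05],
                  [0.3,  0.0, -0.1]])
print(scalarized_policy_gradient(J, grads))
```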