Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization

L. N. Alegre, A. Bazzan, Diederik M. Roijers, Ann Nowé, Bruno C. da Silva
{"title":"Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization","authors":"L. N. Alegre, A. Bazzan, Diederik M. Roijers, Ann Now'e, Bruno C. da Silva","doi":"10.48550/arXiv.2301.07784","DOIUrl":null,"url":null,"abstract":"Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $\\epsilon$-optimal solution (for a bounded $\\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state and action spaces.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Adaptive Agents and Multi-Agent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2301.07784","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $\epsilon$-optimal solution (for a bounded $\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state and action spaces.
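For intuition, multi-objective GPI evaluates a preference vector $\mathbf{w}$ by taking the best scalarized value achievable with the set $\Pi$ of previously learned policies, $\max_{\pi \in \Pi} \max_{a} \mathbf{w}^\top \mathbf{q}^\pi(s,a)$, and the prioritization schemes described in the abstract steer training toward preferences where this GPI value still falls furthest short of an optimistic estimate of the optimal value. The sketch below illustrates that idea only; the names (`gpi_value`, `prioritize`), the fixed-state tabular setup, and the placeholder upper bounds are assumptions made for illustration, not the authors' implementation.

```python
# Minimal, illustrative sketch of GPI-based preference prioritization
# (hypothetical setup; not the paper's actual algorithm or code).
import numpy as np

rng = np.random.default_rng(0)
n_policies, n_actions, n_objectives = 4, 3, 2

# q_values[i, a] stands in for the multi-objective action-value vector
# q^{pi_i}(s0, a) of a previously learned policy pi_i at a fixed state s0.
q_values = rng.normal(size=(n_policies, n_actions, n_objectives))

def gpi_value(w):
    """Scalarized GPI value at s0 for preference w:
    max over policies i and actions a of w^T q^{pi_i}(s0, a)."""
    return float(np.max(q_values @ w))

def prioritize(candidate_ws, upper_bounds):
    """Pick the candidate preference with the largest gap between an
    (assumed) optimistic bound on its optimal value and its current GPI
    value, i.e. the preference where further training should help most."""
    gaps = [ub - gpi_value(w) for w, ub in zip(candidate_ws, upper_bounds)]
    return candidate_ws[int(np.argmax(gaps))]

candidates = [np.array([0.2, 0.8]), np.array([0.5, 0.5]), np.array([0.9, 0.1])]
bounds = [1.5, 1.5, 1.5]  # placeholder optimistic value estimates
print("Train next on preference:", prioritize(candidates, bounds))
```

In the paper these quantities are maintained across states and updated as new policies are learned; here a single state and random values keep the example self-contained.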