{"title":"Multi-Objective Reinforcement Learning with Non-Linear Scalarization","authors":"Mridul Agarwal, V. Aggarwal, Tian Lan","doi":"10.5555/3535850.3535853","DOIUrl":null,"url":null,"abstract":"Multi-Objective Reinforcement Learning (MORL) setup naturally arises in many places where an agent optimizes multiple objectives. We consider the problem of MORL where multiple objectives are combined using a non-linear scalarization. We combine the vector objectives with a concave scalarization function and maximize this scalar objective. To work with the non-linear scalarization, in this paper, we propose a solution using steady-state occupancy measures and long-term average rewards. We show that when the scalarization function is element-wise increasing, the optimal policy for the scalarization is also Pareto optimal. To maximize the scalarized objective, we propose a model-based posterior sampling algorithm. Using a novel Bellman error analysis for infinite horizon MDPs based proof, we show that the proposed algorithm obtains a regret bound of ˜ O ( LKDS (cid:112) A / T ) for K objectives, and L -Lipschitz continous scalarization function for MDP with S states, A actions, and diameter D . Additionally, we propose policy-gradient and actor-critic algorithms for MORL. For the policy gradient actor, we obtain the gradient using chain rule, and we learn different critics for each of the K objectives. Finally, we implement our algorithms on multiple environments including deep-sea treasure, and network scheduling setups to demonstrate that the proposed algorithms can optimize non-linear scalarization of multiple objectives.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Adaptive Agents and Multi-Agent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5555/3535850.3535853","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 13
Abstract
The Multi-Objective Reinforcement Learning (MORL) setup arises naturally in many settings where an agent optimizes multiple objectives. We consider the MORL problem in which multiple objectives are combined using a non-linear scalarization: the vector of objectives is combined through a concave scalarization function, and this scalar objective is maximized. To handle the non-linear scalarization, we propose a solution based on steady-state occupancy measures and long-term average rewards. We show that when the scalarization function is element-wise increasing, the policy that is optimal for the scalarized objective is also Pareto optimal. To maximize the scalarized objective, we propose a model-based posterior sampling algorithm. Using a proof based on a novel Bellman error analysis for infinite-horizon MDPs, we show that the proposed algorithm obtains a regret bound of $\tilde{O}(LKDS\sqrt{A/T})$ for $K$ objectives and an $L$-Lipschitz continuous scalarization function, on an MDP with $S$ states, $A$ actions, and diameter $D$. Additionally, we propose policy-gradient and actor-critic algorithms for MORL. For the policy-gradient actor, we obtain the gradient using the chain rule, and we learn a separate critic for each of the $K$ objectives. Finally, we evaluate our algorithms on multiple environments, including deep-sea treasure and network scheduling setups, to demonstrate that the proposed algorithms can optimize a non-linear scalarization of multiple objectives.
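For concreteness, the occupancy-measure formulation the abstract describes can be sketched as follows; the notation ($\mu$ for the steady-state occupancy measure, $r_k$ for per-objective rewards) is our own illustrative shorthand, not taken verbatim from the paper:

```latex
% Illustrative formulation (our notation). \mu(s,a) is a steady-state
% occupancy measure; the long-term average reward of objective k is
% linear in \mu, and a concave scalarization f of the K objectives is
% maximized subject to stationarity constraints.
\begin{align*}
  J_k(\mu) &= \sum_{s,a} \mu(s,a)\, r_k(s,a), \qquad k = 1, \dots, K, \\
  \max_{\mu \ge 0}\;\; & f\bigl(J_1(\mu), \dots, J_K(\mu)\bigr) \\
  \text{s.t.}\;\; & \sum_{a} \mu(s',a) = \sum_{s,a} \mu(s,a)\, P(s' \mid s, a)
  \quad \forall s', \qquad \sum_{s,a} \mu(s,a) = 1.
\end{align*}
% Each J_k is linear in \mu and f is concave, so the program is concave;
% since f is element-wise increasing, a maximizer is also Pareto optimal,
% matching the claim in the abstract.
```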
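The chain-rule policy gradient the abstract mentions decomposes as $\nabla_\theta f(J_1, \dots, J_K) = \sum_k (\partial f / \partial J_k)\, \nabla_\theta J_k$, with each $\nabla_\theta J_k$ estimated using its own critic. Below is a minimal sketch of that decomposition; the names (`scalarized_policy_gradient`, the log-sum scalarization) and the shapes are our assumptions for illustration, not the authors' released code:

```python
import numpy as np

def scalarized_policy_gradient(f_grad, critic_values, objective_grads):
    """Chain-rule gradient of a scalarized MORL objective (illustrative).

    f_grad         : callable returning df/dJ as a length-K array,
                     evaluated at the current objective values.
    critic_values  : length-K array of per-objective average-reward
                     estimates J_k, one from each critic.
    objective_grads: (K, d) array whose k-th row estimates grad_theta J_k
                     (e.g., a REINFORCE-style estimate using critic k).
    Returns the d-dimensional gradient of f(J_1, ..., J_K) w.r.t. theta.
    """
    weights = f_grad(critic_values)       # shape (K,): df/dJ_k
    return weights @ objective_grads      # sum_k (df/dJ_k) * grad_theta J_k

# Example with an assumed concave scalarization f(J) = sum_k log(J_k).
f_grad = lambda J: 1.0 / J
J = np.array([0.5, 0.8])                          # K = 2 critic estimates
g = np.stack([np.array([0.1, -0.2, 0.3]),         # grad_theta J_1 (d = 3)
              np.array([0.0,  0.4, 0.1])])        # grad_theta J_2
print(scalarized_policy_gradient(f_grad, J, g))
```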