MoVEMo: A Structured Approach for Engineering Reward Functions
Piergiuseppe Mallozzi, Raúl Pardo, Vincent Duplessis, Patrizio Pelliccione, G. Schneider
2018 Second IEEE International Conference on Robotic Computing (IRC)
DOI: 10.1109/IRC.2018.00053
Citations: 6
Abstract
Reinforcement learning (RL) is a machine learning technique that has been increasingly used in robotic systems. In reinforcement learning, instead of manually pre-programming what action to take at each step, we convey the goal to a software agent in terms of a reward function. The agent tries different actions in order to maximize a numerical value, i.e., the reward. A misspecified reward function can cause problems such as reward hacking, where the agent finds ways to maximize the reward without achieving the intended goal. As RL agents become more general and autonomous, designing reward functions that elicit the desired behaviour in the agent becomes more important and more cumbersome. In this paper, we present a technique to formally express reward functions in a structured way; this encourages proper reward function design and also enables its formal verification. We start by defining the reward function using state machines. In this way, we can statically check that the reward function satisfies certain properties, e.g., high-level requirements of the function to be learned. We then automatically generate a runtime monitor, which runs in parallel with the learning agent and provides rewards according to the definition of the state machine and the behaviour of the agent. We use the UPPAAL model checker to design the reward model and verify TCTL properties that capture high-level requirements of the reward function, and LARVA to monitor and enforce the reward model on the RL agent at runtime.
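The paper's actual artifacts are UPPAAL timed automata checked against TCTL properties and a LARVA-generated runtime monitor; the sketch below is only a minimal Python illustration of the underlying idea, with hypothetical states, event names, and reward values chosen for the example. It shows a reward function expressed as a small state machine: the monitor observes the agent's events, advances its state accordingly, and emits rewards only on the transitions the state machine allows, so the agent cannot collect the goal reward without first passing the checkpoint.

```python
from enum import Enum, auto

class Phase(Enum):
    # Hypothetical phases of the task; not taken from the paper.
    START = auto()
    CHECKPOINT_REACHED = auto()
    GOAL_REACHED = auto()

class RewardMonitor:
    """Minimal state-machine reward monitor (illustrative sketch only).

    Rewards are emitted only on transitions defined by the state machine,
    which rules out the reward-hacking path of reaching the goal without
    visiting the checkpoint first.
    """

    def __init__(self):
        self.phase = Phase.START

    def observe(self, event: str) -> float:
        """Advance the state machine on an agent event and return the reward."""
        if self.phase is Phase.START and event == "at_checkpoint":
            self.phase = Phase.CHECKPOINT_REACHED
            return 1.0    # small reward for reaching the checkpoint
        if self.phase is Phase.CHECKPOINT_REACHED and event == "at_goal":
            self.phase = Phase.GOAL_REACHED
            return 10.0   # goal reward is only reachable via the checkpoint
        return 0.0        # any other event yields no reward

if __name__ == "__main__":
    # The monitor runs alongside the learning loop and supplies the rewards.
    monitor = RewardMonitor()
    for event in ["move", "at_goal", "at_checkpoint", "at_goal"]:
        print(event, monitor.observe(event))
    # "at_goal" before the checkpoint yields 0.0, blocking that shortcut.
```

In the approach described in the abstract, the analogous state machine is modelled in UPPAAL, where properties such as reachability of the goal-rewarding state can be verified before learning, and the runtime monitor enforcing it is generated with LARVA rather than hand-written as above.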