MoVEMo: A Structured Approach for Engineering Reward Functions

Piergiuseppe Mallozzi, Raúl Pardo, Vincent Duplessis, Patrizio Pelliccione, G. Schneider
{"title":"MoVEMo: A Structured Approach for Engineering Reward Functions","authors":"Piergiuseppe Mallozzi, Raúl Pardo, Vincent Duplessis, Patrizio Pelliccione, G. Schneider","doi":"10.1109/IRC.2018.00053","DOIUrl":null,"url":null,"abstract":"Reinforcement learning (RL) is a machine learning technique that has been increasingly used in robotic systems. In reinforcement learning, instead of manually pre-program what action to take at each step, we convey the goal a software agent in terms of reward functions. The agent tries different actions in order to maximize a numerical value, i.e. the reward. A misspecified reward function can cause problems such as reward hacking, where the agent finds out ways that maximize the reward without achieving the intended goal. As RL agents become more general and autonomous, the design of reward functions that elicit the desired behaviour in the agent becomes more important and cumbersome. In this paper, we present a technique to formally express reward functions in a structured way; this stimulates a proper reward function design and as well enables the formal verification of it. We start by defining the reward function using state machines. In this way, we can statically check that the reward function satisfies certain properties, e.g., high-level requirements of the function to learn. Later we automatically generate a runtime monitor — which runs in parallel with the learning agent — that provides the rewards according to the definition of the state machine and based on the behaviour of the agent. We use the UPPAAL model checker to design the reward model and verify the TCTL properties that model high-level requirements of the reward function and LARVA to monitor and enforce the reward model to the RL agent at runtime.","PeriodicalId":416113,"journal":{"name":"2018 Second IEEE International Conference on Robotic Computing (IRC)","volume":"132 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Second IEEE International Conference on Robotic Computing (IRC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IRC.2018.00053","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Reinforcement learning (RL) is a machine learning technique that is increasingly used in robotic systems. In reinforcement learning, instead of manually pre-programming what action to take at each step, we convey the goal to a software agent in terms of a reward function. The agent tries different actions in order to maximize a numerical value, i.e., the reward. A misspecified reward function can cause problems such as reward hacking, where the agent finds ways to maximize the reward without achieving the intended goal. As RL agents become more general and autonomous, designing reward functions that elicit the desired behaviour in the agent becomes both more important and more cumbersome. In this paper, we present a technique to formally express reward functions in a structured way; this encourages proper reward function design and also enables its formal verification. We start by defining the reward function using state machines. In this way, we can statically check that the reward function satisfies certain properties, e.g., high-level requirements of the function to be learned. We then automatically generate a runtime monitor, which runs in parallel with the learning agent and provides rewards according to the state machine definition and the observed behaviour of the agent. We use the UPPAAL model checker to design the reward model and to verify TCTL properties that capture high-level requirements of the reward function, and LARVA to monitor and enforce the reward model on the RL agent at runtime.
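To make the core idea concrete, the sketch below shows a reward function expressed as a small state machine whose transitions emit rewards as the agent's behaviour is observed. This is an illustration only, not the paper's actual UPPAAL/LARVA toolchain: the class name `RewardStateMachine`, the states, the events, and the reward values are all hypothetical examples.

```python
# Sketch: a reward function defined as a state machine, run alongside the agent.
# (Illustrative only; the paper designs the model in UPPAAL and monitors it with LARVA.)

from dataclasses import dataclass


@dataclass
class RewardStateMachine:
    """Reward function as labelled transitions: (state, event) -> (next_state, reward)."""
    transitions: dict
    state: str = "start"
    default_reward: float = 0.0

    def step(self, event: str) -> float:
        """Advance on an observed agent event and return the corresponding reward."""
        key = (self.state, event)
        if key in self.transitions:
            self.state, reward = self.transitions[key]
            return reward
        return self.default_reward  # behaviour outside the model earns no reward


# Hypothetical patrolling task: the agent is rewarded only for visiting
# checkpoints in the intended order, which discourages reward-hacking shortcuts.
reward_fsm = RewardStateMachine(
    transitions={
        ("start", "reached_A"): ("at_A", 1.0),
        ("at_A", "reached_B"): ("at_B", 1.0),
        ("at_B", "reached_goal"): ("done", 10.0),
        ("at_A", "reached_goal"): ("at_A", -1.0),  # skipping B is penalised
    }
)

if __name__ == "__main__":
    for event in ["reached_A", "reached_goal", "reached_B", "reached_goal"]:
        r = reward_fsm.step(event)
        print(f"event={event:13s} state={reward_fsm.state:6s} reward={r}")
```

Because the reward function is an explicit transition system rather than ad hoc code scattered through the environment, properties such as "the large terminal reward is only reachable after both checkpoints" can be checked statically on the model before training, which is the role UPPAAL and TCTL play in the paper.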