MoVEMo: A Structured Approach for Engineering Reward Functions
Piergiuseppe Mallozzi, Raúl Pardo, Vincent Duplessis, Patrizio Pelliccione, G. Schneider
2018 Second IEEE International Conference on Robotic Computing (IRC)
DOI: 10.1109/IRC.2018.00053
Citations: 6
Abstract
Reinforcement learning (RL) is a machine learning technique that has been increasingly used in robotic systems. In reinforcement learning, instead of manually pre-programming what action to take at each step, we convey the goal to a software agent in terms of a reward function. The agent tries different actions in order to maximize a numerical value, i.e., the reward. A misspecified reward function can cause problems such as reward hacking, where the agent finds ways to maximize the reward without achieving the intended goal. As RL agents become more general and autonomous, designing reward functions that elicit the desired behaviour in the agent becomes more important and more cumbersome. In this paper, we present a technique to formally express reward functions in a structured way; this encourages proper reward function design and also enables its formal verification. We start by defining the reward function using state machines. In this way, we can statically check that the reward function satisfies certain properties, e.g., high-level requirements of the function to be learned. We then automatically generate a runtime monitor, which runs in parallel with the learning agent and provides rewards according to the definition of the state machine and the behaviour of the agent. We use the UPPAAL model checker to design the reward model and verify TCTL properties that capture high-level requirements of the reward function, and LARVA to monitor and enforce the reward model on the RL agent at runtime.
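The paper's actual artifacts are UPPAAL timed automata checked against TCTL properties and a LARVA-generated runtime monitor; the sketch below is only a minimal Python illustration of the underlying idea, with hypothetical states, event names, and reward values chosen for the example. It shows a reward function expressed as a small state machine: the monitor observes the agent's events, advances its state accordingly, and emits rewards only on the transitions the state machine allows, so the agent cannot collect the goal reward without first passing the checkpoint.

```python
from enum import Enum, auto

class Phase(Enum):
    # Hypothetical phases of the task; not taken from the paper.
    START = auto()
    CHECKPOINT_REACHED = auto()
    GOAL_REACHED = auto()

class RewardMonitor:
    """Minimal state-machine reward monitor (illustrative sketch only).

    Rewards are emitted only on transitions defined by the state machine,
    which rules out the reward-hacking path of reaching the goal without
    visiting the checkpoint first.
    """

    def __init__(self):
        self.phase = Phase.START

    def observe(self, event: str) -> float:
        """Advance the state machine on an agent event and return the reward."""
        if self.phase is Phase.START and event == "at_checkpoint":
            self.phase = Phase.CHECKPOINT_REACHED
            return 1.0    # small reward for reaching the checkpoint
        if self.phase is Phase.CHECKPOINT_REACHED and event == "at_goal":
            self.phase = Phase.GOAL_REACHED
            return 10.0   # goal reward is only reachable via the checkpoint
        return 0.0        # any other event yields no reward

if __name__ == "__main__":
    # The monitor runs alongside the learning loop and supplies the rewards.
    monitor = RewardMonitor()
    for event in ["move", "at_goal", "at_checkpoint", "at_goal"]:
        print(event, monitor.observe(event))
    # "at_goal" before the checkpoint yields 0.0, blocking that shortcut.
```

In the approach described in the abstract, the analogous state machine is modelled in UPPAAL, where properties such as reachability of the goal-rewarding state can be verified before learning, and the runtime monitor enforcing it is generated with LARVA rather than hand-written as above.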