Continuous Deep Maximum Entropy Inverse Reinforcement Learning using online POMDP

Júnior A. R. Silva, V. Grassi, D. Wolf
{"title":"Continuous Deep Maximum Entropy Inverse Reinforcement Learning using online POMDP","authors":"Júnior A. R. Silva, V. Grassi, D. Wolf","doi":"10.1109/ICAR46387.2019.8981548","DOIUrl":null,"url":null,"abstract":"A vehicle navigating in an urban environment must obey traffic rules by properly setting its speed, such as staying below the road speed limit and avoiding collision with other vehicles. This is presumably the scenario that autonomous vehicles will face: they will share the traffic roads with other vehicles (autonomous or not), cooperatively interacting with them. In other words, autonomous vehicles should not only follow traffic rules, but should also behave in such a way that resembles other vehicles behavior. However, manually specification of such behavior is a time-consuming and error-prone task, since driving in urban roads is a complex task, which involves many factors. This paper presents a multitask decision making framework that learns an expert driver's behavior driving in an urban scenario containing traffic lights and other vehicles. For this purpose, Inverse Reinforcement Learning (IRL) is used to learn a reward function that explains the expert driver's behavior. Most IRL approaches require solving a Markov Decision Process (MDP) in each iteration of the algorithm to compute the optimal policy given the current rewards. Nevertheless, the computational cost of solving an MDP is high when considering large state spaces. To overcome this issue, the optimal policy is estimated by sampling trajectories in regions of the space with higher rewards. To do so, the problem is modeled as a continuous Partially Observed Markov Decision Process (POMDP), in which the intentions of other vehicles are only partially observed. An online solver is employed in order to sample trajectories given the current rewards. The efficiency of the proposed framework is demonstrated through simulations, showing that the controlled vehicle is be able to mimic an expert driver's behavior.","PeriodicalId":6606,"journal":{"name":"2019 19th International Conference on Advanced Robotics (ICAR)","volume":"20 1","pages":"382-387"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th International Conference on Advanced Robotics (ICAR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICAR46387.2019.8981548","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

A vehicle navigating in an urban environment must obey traffic rules by properly setting its speed, such as staying below the road speed limit and avoiding collisions with other vehicles. This is presumably the scenario that autonomous vehicles will face: they will share the road with other vehicles (autonomous or not), cooperatively interacting with them. In other words, autonomous vehicles should not only follow traffic rules, but should also behave in a way that resembles the behavior of other vehicles. However, manually specifying such behavior is time-consuming and error-prone, since driving on urban roads is a complex task that involves many factors. This paper presents a multitask decision-making framework that learns an expert driver's behavior when driving in an urban scenario containing traffic lights and other vehicles. For this purpose, Inverse Reinforcement Learning (IRL) is used to learn a reward function that explains the expert driver's behavior. Most IRL approaches require solving a Markov Decision Process (MDP) in each iteration of the algorithm to compute the optimal policy given the current rewards. However, the computational cost of solving an MDP is high when considering large state spaces. To overcome this issue, the optimal policy is estimated by sampling trajectories in regions of the space with higher rewards. To do so, the problem is modeled as a continuous Partially Observable Markov Decision Process (POMDP), in which the intentions of other vehicles are only partially observed. An online solver is employed to sample trajectories given the current rewards. The efficiency of the proposed framework is demonstrated through simulations, showing that the controlled vehicle is able to mimic an expert driver's behavior.
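To make the sampling-based idea in the abstract concrete, below is a minimal sketch of a Maximum Entropy IRL update in which the expert's feature expectations are matched against feature expectations estimated from sampled trajectories, rather than from an exact MDP solution. The names (`feature_fn`, `sample_trajectories`), the linear reward parameterization, and the plain gradient step are illustrative assumptions, not the paper's implementation, which uses a deep reward network and an online POMDP solver to draw the samples.

```python
import numpy as np

def feature_expectations(trajectories, feature_fn):
    """Average accumulated feature counts over a set of (state, action) trajectories."""
    total = None
    for traj in trajectories:
        phi = sum(feature_fn(s, a) for s, a in traj)
        total = phi if total is None else total + phi
    return total / len(trajectories)

def maxent_irl_step(theta, expert_trajs, sample_trajectories, feature_fn, lr=0.01):
    """One gradient-ascent step on the MaxEnt IRL objective for a linear reward r = theta . phi.

    Instead of solving an MDP exactly to obtain the learner's feature expectations,
    trajectories are drawn by a planner-based sampler given the current reward
    parameters `theta`, mirroring the paper's use of an online POMDP solver to
    concentrate samples in high-reward regions of the space.
    """
    expert_fe = feature_expectations(expert_trajs, feature_fn)
    sampled = sample_trajectories(theta)      # hypothetical: rollouts from an online planner
    learner_fe = feature_expectations(sampled, feature_fn)
    grad = expert_fe - learner_fe             # MaxEnt IRL gradient for a linear reward
    return theta + lr * grad
```

Under this sketch, each iteration only requires generating a batch of rollouts under the current reward, which is what makes the approach tractable for large continuous state spaces compared with solving the full MDP at every step.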