Dynamic Programming vs Q-learning for Feedback Motion Planning of Manipulators

U. Yıldıran
{"title":"Dynamic Programming vs Q-learning for Feedback Motion Planning of Manipulators","authors":"U. Yıldıran","doi":"10.1109/HORA58378.2023.10155782","DOIUrl":null,"url":null,"abstract":"Reinforcement Learning (RL) based methods have became popular for control and motion planning of robots, recently. Unlike sampling based motion planners, optimal policies computed by them provide feedback motion plans which eliminates the need for re-computing (optimal) trajectories when a robot starts from a different initial configuration each time. In related studies, an optimal policy (actor) and the associated value function (critic) are usually calculated preforming training in a simulation environment. During training, RL allows learning by interactions with the environment in a physically realistic manner. However, in a simulation system, it is possible to make physically unimplementable moves. Thus, instead of RL, one can make use of Dynamic Programming approaches such as Value Iteration for computing optimal policies, which does not require an exploration component and known to have better convergence properties. In addition, dimension of a value function is smaller than that of a Q-fuction, thereby lessening the severity of the curse of dimensionality. Motivated by these facts, the aim of this paper is to employ Value Iteration algorithm for motion planning of robot manipulators and elaborate its effectiveness compared to a popular RL method, Q-learning.","PeriodicalId":247679,"journal":{"name":"2023 5th International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA)","volume":"33 5‐6","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 5th International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HORA58378.2023.10155782","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Reinforcement Learning (RL) based methods have recently become popular for the control and motion planning of robots. Unlike sampling-based motion planners, the optimal policies they compute provide feedback motion plans, which eliminates the need to re-compute (optimal) trajectories whenever a robot starts from a different initial configuration. In related studies, an optimal policy (actor) and the associated value function (critic) are usually obtained by performing training in a simulation environment. During training, RL allows learning through interactions with the environment in a physically realistic manner. However, in a simulation system it is possible to make moves that are physically unimplementable. Thus, instead of RL, one can make use of Dynamic Programming approaches such as Value Iteration to compute optimal policies; Value Iteration does not require an exploration component and is known to have better convergence properties. In addition, the dimension of a value function is smaller than that of a Q-function, which lessens the severity of the curse of dimensionality. Motivated by these facts, the aim of this paper is to employ the Value Iteration algorithm for motion planning of robot manipulators and to evaluate its effectiveness compared to a popular RL method, Q-learning.
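To make the comparison concrete, below is a minimal sketch, not taken from the paper, of tabular Value Iteration and tabular Q-learning on a toy discretized state space. The 5x5 grid, the deterministic `step` dynamics, the unit step cost, and all hyperparameters (gamma, alpha, epsilon, episode counts) are hypothetical stand-ins for a gridded manipulator configuration space; the actual problem setup used in the paper may differ.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# cost-minimizing tabular Value Iteration vs. tabular Q-learning.
import numpy as np

n_states, n_actions = 25, 4      # hypothetical 5x5 grid over configurations
goal = n_states - 1
gamma = 0.95
rng = np.random.default_rng(0)

def step(s, a):
    """Toy deterministic dynamics: move one cell within the 5x5 grid."""
    row, col = divmod(s, 5)
    if a == 0:   row = max(row - 1, 0)
    elif a == 1: row = min(row + 1, 4)
    elif a == 2: col = max(col - 1, 0)
    else:        col = min(col + 1, 4)
    s_next = row * 5 + col
    cost = 0.0 if s_next == goal else 1.0
    return s_next, cost

# --- Value Iteration: full Bellman sweeps over all states, no exploration ---
V = np.zeros(n_states)
for _ in range(200):
    V_new = np.empty_like(V)
    for s in range(n_states):
        V_new[s] = min(c + gamma * V[s_next]
                       for s_next, c in (step(s, a) for a in range(n_actions)))
    if np.max(np.abs(V_new - V)) < 1e-6:
        V = V_new
        break
    V = V_new

# --- Q-learning: learns from sampled transitions, needs epsilon-greedy exploration ---
Q = np.zeros((n_states, n_actions))   # note the extra action dimension vs. V
alpha, eps = 0.1, 0.2
for episode in range(500):
    s = int(rng.integers(n_states))
    for _ in range(100):
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmin(Q[s]))
        s_next, c = step(s, a)
        Q[s, a] += alpha * (c + gamma * np.min(Q[s_next]) - Q[s, a])
        s = s_next
        if s == goal:
            break

# Both yield a feedback plan: act greedily w.r.t. V (via the model) or w.r.t. Q.
policy_from_Q = np.argmin(Q, axis=1)
```

The sketch mirrors the points made in the abstract: the V table has only `n_states` entries while the Q table has `n_states * n_actions`, and Value Iteration performs exhaustive model-based sweeps with no exploration component, whereas Q-learning must sample transitions under an exploration scheme such as epsilon-greedy.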