Reinforcement learning with state-dependent discount factor

2013 IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL). Pub Date: 2013-11-04. DOI: 10.1109/DEVLRN.2013.6652533

N. Yoshida, E. Uchibe, K. Doya


Citations: 19

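Abstract

Conventional reinforcement learning algorithms have several parameters, called meta-parameters, that determine the characteristics of the learning process. In this study, we focus on the discount factor, which sets the time scale of the tradeoff between immediate and delayed rewards. The discount factor is usually treated as a constant, but we introduce a state-dependent discount function and a new optimization criterion for the reinforcement learning algorithm. We first derive a new algorithm under this criterion, named ExQ-learning, and prove that it converges with probability 1 to the optimal action-value function in the sense of the new criterion. We then present a framework that uses an evolutionary algorithm to optimize the discount factor and the discount function. To validate the proposed method, we conduct a simple computer simulation and show that the proposed algorithm can find an appropriate state-dependent discount function that performs better than a constant discount factor.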
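To make the idea of a state-dependent discount concrete, the following is a minimal sketch, not the paper's ExQ-learning (the abstract does not give its update rule or criterion). It is ordinary tabular Q-learning in which the constant discount is replaced by a lookup `gamma[state]`. The three-state chain, the reward values, the choice to index the discount by the current state, and all function names are assumptions made for illustration only.

```python
import random

# Hypothetical 3-state chain (an assumption, not the paper's task):
# action 0 ("stop") ends the episode with reward 1;
# action 1 ("go") moves one state to the right, and taking it in the
# last state ends the episode with reward 10.
N_STATES, N_ACTIONS = 3, 2


def step(state, action):
    """Return (next_state, reward, done) for the toy chain."""
    if action == 0:
        return state, 1.0, True
    if state == N_STATES - 1:
        return state, 10.0, True
    return state + 1, 0.0, False


def q_learning(gamma, episodes=2000, alpha=0.5, seed=0):
    """Tabular Q-learning where gamma[state] replaces the constant discount."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Uniformly random behaviour policy; Q-learning is off-policy,
            # so the table still estimates the greedy action values.
            action = rng.randrange(N_ACTIONS)
            next_state, reward, done = step(state, action)
            # Assumption: the discount is indexed by the current state.
            bootstrap = 0.0 if done else gamma[state] * max(q[next_state])
            q[state][action] += alpha * (reward + bootstrap - q[state][action])
            state = next_state
    return q


def greedy_action(q, state):
    """Index of the action with the largest estimated value in `state`."""
    return max(range(N_ACTIONS), key=lambda a: q[state][a])


q_const = q_learning(gamma=[0.9, 0.9, 0.9])    # constant discount factor
q_state = q_learning(gamma=[0.05, 0.9, 0.9])   # short-sighted in state 0 only
print(greedy_action(q_const, 0), greedy_action(q_state, 0))
```

In this toy setting, the constant discount makes "go" preferable in state 0 (the delayed reward of 10 is worth about 8.1 there), while lowering the discount in state 0 alone makes "stop" preferable (about 0.45 versus 1). This illustrates only how a per-state discount can change the greedy policy; it says nothing about the results reported in the paper.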


