Exploration versus Exploitation in Reinforcement Learning: A Stochastic Control Approach

Haoran Wang, T. Zariphopoulou, X. Zhou
{"title":"Exploration versus Exploitation in Reinforcement Learning: A Stochastic Control Approach","authors":"Haoran Wang, T. Zariphopoulou, X. Zhou","doi":"10.2139/ssrn.3316387","DOIUrl":null,"url":null,"abstract":"We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory formulation for the feature dynamics that captures repetitive learning under exploration. The resulting optimization problem is a revitalization of the classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear--quadratic (LQ) setting and deduce that the optimal feedback control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, the exploitation and exploration are captured, respectively and mutual-exclusively, by the mean and variance of the Gaussian distribution. We also find that a more random environment contains more learning opportunities in the sense that less exploration is needed. We characterize the cost of exploration, which, for the LQ case, is shown to be proportional to the entropy regularization weight and inversely proportional to the discount rate. Finally, as the weight of exploration decays to zero, we prove the convergence of the solution of the entropy-regularized LQ problem to the one of the classical LQ problem.","PeriodicalId":299310,"journal":{"name":"Econometrics: Mathematical Methods & Programming eJournal","volume":"52 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"37","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Econometrics: Mathematical Methods & Programming eJournal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2139/ssrn.3316387","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 37

Abstract

We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black-box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory formulation for the state dynamics that captures repetitive learning under exploration. The resulting optimization problem is a revitalization of classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear-quadratic (LQ) setting and deduce that the optimal feedback control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, exploitation and exploration are captured, respectively and mutually exclusively, by the mean and variance of the Gaussian distribution. We also find that a more random environment contains more learning opportunities, in the sense that less exploration is needed. We characterize the cost of exploration, which, for the LQ case, is shown to be proportional to the entropy regularization weight and inversely proportional to the discount rate. Finally, as the weight of exploration decays to zero, we prove that the solution of the entropy-regularized LQ problem converges to that of the classical LQ problem.
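
To make the objective described above concrete, the following is a minimal sketch of an entropy-regularized, relaxed-control objective of the kind the abstract refers to. The notation (action space U, reward r, relaxed control π, exploration weight λ, discount rate ρ) is illustrative rather than taken from the paper, and the paper's exact exploratory formulation may differ in its details.

```latex
% Entropy-regularized relaxed (distributional) control objective -- illustrative form.
% \pi_t(\cdot) is a density over the action space U, \lambda > 0 the exploration
% (entropy-regularization) weight, \rho > 0 the discount rate, and \mathcal{H} the
% differential entropy of the action distribution.
V(x) \;=\; \sup_{\pi}\;
\mathbb{E}\!\left[\int_0^{\infty} e^{-\rho t}
  \left( \int_{U} r\bigl(X_t^{\pi}, u\bigr)\,\pi_t(u)\,\mathrm{d}u
         \;+\; \lambda\,\mathcal{H}(\pi_t) \right) \mathrm{d}t
  \;\middle|\; X_0^{\pi} = x \right],
\qquad
\mathcal{H}(\pi_t) \;=\; -\int_{U} \pi_t(u)\,\ln \pi_t(u)\,\mathrm{d}u .
```

Read against this form, the abstract's LQ result says that the maximizing π_t is Gaussian, with the mean playing the exploitation role and the variance, governed by λ, playing the exploration role; as λ → 0 the variance vanishes and the distribution collapses onto the classical LQ feedback control, which is the convergence result stated at the end of the abstract.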