{"title":"Optimal Scheduling of Entropy Regularizer for Continuous-Time Linear-Quadratic Reinforcement Learning","authors":"Lukasz Szpruch, Tanut Treetanthiploet, Yufei Zhang","doi":"10.1137/22m1515744","DOIUrl":null,"url":null,"abstract":"SIAM Journal on Control and Optimization, Volume 62, Issue 1, Page 135-166, February 2024. <br/> Abstract. This work uses the entropy-regularized relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Herein, an agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. The noisy policies, on the one hand, explore the space and hence facilitate learning, but, on the other hand, they introduce bias by assigning a positive probability to nonoptimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularization. We study algorithms resulting from two entropy regularization formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalizes policy divergence between consecutive episodes. We focus on the finite horizon continuous-time linear-quadratic (LQ) RL problem, where a linear dynamics with unknown drift coefficients is controlled subject to quadratic costs. In this setting, both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularization, we prove that the regret, for both learning algorithms, is of the order [math] (up to a logarithmic factor) over [math] episodes, matching the best known result from the literature.","PeriodicalId":49531,"journal":{"name":"SIAM Journal on Control and Optimization","volume":null,"pages":null},"PeriodicalIF":2.2000,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SIAM Journal on Control and Optimization","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1137/22m1515744","RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Abstract
This work uses the entropy-regularized relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Herein, an agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. On the one hand, the noisy policies explore the space and hence facilitate learning; on the other hand, they introduce bias by assigning positive probability to nonoptimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularization. We study algorithms resulting from two entropy regularization formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalizes policy divergence between consecutive episodes. We focus on the finite-horizon continuous-time linear-quadratic (LQ) RL problem, where linear dynamics with unknown drift coefficients are controlled subject to quadratic costs. In this setting, both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularization, we prove that the regret, for both learning algorithms, is of the order $\sqrt{N}$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
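To make the role of the entropy regularizer concrete, the display below sketches a generic one-dimensional exploratory (entropy-regularized) LQ problem and the Gaussian form of its optimal relaxed policy. The notation ($A$, $B$, $Q$, $R$, $\sigma$, regularization weight $\rho$, feedback gain $k_t$) is illustrative and not taken from the paper; this is a hedged sketch of the standard mechanism, not the authors' exact formulation.

% Generic one-dimensional exploratory LQ sketch (notation illustrative, not the paper's).
\begin{align*}
\mathrm{d}X_t &= \Big(\int_{\mathbb{R}} (A X_t + B a)\,\pi_t(\mathrm{d}a)\Big)\,\mathrm{d}t + \sigma\,\mathrm{d}W_t,
  && \text{state driven by a relaxed (randomized) control } \pi_t, \\
J(\pi) &= \mathbb{E}\!\left[\int_0^T\!\int_{\mathbb{R}} \Big(\tfrac{Q}{2}X_t^2 + \tfrac{R}{2}a^2\Big)\,\pi_t(\mathrm{d}a)\,\mathrm{d}t
  \;+\; \rho \int_0^T\!\int_{\mathbb{R}} \pi_t(a)\ln \pi_t(a)\,\mathrm{d}a\,\mathrm{d}t\right],
  && \text{quadratic cost plus entropy penalty}, \\
\pi_t^\star(\mathrm{d}a \mid x) &= \mathcal{N}\!\big(k_t\,x,\; \rho\,R^{-1}\big)(\mathrm{d}a),
  && \text{optimal relaxed policy: Gaussian.}
\end{align*}

Pointwise minimization of the Hamiltonian plus the entropy term yields a Gibbs density proportional to $\exp\{-\tfrac{R}{2\rho}(a - k_t x)^2\}$, i.e., a Gaussian whose mean $k_t x$ is the feedback of the unregularized LQ problem and whose variance $\rho R^{-1}$ is set directly by the regularization weight. This makes explicit how the regularization strength controls the amount of execution noise, which is, per the abstract, the quantity scheduled across episodes to balance exploration against bias.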
Journal description:
SIAM Journal on Control and Optimization (SICON) publishes original research articles on the mathematics and applications of control theory and certain parts of optimization theory. Papers considered for publication must be significant at both the mathematical level and the level of applications or potential applications. Papers containing mostly routine mathematics or those with no discernible connection to control and systems theory or optimization will not be considered for publication. From time to time, the journal will also publish authoritative surveys of important subject areas in control theory and optimization whose level of maturity permits a clear and unified exposition.
The broad areas mentioned above are intended to encompass a wide range of mathematical techniques and scientific, engineering, economic, and industrial applications. These include stochastic and deterministic methods in control, estimation, and identification of systems; modeling and realization of complex control systems; the numerical analysis and related computational methodology of control processes and allied issues; and the development of mathematical theories and techniques that give new insights into old problems or provide the basis for further progress in control theory and optimization. Within the field of optimization, the journal focuses on the parts that are relevant to dynamic and control systems. Contributions to numerical methodology are also welcome in accordance with these aims, especially as related to large-scale problems and decomposition as well as to fundamental questions of convergence and approximation.