{"title":"连续时间平均回报马尔可夫决策过程的对数回归界线","authors":"Xuefeng Gao, Xun Yu Zhou","doi":"10.1137/23m1584101","DOIUrl":null,"url":null,"abstract":"SIAM Journal on Control and Optimization, Volume 62, Issue 5, Page 2529-2556, October 2024. <br/> Abstract. We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.","PeriodicalId":49531,"journal":{"name":"SIAM Journal on Control and Optimization","volume":null,"pages":null},"PeriodicalIF":2.2000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Logarithmic Regret Bounds for Continuous-Time Average-Reward Markov Decision Processes\",\"authors\":\"Xuefeng Gao, Xun Yu Zhou\",\"doi\":\"10.1137/23m1584101\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"SIAM Journal on Control and Optimization, Volume 62, Issue 5, Page 2529-2556, October 2024. <br/> Abstract. We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.\",\"PeriodicalId\":49531,\"journal\":{\"name\":\"SIAM Journal on Control and Optimization\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.2000,\"publicationDate\":\"2024-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SIAM Journal on Control and Optimization\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1137/23m1584101\",\"RegionNum\":2,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SIAM Journal on Control and Optimization","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1137/23m1584101","RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Logarithmic Regret Bounds for Continuous-Time Average-Reward Markov Decision Processes
SIAM Journal on Control and Optimization, Volume 62, Issue 5, Page 2529-2556, October 2024. Abstract. We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
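For orientation, the regret mentioned in the abstract is typically measured against the optimal long-run average reward. The formulation below is the standard one for the infinite-horizon, average-reward setting, written in assumed notation (reward rate r, state X_t, action a_t, policy π, optimal average reward ρ*) rather than quoted from the paper:

\[
R_T \;=\; T\,\rho^{*} \;-\; \mathbb{E}\!\left[\int_{0}^{T} r(X_t, a_t)\,dt\right],
\qquad
\rho^{*} \;=\; \sup_{\pi}\,\liminf_{T\to\infty}\,\frac{1}{T}\,\mathbb{E}^{\pi}\!\left[\int_{0}^{T} r(X_t, a_t)\,dt\right].
\]

Under this reading, a regret bound that is logarithmic in the horizon T means the gap between the learner's accumulated reward and T·ρ* grows only like log T, so the realized average reward approaches ρ* at a rate on the order of (log T)/T.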
About the journal:
SIAM Journal on Control and Optimization (SICON) publishes original research articles on the mathematics and applications of control theory and certain parts of optimization theory. Papers considered for publication must be significant at both the mathematical level and the level of applications or potential applications. Papers containing mostly routine mathematics or those with no discernible connection to control and systems theory or optimization will not be considered for publication. From time to time, the journal will also publish authoritative surveys of important subject areas in control theory and optimization whose level of maturity permits a clear and unified exposition.
The broad areas mentioned above are intended to encompass a wide range of mathematical techniques and scientific, engineering, economic, and industrial applications. These include stochastic and deterministic methods in control, estimation, and identification of systems; modeling and realization of complex control systems; the numerical analysis and related computational methodology of control processes and allied issues; and the development of mathematical theories and techniques that give new insights into old problems or provide the basis for further progress in control theory and optimization. Within the field of optimization, the journal focuses on the parts that are relevant to dynamic and control systems. Contributions to numerical methodology are also welcome in accordance with these aims, especially as related to large-scale problems and decomposition as well as to fundamental questions of convergence and approximation.