On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes

IF 7 | Zone 1 (Computer Science) | Q1, AUTOMATION & CONTROL SYSTEMS | IEEE Transactions on Automatic Control | Pub Date: 2025-02-04 | DOI: 10.1109/TAC.2025.3538807
Hyeong Soo Chang
IEEE Transactions on Automatic Control, Volume 70, Issue 7, pages 4788-4793. Journal Article. https://ieeexplore.ieee.org/document/10870057/
Citations: 0

Abstract

A recent theoretical analysis of a Monte-Carlo tree search (MCTS) method, properly modified from the “upper confidence bound applied to trees” (UCT) algorithm, established a result that is surprising in light of the many empirical successes reported in the literature from heuristic use of UCT with adjustments for various problem domains: the expected absolute error converges to zero at a rate of $O(1/\sqrt{n})$ when estimating the optimal value at an initial state of a finite-horizon Markov decision process (MDP), where $n$ is the number of simulations. We strengthen this dispiritingly slow convergence result by arguing, within a simpler algorithmic framework from the MDP perspective rather than the usual MCTS description, that the simpler strategy called “upper confidence bound 1” (UCB1) for multiarmed bandit problems, when employed as an instance of MCTS by setting UCB1’s arm set to the policy set of the underlying MDP, has an asymptotically faster convergence rate of $O(\ln n / n)$. We also point out that UCT-based MCTS in general has time and space complexities that depend on the size of the state space in the worst case, which contradicts the original design spirit of MCTS. Unless used heuristically, UCT-based MCTS has yet to receive theoretical support for its applicability.
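The idea of treating each policy of the MDP as one bandit arm can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two-state toy MDP (`P`, `R`), the helper names, the scaling of returns into [0, 1], and the choice of the overall sample mean of returns as the value estimate. The sketch enumerates all deterministic Markov policies, runs the standard UCB1 index rule over them, and compares the result with the exact optimal value from backward induction.

```python
import itertools
import math
import random

# Hypothetical toy MDP (not from the paper): 2 states, 2 actions, horizon 2.
STATES, ACTIONS, HORIZON = (0, 1), (0, 1), 2
# P[s][a] = probability of moving to state 1; R[s][a] = reward in [0, 1].
P = {0: {0: 0.2, 1: 0.7}, 1: {0: 0.5, 1: 0.9}}
R = {0: {0: 0.1, 1: 0.3}, 1: {0: 0.8, 1: 0.6}}


def all_policies():
    """Enumerate deterministic Markov policies: one action per (stage, state)."""
    keys = [(t, s) for t in range(HORIZON) for s in STATES]
    for choice in itertools.product(ACTIONS, repeat=len(keys)):
        yield dict(zip(keys, choice))


def simulate(policy, s0, rng):
    """Run one rollout and return the total reward scaled into [0, 1]."""
    s, total = s0, 0.0
    for t in range(HORIZON):
        a = policy[(t, s)]
        total += R[s][a]
        s = 1 if rng.random() < P[s][a] else 0
    return total / HORIZON


def ucb1_value_estimate(n, s0=0, seed=0):
    """UCB1 with arm set = policy set; returns the overall mean return."""
    rng = random.Random(seed)
    arms = list(all_policies())
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)
    for t in range(1, n + 1):
        if t <= len(arms):
            i = t - 1  # initialization: play every arm once
        else:
            plays = t - 1  # total number of plays so far
            i = max(
                range(len(arms)),
                key=lambda k: sums[k] / counts[k]
                + math.sqrt(2.0 * math.log(plays) / counts[k]),
            )
        counts[i] += 1
        sums[i] += simulate(arms[i], s0, rng)
    # Assumed estimator: sample mean over all n simulations, rescaled.
    return HORIZON * sum(sums) / n


def optimal_value(s0=0):
    """Exact optimal value by backward induction, for comparison only."""
    v = {s: 0.0 for s in STATES}
    for _ in range(HORIZON):
        v = {
            s: max(
                R[s][a] + P[s][a] * v[1] + (1.0 - P[s][a]) * v[0]
                for a in ACTIONS
            )
            for s in STATES
        }
    return v[s0]


if __name__ == "__main__":
    print("exact optimal value:", optimal_value())
    print("UCB1 estimate      :", ucb1_value_estimate(20000))
```

The number of arms in this sketch is the number of policies, which grows exponentially with the horizon and the number of states, so it is meant only to make the arm-set construction concrete, not to suggest a practical solver.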
Source journal: IEEE Transactions on Automatic Control (Engineering & Technology: Electrical and Electronic Engineering)
CiteScore: 11.30
Self-citation rate: 5.90%
Articles published: 824
Review time: 9 months
About the journal: In the IEEE Transactions on Automatic Control, the IEEE Control Systems Society publishes high-quality papers on the theory, design, and applications of control engineering. Two types of contributions are regularly considered: 1) Papers: Presentation of significant research, development, or application of control concepts. 2) Technical Notes and Correspondence: Brief technical notes, comments on published areas or established control topics, corrections to papers and notes published in the Transactions. In addition, special papers (tutorials, surveys, and perspectives on the theory and applications of control systems topics) are solicited.