{"title":"蒙特卡罗树搜索的非渐近分析","authors":"D. Shah, Qiaomin Xie, Zhi Xu","doi":"10.1145/3393691.3394202","DOIUrl":null,"url":null,"abstract":"In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo Tree Search (MCTS), in the context of infinite-horizon discounted cost Markov Decision Process (MDP) with deterministic transitions. While MCTS is believed to provide an approximate value function for a given state with enough simulations, cf. [Kocsis and Szepesvari 2006; Kocsis et al. 2006], the claimed proof of this property is incomplete. This is due to the fact that the variant of MCTS, the Upper Confidence Bound for Trees (UCT), analyzed in prior works utilizes \"logarithmic\" bonus term for balancing exploration and exploitation within the tree-based search, following the insights from stochastic multi-arm bandit (MAB) literature, cf. [Agrawal 1995; Auer et al. 2002]. In effect, such an approach assumes that the regret of the underlying recursively dependent non-stationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold as pointed out in [Audibert et al. 2009], even for stationary MABs. As the key contribution of this work, we establish polynomial concentration property of regret for a class of non-stationary multi-arm bandits. This in turn establishes that the MCTS with appropriate polynomial rather than logarithmic bonus term in UCB has the claimed property of [Kocsis and Szepesvari 2006; Kocsis et al. 2006]. Interestingly enough, empirically successful approaches (cf. [Silver et al. 2017]) utilize a similar polynomial form of MCTS as suggested by our result. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a \"policy improvement\" operator, i.e., it iteratively improves value function approximation for all states, due to combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an ε-approximation of the value function for deterministic MDPs with respect to ℓ∞ norm, MCTS combined with nearest neighbor requires a sample size scaling as Õ (ε-(d+4), where d is the dimension of the state space. This is nearly optimal due to a minimax lower bound of ∼Ω (ε-(d+2) [Shah and Xie 2018], suggesting the strength of the variant of MCTS we propose here and our resulting analysis.","PeriodicalId":188517,"journal":{"name":"Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems","volume":"199 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Non-Asymptotic Analysis of Monte Carlo Tree Search\",\"authors\":\"D. Shah, Qiaomin Xie, Zhi Xu\",\"doi\":\"10.1145/3393691.3394202\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo Tree Search (MCTS), in the context of infinite-horizon discounted cost Markov Decision Process (MDP) with deterministic transitions. While MCTS is believed to provide an approximate value function for a given state with enough simulations, cf. [Kocsis and Szepesvari 2006; Kocsis et al. 2006], the claimed proof of this property is incomplete. 
This is due to the fact that the variant of MCTS, the Upper Confidence Bound for Trees (UCT), analyzed in prior works utilizes \\\"logarithmic\\\" bonus term for balancing exploration and exploitation within the tree-based search, following the insights from stochastic multi-arm bandit (MAB) literature, cf. [Agrawal 1995; Auer et al. 2002]. In effect, such an approach assumes that the regret of the underlying recursively dependent non-stationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold as pointed out in [Audibert et al. 2009], even for stationary MABs. As the key contribution of this work, we establish polynomial concentration property of regret for a class of non-stationary multi-arm bandits. This in turn establishes that the MCTS with appropriate polynomial rather than logarithmic bonus term in UCB has the claimed property of [Kocsis and Szepesvari 2006; Kocsis et al. 2006]. Interestingly enough, empirically successful approaches (cf. [Silver et al. 2017]) utilize a similar polynomial form of MCTS as suggested by our result. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a \\\"policy improvement\\\" operator, i.e., it iteratively improves value function approximation for all states, due to combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an ε-approximation of the value function for deterministic MDPs with respect to ℓ∞ norm, MCTS combined with nearest neighbor requires a sample size scaling as Õ (ε-(d+4), where d is the dimension of the state space. This is nearly optimal due to a minimax lower bound of ∼Ω (ε-(d+2) [Shah and Xie 2018], suggesting the strength of the variant of MCTS we propose here and our resulting analysis.\",\"PeriodicalId\":188517,\"journal\":{\"name\":\"Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems\",\"volume\":\"199 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-05-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3393691.3394202\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3393691.3394202","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
In this work, we consider the popular tree-based search strategy within the reinforcement learning framework, Monte Carlo Tree Search (MCTS), in the context of infinite-horizon discounted-cost Markov Decision Processes (MDPs) with deterministic transitions. While MCTS is believed to provide an approximate value function for a given state when run with enough simulations, cf. [Kocsis and Szepesvari 2006; Kocsis et al. 2006], the claimed proof of this property is incomplete. This is because the variant of MCTS analyzed in prior works, the Upper Confidence Bound for Trees (UCT), utilizes a "logarithmic" bonus term for balancing exploration and exploitation within the tree-based search, following insights from the stochastic multi-arm bandit (MAB) literature, cf. [Agrawal 1995; Auer et al. 2002]. In effect, such an approach assumes that the regret of the underlying recursively dependent non-stationary MABs concentrates around its mean exponentially in the number of steps, which, as pointed out in [Audibert et al. 2009], is unlikely to hold even for stationary MABs. As the key contribution of this work, we establish a polynomial concentration property of regret for a class of non-stationary multi-arm bandits. This in turn establishes that MCTS with an appropriate polynomial, rather than logarithmic, bonus term in UCB has the property claimed in [Kocsis and Szepesvari 2006; Kocsis et al. 2006]. Interestingly, empirically successful approaches (cf. [Silver et al. 2017]) utilize a polynomial form of MCTS similar to the one suggested by our result. Using this as a building block, we argue that MCTS combined with nearest-neighbor supervised learning acts as a "policy improvement" operator: although it evaluates only finitely many states, the supervised learning step lets it iteratively improve the value function approximation for all states. In effect, we establish that to learn an ε-approximation of the value function of deterministic MDPs with respect to the ℓ∞ norm, MCTS combined with nearest-neighbor supervised learning requires a sample size scaling as Õ(ε^{-(d+4)}), where d is the dimension of the state space. This is nearly optimal in view of the minimax lower bound of Ω̃(ε^{-(d+2)}) [Shah and Xie 2018], suggesting the strength of the MCTS variant we propose here and of our resulting analysis.
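
To make the contrast between the two bonus terms concrete, the following is a minimal Python sketch of the action-selection step in a UCT-style search: the classical rule uses a logarithmic bonus of the form sqrt(log t / n_a), while the variant suggested by the analysis above replaces it with a polynomial bonus of the form beta * t^alpha / n_a^xi. The constants beta, alpha, and xi, the helper names, and the two-armed toy example are illustrative assumptions, not the paper's exact specification.

import math
import random

# Illustrative sketch only: two selection rules for a UCT-style tree search.
# The classical UCT bonus is logarithmic (Kocsis and Szepesvari 2006); the
# polynomial bonus mirrors the general form suggested by the analysis above.
# beta, alpha, xi are placeholder constants, not the paper's derived values.

def ucb_logarithmic(mean, n_parent, n_action, c=1.0):
    # Classical UCT: empirical mean plus sqrt(log(parent visits) / action visits).
    return mean + c * math.sqrt(math.log(n_parent) / n_action)

def ucb_polynomial(mean, n_parent, n_action, beta=1.0, alpha=0.5, xi=0.5):
    # Polynomial bonus: grows as a power of the parent visit count and
    # decays as a power of the action's own visit count.
    return mean + beta * n_parent ** alpha / n_action ** xi

def select_action(stats, bonus):
    # stats maps action -> (total reward, visit count); untried actions go first.
    n_parent = sum(n for _, n in stats.values())
    best_action, best_score = None, float("-inf")
    for action, (total, n) in stats.items():
        if n == 0:
            return action
        score = bonus(total / n, n_parent, n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

if __name__ == "__main__":
    random.seed(0)
    true_means = {0: 0.4, 1: 0.6}            # two-armed toy bandit
    stats = {a: (0.0, 0) for a in true_means}
    for _ in range(1000):
        a = select_action(stats, ucb_polynomial)
        reward = 1.0 if random.random() < true_means[a] else 0.0
        total, n = stats[a]
        stats[a] = (total + reward, n + 1)
    print({a: n for a, (_, n) in stats.items()})  # arm 1 should be pulled most

In a full MCTS these statistics would be kept at every tree node and the selected action would be followed recursively down the tree; the sketch isolates only the difference in the exploration bonus.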