AlphaGo and Monte Carlo tree search: The simulation optimization perspective

2016 Winter Simulation Conference (WSC). Pub Date: 2016-12-11. DOI: 10.1109/WSC.2016.7822130

M. Fu

Citations: 29

Abstract

In March of 2016, Google DeepMind's AlphaGo, a computer Go-playing program, defeated the reigning human world champion Go player, 4-1, a feat far more impressive than previous victories by computer programs in chess (IBM's Deep Blue) and Jeopardy (IBM's Watson). The main engine behind the program combines machine learning approaches with a technique called Monte Carlo tree search. Current versions of Monte Carlo tree search used in Go-playing algorithms are based on a version developed for games that traces its roots back to the adaptive multi-stage sampling simulation optimization algorithm for estimating value functions in finite-horizon Markov decision processes (MDPs) introduced by Chang et al. (2005), which was the first use of Upper Confidence Bounds (UCBs) for Monte Carlo simulation-based solution of MDPs. We review the main ideas in UCB-based Monte Carlo tree search by connecting it to simulation optimization through the use of two simple examples: decision trees and tic-tac-toe.
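As a rough illustration of the UCB idea mentioned in the abstract, the following is a minimal sketch of UCB-style sampling for a single decision stage. It is not the algorithm of Chang et al. (2005) or AlphaGo's search. The function names `ucb_select` and `run_ucb`, the exploration constant `c`, and the bonus term sqrt(2 ln n / N_a) (one common form of the UCB1 index) are all assumptions made for this example.

```python
import math
import random


def ucb_select(counts, means, c=1.0):
    """Pick the action with the largest upper confidence bound.

    counts[a] is how often action a has been sampled so far and
    means[a] is its average simulated reward. The bonus form
    c * sqrt(2 ln n / N_a) is an assumption for illustration.
    """
    # Sample every action once before comparing bounds.
    for a, n_a in enumerate(counts):
        if n_a == 0:
            return a
    total = sum(counts)
    scores = [
        m + c * math.sqrt(2.0 * math.log(total) / n_a)
        for m, n_a in zip(means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb(simulators, budget, seed=0):
    """Spend a simulation budget across actions using ucb_select.

    simulators[a] is a hypothetical callable taking a random.Random
    and returning one sampled reward for action a.
    """
    rng = random.Random(seed)
    counts = [0] * len(simulators)
    means = [0.0] * len(simulators)
    for _ in range(budget):
        a = ucb_select(counts, means)
        reward = simulators[a](rng)
        counts[a] += 1
        # Incremental update of the running average reward.
        means[a] += (reward - means[a]) / counts[a]
    return counts, means
```

For example, `run_ucb([lambda rng: float(rng.random() < 0.3), lambda rng: float(rng.random() < 0.6)], 1000)` would allocate a simulation budget between two Bernoulli actions. In a tree search, a rule of this kind would be applied at each visited node rather than only at the root.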
