{"title":"类alphazero深度强化学习的超参数分析","authors":"Haibo Wang, M. Emmerich, M. Preuss, A. Plaat","doi":"10.1142/s0219622022500547","DOIUrl":null,"url":null,"abstract":"The landmark achievements of AlphaGo Zero have created great research interest into self-play in reinforcement learning. In self-play, Monte Carlo Tree Search is used to train a deep neural network, which is then used itself in tree searches. The training is gov- erned by many hyper-parameters. There has been surprisingly little research on design choices for hyper-parameter values and loss functions, presumably because of the pro- hibitive computational cost to explore the parameter space. In this paper, we investigate 12 hyper-parameters in an AlphaZero-like self-play algorithm and evaluate how these parameters contribute to training. We study them on small games, to achieve meaningful exploration with moderate computational effort. The experimental results show that training is highly sensitive to hyper-parameter choices. Through multi-objective analysis, we identify 4 important hyper-parameters to further assess. To start, we find surprising results where too much training can sometimes lead to lower performance. Our main result is that the number of self-play iterations subsumes MCTS-search sim- ulations, game episodes, and training epochs. The intuition is that these three increase together as self-play iterations increase and that increasing them individually is sub- optimal. As a consequence of our experiments, we provide recommendations on setting hyper-parameter values in self-play. The outer loop of self-play iterations should be em- phasized, in favor of the inner loop. This means hyper-parameters for the inner loop, should be set to lower values. A secondary result of our experiments concerns the choice of optimization goals, for which we also provide recommendations.","PeriodicalId":13527,"journal":{"name":"Int. J. Inf. Technol. Decis. Mak.","volume":"87 1","pages":"829-853"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Analysis of Hyper-Parameters for AlphaZero-Like Deep Reinforcement Learning\",\"authors\":\"Haibo Wang, M. Emmerich, M. Preuss, A. Plaat\",\"doi\":\"10.1142/s0219622022500547\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The landmark achievements of AlphaGo Zero have created great research interest into self-play in reinforcement learning. In self-play, Monte Carlo Tree Search is used to train a deep neural network, which is then used itself in tree searches. The training is gov- erned by many hyper-parameters. There has been surprisingly little research on design choices for hyper-parameter values and loss functions, presumably because of the pro- hibitive computational cost to explore the parameter space. In this paper, we investigate 12 hyper-parameters in an AlphaZero-like self-play algorithm and evaluate how these parameters contribute to training. We study them on small games, to achieve meaningful exploration with moderate computational effort. The experimental results show that training is highly sensitive to hyper-parameter choices. Through multi-objective analysis, we identify 4 important hyper-parameters to further assess. To start, we find surprising results where too much training can sometimes lead to lower performance. 
Our main result is that the number of self-play iterations subsumes MCTS-search sim- ulations, game episodes, and training epochs. The intuition is that these three increase together as self-play iterations increase and that increasing them individually is sub- optimal. As a consequence of our experiments, we provide recommendations on setting hyper-parameter values in self-play. The outer loop of self-play iterations should be em- phasized, in favor of the inner loop. This means hyper-parameters for the inner loop, should be set to lower values. A secondary result of our experiments concerns the choice of optimization goals, for which we also provide recommendations.\",\"PeriodicalId\":13527,\"journal\":{\"name\":\"Int. J. Inf. Technol. Decis. Mak.\",\"volume\":\"87 1\",\"pages\":\"829-853\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-08-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Int. J. Inf. Technol. Decis. Mak.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1142/s0219622022500547\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Int. J. Inf. Technol. Decis. Mak.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/s0219622022500547","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Analysis of Hyper-Parameters for AlphaZero-Like Deep Reinforcement Learning
The landmark achievements of AlphaGo Zero have created great research interest in self-play for reinforcement learning. In self-play, Monte Carlo Tree Search (MCTS) is used to generate training data for a deep neural network, which in turn is used to guide the tree search. The training is governed by many hyper-parameters. There has been surprisingly little research on design choices for hyper-parameter values and loss functions, presumably because of the prohibitive computational cost of exploring the parameter space. In this paper, we investigate 12 hyper-parameters in an AlphaZero-like self-play algorithm and evaluate how these parameters contribute to training. We study them on small games, to achieve meaningful exploration with moderate computational effort. The experimental results show that training is highly sensitive to hyper-parameter choices. Through multi-objective analysis, we identify 4 important hyper-parameters for further assessment. A first surprising result is that too much training can sometimes lead to lower performance. Our main result is that the number of self-play iterations subsumes the number of MCTS search simulations, game episodes, and training epochs. The intuition is that these three increase together as the number of self-play iterations increases, and that increasing them individually is suboptimal. As a consequence of our experiments, we provide recommendations on setting hyper-parameter values in self-play. The outer loop of self-play iterations should be emphasized over the inner loops; this means that hyper-parameters for the inner loops should be set to lower values. A secondary result of our experiments concerns the choice of optimization goals, for which we also provide recommendations.
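To make the loop structure concrete, here is a minimal, self-contained Python sketch of an AlphaZero-like self-play algorithm with the four hyper-parameters discussed above (self-play iterations, game episodes, MCTS simulations per move, training epochs) made explicit. The function names and the stubbed game and training logic are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the nested loops in an AlphaZero-like self-play algorithm.
# play_episode and train_epoch are hypothetical stand-ins for the real
# self-play and gradient-update steps; they only illustrate the structure.

import random

def play_episode(net, num_mcts_sims):
    """Self-play one game; each move would run num_mcts_sims MCTS
    simulations guided by net. Stubbed: returns fake training tuples."""
    return [(random.random(), random.random(), random.choice([-1, 1]))
            for _ in range(10)]

def train_epoch(net, examples):
    """One pass of network updates over the collected examples (stubbed)."""
    return net

def self_play(num_iterations, num_episodes, num_mcts_sims, num_epochs):
    net = object()  # stand-in for a randomly initialized policy/value net
    for _ in range(num_iterations):        # outer loop: self-play iterations
        examples = []
        for _ in range(num_episodes):      # inner loop: game episodes
            examples += play_episode(net, num_mcts_sims)
        for _ in range(num_epochs):        # inner loop: training epochs
            net = train_epoch(net, examples)
    return net
```

The paper's main recommendation maps directly onto this structure: spend the budget on more passes of the outer loop (num_iterations) rather than on larger values of the three inner-loop parameters, since the latter effectively grow with the outer loop anyway.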
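For context on the "choice of optimization goals": the abstract does not state which loss functions were compared, but the combined loss introduced in the original AlphaGo Zero paper is the natural baseline for such a comparison. It weighs a mean-squared value error, a policy cross-entropy, and L2 regularization:

```latex
% AlphaGo Zero combined training loss (Silver et al., 2017), for reference.
% z: game outcome, v: predicted value, \pi: MCTS visit-count policy,
% p: predicted policy, c: L2 weight, \theta: network parameters.
\[
  \ell = (z - v)^2 \;-\; \pi^{\top} \log p \;+\; c\,\lVert \theta \rVert^2
\]
```

Varying the relative weight of the value and policy terms (or dropping one of them) is one way the optimization goal of such an algorithm can differ, which is the kind of design choice the paper's secondary recommendations address.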