{"title":"Adaptive MCMC parallelisation in Stan","authors":"T. Stenborg","doi":"10.36334/modsim.2023.stenborg","DOIUrl":null,"url":null,"abstract":": Stan is a probabilistic programming language that uses Markov chain Monte Carlo (MCMC) sampling for Bayesian inference (Carpenter at el.). Stan sampling can be parallelised by running Markov chains m on separate processing cores n , i.e. ≥ 1 chain/core, for Amdahlian speedup (Annis et al.). An extension , introduced here, is adaptive parallelisation. First, prior to planned sampling, performance benchmarking was dynamically performed with m = 4… M chains distributed over n = 1 … m cores (where M is a system’s number of available cores, and using at least four chains is recommended (Vehtari et el.)). The best performing configuration ( m, n ) was then automatically adopted ( github.com/tstenborg/Stan - Adaptive -Parallelisation). To be relevant, benchmarking should proceed with the same data and compiled Stan model as the planned sampling. For efficiency, benchmarking was performed with fewer chain iterations than for inference proper, though using the same ratio of warmup to post-warmup iterations/chain (1 : 1/ m , yielding an equal number of total draws per configuration). For further efficiency, comparison of only one evaluation of each configuration was made. One evaluation was deemed sufficient after measuring speedup variability, for an example problem and configuration near the middle of a test system’s (Intel Core i7-10750H) non-hyperthreaded ( m , n ) configuration range. The simplifying assumption was made that results for the configuration were representative of the entire hyperthreaded and non-hyperthreaded range. Finally, for meaningful interconfiguration comparisons, a fixed seed was passed to the Stan random number generator. Warmup iterations had a significant effect on optimum ( m , n ). Too few warmup iterations, though speeding up benchmarking, can leave Stan without enough adaptation time to determine efficient sampling parameters (Hecht et al.","PeriodicalId":390064,"journal":{"name":"MODSIM2023, 25th International Congress on Modelling and Simulation.","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"MODSIM2023, 25th International Congress on Modelling and Simulation.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.36334/modsim.2023.stenborg","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Stan is a probabilistic programming language that uses Markov chain Monte Carlo (MCMC) sampling for Bayesian inference (Carpenter et al.). Stan sampling can be parallelised by running m Markov chains on n separate processing cores, i.e. ≥ 1 chain/core, for Amdahlian speedup (Annis et al.). An extension, introduced here, is adaptive parallelisation. First, prior to planned sampling, performance benchmarking was dynamically performed with m = 4, …, M chains distributed over n = 1, …, m cores (where M is a system's number of available cores, and using at least four chains is recommended (Vehtari et al.)). The best performing configuration (m, n) was then automatically adopted (github.com/tstenborg/Stan-Adaptive-Parallelisation). To be relevant, benchmarking should proceed with the same data and compiled Stan model as the planned sampling. For efficiency, benchmarking was performed with fewer chain iterations than for inference proper, though using the same ratio of warmup to post-warmup iterations per chain (1 : 1/m, yielding an equal number of total draws per configuration). For further efficiency, only one evaluation of each configuration was compared. One evaluation was deemed sufficient after measuring speedup variability for an example problem and a configuration near the middle of a test system's (Intel Core i7-10750H) non-hyperthreaded (m, n) configuration range. The simplifying assumption was made that results for this configuration were representative of the entire hyperthreaded and non-hyperthreaded range. Finally, for meaningful inter-configuration comparisons, a fixed seed was passed to the Stan random number generator. Warmup iterations had a significant effect on the optimum (m, n). Too few warmup iterations, though speeding up benchmarking, can leave Stan without enough adaptation time to determine efficient sampling parameters (Hecht et al.).
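The sketch below illustrates the benchmarking loop described above, using CmdStanPy as the Stan interface. It is a minimal illustration under stated assumptions, not the author's released implementation: the model file name, data path, core count, and benchmark iteration counts are hypothetical placeholders, and only the general scheme (timing short runs over m = 4, …, M chains on n = 1, …, m cores with a fixed seed and a 1 : 1/m warmup-to-post-warmup ratio, then adopting the fastest configuration) follows the abstract.

```python
"""Sketch of adaptive (m, n) benchmarking for Stan sampling via CmdStanPy.

Placeholder values: model.stan, data.json, M, BENCH_WARMUP, SEED.
"""
import time

from cmdstanpy import CmdStanModel

M = 6               # available cores on the test system (placeholder)
BENCH_WARMUP = 200  # warmup iterations/chain for the short benchmark runs (placeholder)
SEED = 2023         # fixed seed so configurations are compared on equal terms

# Compile once; reuse the same compiled model and data for all benchmark runs,
# as the abstract notes benchmarking should match the planned sampling.
model = CmdStanModel(stan_file="model.stan")
data = "data.json"

best = None  # (elapsed seconds, m, n)
for m in range(4, M + 1):          # at least four chains recommended
    for n in range(1, m + 1):      # distribute m chains over n cores
        # Warmup : post-warmup ratio of 1 : 1/m per chain, so total
        # post-warmup draws are equal across configurations.
        iter_sampling = max(1, BENCH_WARMUP // m)

        start = time.perf_counter()
        model.sample(
            data=data,
            chains=m,
            parallel_chains=n,
            iter_warmup=BENCH_WARMUP,
            iter_sampling=iter_sampling,
            seed=SEED,
            show_progress=False,
        )
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best[0]:
            best = (elapsed, m, n)

elapsed, m_best, n_best = best
print(f"Adopting m = {m_best} chains on n = {n_best} cores ({elapsed:.1f} s)")

# Inference proper then reuses the winning (m, n) with full iteration counts.
fit = model.sample(data=data, chains=m_best, parallel_chains=n_best, seed=SEED)
```

In this sketch each configuration is timed once, mirroring the abstract's single-evaluation simplification; a more cautious variant would average several repetitions, at the cost of longer benchmarking.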