Asaf Cassel (School of Computer Science, Tel Aviv University), Orin Levy (School of Computer Science, Tel Aviv University), Yishay Mansour (School of Computer Science, Tel Aviv University; Google Research, Tel Aviv)
arXiv - STAT - Machine Learning, published 2024-09-13. DOI: arxiv-2409.08570 (https://doi.org/arxiv-2409.08570)
Batch Ensemble for Variance Dependent Regret in Stochastic Bandits
Efficiently trading off exploration and exploitation is one of the key
challenges in online Reinforcement Learning (RL). Most works achieve this by
carefully estimating the model uncertainty and following the so-called
optimistic model. Inspired by practical ensemble methods, in this work we
propose a simple and novel batch ensemble scheme that provably achieves
near-optimal regret for stochastic Multi-Armed Bandits (MAB). Crucially, our
algorithm has just a single parameter, namely the number of batches, and its
value does not depend on distributional properties such as the scale and
variance of the losses. We complement our theoretical results by demonstrating
the effectiveness of our algorithm on synthetic benchmarks.
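
The abstract does not spell out the scheme, so the following is only a minimal sketch of one way a batch ensemble could drive exploration, not the authors' algorithm. It assumes that each arm's observed losses are split round-robin into `num_batches` batches, that an arm's index is its smallest batch mean (an optimistic estimate when the goal is to minimize loss), and that the arm with the smallest index is pulled. The function name `batch_ensemble_bandit`, the warm-up phase, and the Bernoulli arms are all hypothetical choices made for illustration.

```python
import random


def batch_ensemble_bandit(arms, horizon, num_batches, seed=0):
    """Illustrative batch-ensemble rule (an assumption, not the paper's algorithm).

    arms: list of callables that take a random.Random and return a loss.
    Returns the number of pulls per arm after `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(arms)
    # sums[a][b] / counts[a][b]: running loss total and sample count
    # for batch b of arm a.
    sums = [[0.0] * num_batches for _ in range(k)]
    counts = [[0] * num_batches for _ in range(k)]
    pulls = [0] * k

    for t in range(horizon):
        if t < k * num_batches:
            # Warm-up: cycle through the arms until every batch has one sample.
            arm = t % k
        else:
            # Optimistic index for losses: the smallest batch mean of each arm.
            index = [
                min(s / c for s, c in zip(sums[a], counts[a]))
                for a in range(k)
            ]
            arm = min(range(k), key=index.__getitem__)
        loss = arms[arm](rng)
        b = pulls[arm] % num_batches  # round-robin batch assignment
        sums[arm][b] += loss
        counts[arm][b] += 1
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # Two hypothetical Bernoulli-loss arms with mean losses 0.2 and 0.8.
    arms = [
        lambda rng: float(rng.random() < 0.2),
        lambda rng: float(rng.random() < 0.8),
    ]
    print(batch_ensemble_bandit(arms, horizon=2000, num_batches=8))
```

In this sketch the only tunable quantity is `num_batches`, which mirrors the abstract's claim of a single parameter; no loss scale or variance estimate is passed in. Whether the paper forms batches or indices this way is not stated in the text above.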