Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

IEEE Open Journal of Control Systems, vol. 3, pp. 128-142. Pub Date: 2024-03-05. DOI: 10.1109/OJCSYS.2024.3372929

Lai Wei; Vaibhav Srivastava



Abstract

We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the distributions of rewards associated with arms are assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined by the difference in the expected cumulative reward obtained using the policy and using an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies with three different approaches, namely, periodic resetting, sliding observation window, and discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees.
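As a rough illustration of the sliding-observation-window approach named in the abstract, the sketch below computes each arm's UCB index from only the most recent `window` plays, so stale rewards are forgotten as the means drift. This is a generic sliding-window UCB written for illustration, not the paper's exact policy: the function names, the exploration constant `c`, the window length, and the two-arm test environment with means that swap halfway through are all assumptions.

```python
import math
import random
from collections import deque


def sliding_window_ucb(reward_fn, n_arms, horizon, window, c=2.0):
    """Generic sliding-window UCB: only the last `window` plays inform each index."""
    history = deque()            # (arm, reward) pairs currently inside the window
    counts = [0] * n_arms        # plays per arm inside the window
    sums = [0.0] * n_arms        # reward sums per arm inside the window
    choices = []
    for t in range(horizon):
        untried = [k for k in range(n_arms) if counts[k] == 0]
        if untried:
            arm = untried[0]     # an arm with no samples in the window is played first
        else:
            n_recent = min(t, window)
            arm = max(
                range(n_arms),
                key=lambda k: sums[k] / counts[k]
                + math.sqrt(c * math.log(n_recent) / counts[k]),
            )
        reward = reward_fn(arm, t)
        history.append((arm, reward))
        counts[arm] += 1
        sums[arm] += reward
        if len(history) > window:
            # Evict the oldest observation so the statistics stay local in time.
            old_arm, old_reward = history.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_reward
        choices.append(arm)
    return choices


# Hypothetical environment: the mean rewards of two arms swap at t = 1000.
rng = random.Random(0)


def swapping_rewards(arm, t):
    means = (0.9, 0.1) if t < 1000 else (0.1, 0.9)
    return means[arm] + rng.gauss(0.0, 0.1)


choices = sliding_window_ucb(swapping_rewards, n_arms=2, horizon=2000, window=200)
```

The window length sets the trade-off: a short window tracks changes quickly but keeps re-exploring, while a long window is more statistically efficient but slower to notice a change. Periodic resetting and discounting address the same trade-off by restarting the statistics at fixed intervals or by down-weighting older rewards geometrically.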


