Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

Lai Wei;Vaibhav Srivastava
{"title":"Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret","authors":"Lai Wei;Vaibhav Srivastava","doi":"10.1109/OJCSYS.2024.3372929","DOIUrl":null,"url":null,"abstract":"We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the distributions of rewards associated with arms are assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined by the difference in the expected cumulative reward obtained using the policy and using an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies with three different approaches, namely, periodic resetting, sliding observation window, and discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees.","PeriodicalId":73299,"journal":{"name":"IEEE open journal of control systems","volume":"3 ","pages":"128-142"},"PeriodicalIF":0.0000,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10460198","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of control systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10460198/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the distributions of rewards associated with arms are assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined by the difference in the expected cumulative reward obtained using the policy and using an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies with three different approaches, namely, periodic resetting, sliding observation window, and discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
非平稳随机强盗:UCB 政策和最小遗憾
我们研究的是非平稳随机多臂匪徒(MAB)问题,其中假定与臂相关的奖励分布是时变的,并且预期奖励的总变化受制于变化预算。一个策略的遗憾度是指使用该策略与使用一个每次选择平均奖励最大的手臂的神谕所获得的预期累积奖励之差。我们用最坏情况下的遗憾值来描述所提策略的性能,即满足变化预算的奖励分布序列集合上遗憾值的上确值。我们设计了基于置信度上限(UCB)的策略,采用了三种不同的方法,即周期性重置、滑动观察窗口和贴现因子,并证明它们在最小遗憾(即任何策略都能达到的最小最坏情况遗憾)方面都是有序最优的。我们还放宽了奖赏分布的亚高斯假设,并开发了可处理重尾奖赏分布并保持其性能保证的鲁棒版本。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Lyapunov-Based Nonlinear Model Predictive Control of Input-Delayed Functional Electrical Stimulation: Investigative Simulations and Experiments Front Cover IEEE Control Systems Society Information IEEE Open Journal of Control Systems Publication Information Geometry-Aware Edge-State Tracking for Robust Affine Formation Control
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1