Simple fixes that accommodate switching costs in multi-armed bandits

IF 6 | Zone 2 (Management) | Q1, OPERATIONS RESEARCH & MANAGEMENT SCIENCE | European Journal of Operational Research | Pub Date: 2024-09-19 | DOI: 10.1016/j.ejor.2024.09.017
Ehsan Teymourian, Jian Yang
Volume 320, Issue 3, Pages 616-627 | Journal Article | https://www.sciencedirect.com/science/article/pii/S0377221724007203

Abstract

When switching costs are added to the multi-armed bandit (MAB) problem, in which the arms' random reward distributions are initially unknown, techniques quite different from those for pure MAB are usually required. We find that two simple fixes to the existing upper-confidence-bound (UCB) policy work well for MAB with switching costs (MAB-SC). Two cases should be distinguished. The first is positive-gap ambiguity, where the performance gap between the leading and lagging arms is known to be at least some δ > 0. For this case, our fix is to erect barriers that discourage frivolous arm switches. The second is zero-gap ambiguity, where absolutely nothing is known. We remedy this by forcing the same arm to be pulled over increasingly prolonged intervals. As usual, the effectiveness of each fix is measured by the worst-case average regret over a long time horizon T. With barriers fixed at δ/2, we obtain a regret bound of order ln(T) for the positive-gap case. With intervals such that n of them occupy n^2 periods, we achieve the best possible regret bound of order T^(1/2) for the zero-gap case. Beyond UCB, these fixes can also be applied to a learning-while-doing (LWD) heuristic with satisfactory results. Although they do not yet carry the best theoretical guarantees, the LWD-based policies have empirically outperformed those based on UCB and other known alternatives. Other numerically competitive policies include those obtained by applying interval-based fixes to Thompson sampling (TS).
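
The two fixes described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration, not the authors' implementation: the UCB1-style bonus in `ucb_index`, the Bernoulli arms, the interval schedule 1, 3, 5, ... (one schedule under which n intervals occupy n^2 periods), and the reading of a "barrier" as a margin of δ/2 that a rival arm's index must exceed before a switch is allowed. All function names are hypothetical.

```python
import math
import random


def interval_lengths(num_intervals):
    """Assumed schedule: lengths 1, 3, 5, ... so n intervals span n**2 periods."""
    return [2 * k - 1 for k in range(1, num_intervals + 1)]


def ucb_index(mean, pulls, t):
    """UCB1-style index (assumed form; the paper's exact bonus may differ)."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def pull(rng, arm_means, arm, pulls, totals):
    """Draw one Bernoulli reward from `arm` and update its statistics."""
    reward = 1.0 if rng.random() < arm_means[arm] else 0.0
    pulls[arm] += 1
    totals[arm] += reward


def run_interval_ucb(arm_means, horizon, seed=0):
    """Zero-gap sketch: pick an arm by UCB once per interval, then hold it."""
    rng = random.Random(seed)
    k = len(arm_means)
    pulls, totals = [0] * k, [0.0] * k
    switches, current, t, interval = 0, None, 0, 0
    while t < horizon:
        interval += 1
        length = 2 * interval - 1  # same schedule as interval_lengths
        if interval <= k:
            arm = interval - 1  # first k intervals: try every arm once
        else:
            arm = max(
                range(k),
                key=lambda a: ucb_index(totals[a] / pulls[a], pulls[a], t),
            )
        if current is not None and arm != current:
            switches += 1
        current = arm
        for _ in range(min(length, horizon - t)):
            pull(rng, arm_means, arm, pulls, totals)
            t += 1
    return pulls, switches


def run_barrier_ucb(arm_means, horizon, delta, seed=0):
    """Positive-gap sketch: switch only if a rival beats the current arm's
    index by more than the barrier delta / 2 (an assumed interpretation)."""
    rng = random.Random(seed)
    k = len(arm_means)
    pulls, totals = [0] * k, [0.0] * k
    switches, current = 0, None
    barrier = delta / 2.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initial round: pull every arm once
        else:
            idx = [ucb_index(totals[a] / pulls[a], pulls[a], t) for a in range(k)]
            best = max(range(k), key=idx.__getitem__)
            arm = best if idx[best] > idx[current] + barrier else current
        if current is not None and arm != current:
            switches += 1
        current = arm
        pull(rng, arm_means, arm, pulls, totals)
    return pulls, switches
```

Under the assumed schedule, a horizon of T periods contains at most about T^(1/2) intervals, so `run_interval_ucb` can switch arms at most that many times; this is only the switching-cost side of the T^(1/2) bound stated in the abstract, not a proof of it. A call such as `run_interval_ucb([0.4, 0.6], 10_000)` or `run_barrier_ucb([0.4, 0.6], 10_000, delta=0.2)` returns the per-arm pull counts and the number of switches.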
Source journal
European Journal of Operational Research (Management Science: Operations Research & Management Science)
CiteScore: 11.90
Self-citation rate: 9.40%
Articles published: 786
Review time: 8.2 months
Journal description: The European Journal of Operational Research (EJOR) publishes high quality, original papers that contribute to the methodology of operational research (OR) and to the practice of decision making.