Robust sequential design for piecewise-stationary multi-armed bandit problem in the presence of outliers

Statistical Theory and Related Fields (IF 1.3, Q3, Statistics & Probability) | Pub Date: 2021-04-03 | DOI: 10.1080/24754269.2021.1902687

Yaping Wang, Zhicheng Peng, Riquan Zhang, Qian Xiao

Volume 5, pages 122-133 (journal article)

Citations: 1

Abstract

The multi-armed bandit (MAB) problem studies sequential decision making in the presence of uncertainty and partial feedback on rewards. Its name comes from imagining a gambler at a row of slot machines who must decide how many times, and in what order, to play each machine. It is a classic reinforcement learning problem that is fundamental to many online learning problems. In many practical applications of the MAB, the reward distributions may change at unknown time steps, and outliers (extreme rewards) often occur. Current sequential design strategies may struggle in such cases, as they tend to infer additional change points to fit the outliers. In this paper, we propose a robust change-detection upper confidence bound (RCD-UCB) algorithm that can distinguish real change points from outliers in piecewise-stationary MAB settings. We show that the proposed RCD-UCB algorithm can achieve a nearly optimal regret bound, with an order expressed in terms of T, the number of time steps, K, the number of arms, and S, the number of stationary segments. We demonstrate its superior performance compared to some state-of-the-art algorithms in both simulation experiments and real data analysis. (See https://github.com/woaishufenke/MAB_STRF.git for the code used in this paper.)
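The setting described in the abstract can be illustrated with a minimal sketch. It runs plain UCB1 (the classic upper confidence bound rule), not the RCD-UCB algorithm of the paper, on a piecewise-stationary Gaussian bandit with occasional injected outliers. The function names `ucb1_select` and `run_ucb1`, the noise level, and the outlier model are all illustrative assumptions and are not taken from the paper or its repository.

```python
import math
import random


def ucb1_select(counts, sums, t):
    """Return the arm with the largest UCB1 index at round t (1-based)."""
    for arm, n in enumerate(counts):
        if n == 0:  # play every arm once before using the index
            return arm
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )


def run_ucb1(segments, horizon, outlier_prob=0.0, outlier_value=10.0, seed=0):
    """Play UCB1 on a piecewise-stationary Gaussian bandit (illustrative only).

    segments: list of (start_round, arm_means) pairs sorted by start_round,
              with the first segment starting at round 1.
    Returns the number of pulls per arm.
    """
    rng = random.Random(seed)
    n_arms = len(segments[0][1])
    counts, sums = [0] * n_arms, [0.0] * n_arms
    for t in range(1, horizon + 1):
        # Arm means of the stationary segment that contains round t.
        means = [m for start, m in segments if start <= t][-1]
        arm = ucb1_select(counts, sums, t)
        reward = rng.gauss(means[arm], 0.1)  # assumed noise level
        if rng.random() < outlier_prob:
            reward = outlier_value  # assumed outlier model: a fixed extreme reward
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    # Two segments: the best arm switches from arm 0 to arm 1 at round 501.
    segments = [(1, [0.9, 0.1]), (501, [0.1, 0.9])]
    print(run_ucb1(segments, horizon=1000, outlier_prob=0.01))
```

Because UCB1 averages every reward an arm has ever produced, it neither forgets the first segment after the change nor discounts the injected extreme rewards, which is the kind of weakness that change-detection and robust variants aim to address.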


Source journal