SAFE LINEAR BANDITS
Ahmadreza Moradipari, Sanae Amani, M. Alizadeh, Christos Thrampoulidis
2021 55th Annual Conference on Information Sciences and Systems (CISS), March 24, 2021
DOI: 10.1109/CISS50987.2021.9400288
Abstract
Bandit algorithms have various applications in safety-critical systems, where it is important to respect the system's underlying constraints. The challenge is that such constraints are often unknown, as they depend on the bandit's unknown parameters. In this talk, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend linearly on an unknown parameter vector. As such, the learner is unable to identify all safe actions and must act conservatively, ensuring that its actions satisfy the safety constraint in every round (at least with high probability). For these bandits, we propose new upper-confidence bound (UCB) and Thompson-sampling algorithms, which include the modifications necessary to respect the safety constraints. For two settings (with and without bandit-feedback information on the constraint), we prove regret bounds and discuss their optimality in relation to the corresponding bounds in the absence of safety restrictions. For example, for the setting with bandit-feedback information on the constraint, we present a frequentist regret of order $\mathcal{O}\left(d^{3/2}\log^{1/2} d\,\sqrt{T}\log^{2/3} T\right)$, which, remarkably, matches the results provided by [1] for the standard linear Thompson-sampling algorithm. We highlight how the inherently randomized nature of Thompson sampling helps expand the set of safe actions the algorithm has access to at each round. Finally, we discuss related problem variations with stage-wise baseline constraints, in which the learner must choose actions that not only maximize cumulative reward across the entire time horizon but also satisfy a linear baseline constraint taking the form of a lower bound on the instantaneous reward. The content of this talk is based on [2]-[4].
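To make the per-round logic concrete, the following is a minimal Python sketch of one round of a safe linear Thompson-sampling style rule, not the authors' exact algorithm: regularized least-squares estimates of the reward and constraint parameters, a Gaussian perturbation of the reward estimate, and selection of the best action from a conservatively estimated safe set. The confidence radius beta, the noise scale v, and the fallback action are illustrative placeholders rather than the theoretical quantities from the papers.

import numpy as np

rng = np.random.default_rng(0)

def safe_ts_round(actions, X, r, y, c, lam=1.0, v=1.0, beta=1.0):
    # actions: (K, d) candidate actions; X: (n, d) past actions;
    # r, y: (n,) reward and constraint (bandit) feedback; c: safety threshold,
    # i.e. an action x is safe when x @ mu_star <= c for the unknown mu_star.
    d = actions.shape[1]
    A = lam * np.eye(d) + X.T @ X                     # regularized Gram matrix
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ (X.T @ r)                     # RLS estimate of the reward parameter
    mu_hat = A_inv @ (X.T @ y)                        # RLS estimate of the constraint parameter

    # Thompson sampling: perturb the reward estimate with Gaussian noise shaped by A^{-1}.
    theta_tilde = rng.multivariate_normal(theta_hat, v**2 * A_inv)

    # Conservative (inner-approximated) safe set: keep actions whose estimated
    # constraint value stays below c even after adding a confidence width.
    widths = np.sqrt(np.einsum('kd,de,ke->k', actions, A_inv, actions))
    safe = actions @ mu_hat + beta * widths <= c

    if not safe.any():
        return actions[0]                             # placeholder: the papers assume access to a known safe action
    safe_actions = actions[safe]
    return safe_actions[np.argmax(safe_actions @ theta_tilde)]

# Toy usage with synthetic data, purely for illustration.
d, n, K = 3, 20, 50
theta_star, mu_star = rng.normal(size=d), rng.normal(size=d)
X = rng.normal(size=(n, d))
r = X @ theta_star + 0.1 * rng.normal(size=n)
y = X @ mu_star + 0.1 * rng.normal(size=n)
actions = rng.normal(size=(K, d))
print(safe_ts_round(actions, X, r, y, c=1.0))

The randomization step is what the abstract highlights: because theta_tilde varies from round to round, the algorithm can end up playing, and thus gathering information about, actions near the boundary of the conservative safe set, which gradually enlarges the set of actions certified as safe.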