SAFE LINEAR BANDITS
Ahmadreza Moradipari, Sanae Amani, M. Alizadeh, Christos Thrampoulidis
2021 55th Annual Conference on Information Sciences and Systems (CISS), March 24, 2021
DOI: 10.1109/CISS50987.2021.9400288
Abstract
Bandit algorithms have various applications in safety-critical systems, where it is important to respect the system's underlying constraints. The challenge is that such constraints are often unknown, as they depend on the bandit's unknown parameters. In this talk, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend linearly on an unknown parameter vector. As such, the learner is unable to identify all safe actions and must act conservatively, ensuring that its actions satisfy the safety constraint in every round (at least with high probability). For these bandits, we propose new upper-confidence bound (UCB) and Thompson-sampling algorithms, which include the modifications necessary to respect the safety constraints. For two settings (with and without bandit-feedback information on the constraint), we prove regret bounds and discuss their optimality in relation to the corresponding bounds in the absence of safety restrictions. For example, for the setting with bandit-feedback information on the constraint, we present a frequentist regret of order $\mathcal{O}\left(d^{3/2}\log^{1/2} d\,\sqrt{T}\log^{2/3} T\right)$, which, remarkably, matches the results provided by [1] for the standard linear Thompson-sampling algorithm. We highlight how the inherently randomized nature of Thompson sampling helps expand the set of safe actions the algorithm has access to at each round. Finally, we discuss related problem variations with stage-wise baseline constraints, in which the learner must choose actions that not only maximize cumulative reward across the entire time horizon but also satisfy a linear baseline constraint taking the form of a lower bound on the instantaneous reward. The content of this talk is based on [2]-[4].
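To make the per-round logic concrete, the following is a minimal Python sketch of one round of a safe linear Thompson-sampling style rule, not the authors' exact algorithm: regularized least-squares estimates of the reward and constraint parameters, a Gaussian perturbation of the reward estimate, and selection of the best action from a conservatively estimated safe set. The confidence radius beta, the noise scale v, and the fallback action are illustrative placeholders rather than the theoretical quantities from the papers.

import numpy as np

rng = np.random.default_rng(0)

def safe_ts_round(actions, X, r, y, c, lam=1.0, v=1.0, beta=1.0):
    # actions: (K, d) candidate actions; X: (n, d) past actions;
    # r, y: (n,) reward and constraint (bandit) feedback; c: safety threshold,
    # i.e. an action x is safe when x @ mu_star <= c for the unknown mu_star.
    d = actions.shape[1]
    A = lam * np.eye(d) + X.T @ X                     # regularized Gram matrix
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ (X.T @ r)                     # RLS estimate of the reward parameter
    mu_hat = A_inv @ (X.T @ y)                        # RLS estimate of the constraint parameter

    # Thompson sampling: perturb the reward estimate with Gaussian noise shaped by A^{-1}.
    theta_tilde = rng.multivariate_normal(theta_hat, v**2 * A_inv)

    # Conservative (inner-approximated) safe set: keep actions whose estimated
    # constraint value stays below c even after adding a confidence width.
    widths = np.sqrt(np.einsum('kd,de,ke->k', actions, A_inv, actions))
    safe = actions @ mu_hat + beta * widths <= c

    if not safe.any():
        return actions[0]                             # placeholder: the papers assume access to a known safe action
    safe_actions = actions[safe]
    return safe_actions[np.argmax(safe_actions @ theta_tilde)]

# Toy usage with synthetic data, purely for illustration.
d, n, K = 3, 20, 50
theta_star, mu_star = rng.normal(size=d), rng.normal(size=d)
X = rng.normal(size=(n, d))
r = X @ theta_star + 0.1 * rng.normal(size=n)
y = X @ mu_star + 0.1 * rng.normal(size=n)
actions = rng.normal(size=(K, d))
print(safe_ts_round(actions, X, r, y, c=1.0))

The randomization step is what the abstract highlights: because theta_tilde varies from round to round, the algorithm can end up playing, and thus gathering information about, actions near the boundary of the conservative safe set, which gradually enlarges the set of actions certified as safe.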