Mean-variance and value at risk in multi-armed bandit problems

2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton). Pub Date: 2015-09-01. DOI: 10.1109/ALLERTON.2015.7447162

Sattar Vakili, Qing Zhao

Citations: 34

Abstract

We study risk-averse multi-armed bandit problems under different risk measures. We consider three risk mitigation models. In the first model, the variations in the reward values obtained at different times are considered as risk and the objective is to minimize the mean-variance of the observed rewards. In the second and the third models, the quantity of interest is the total reward at the end of the time horizon, and the objective is to minimize the mean-variance and maximize the value at risk of the total reward, respectively. We develop risk-averse online learning policies and analyze their regret performance. We also provide tight lower bounds on regret under the model of mean-variance of observations.
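The abstract names its two risk measures without defining them, so the following is only a minimal sketch under stated assumptions. It assumes the common form of mean-variance, variance minus a risk-tolerance weight `rho` times the mean (lower is better), and treats value at risk as an empirical lower quantile of sampled total rewards (higher is better). The function names, the parameters `rho` and `alpha`, and the quantile convention are all illustrative choices, not taken from the paper.

```python
import math
import statistics


def mean_variance(rewards, rho):
    """Empirical mean-variance of a reward sequence.

    Assumed form: variance - rho * mean, where rho >= 0 is a
    hypothetical risk-tolerance weight. Smaller values are better.
    """
    mean = statistics.fmean(rewards)
    variance = statistics.pvariance(rewards, mu=mean)
    return variance - rho * mean


def value_at_risk(totals, alpha):
    """Empirical value at risk of sampled total rewards.

    Assumed convention: the largest sample value v such that at least
    a fraction (1 - alpha) of the samples are >= v. Larger is better.
    """
    ordered = sorted(totals)
    # Skip the floor(alpha * n) smallest samples, staying in range.
    index = min(math.floor(alpha * len(ordered)), len(ordered) - 1)
    return ordered[index]


if __name__ == "__main__":
    steady_arm = [1.0, 1.0, 1.0, 1.0]   # low mean, no variation
    risky_arm = [0.0, 4.0, 0.0, 4.0]    # higher mean, large variation
    for name, rewards in (("steady", steady_arm), ("risky", risky_arm)):
        print(name, mean_variance(rewards, rho=1.0))
```

Under this assumed definition, a risk-averse learner with a small `rho` prefers the steady arm even though its mean reward is lower, which is the trade-off the first two models in the abstract describe.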


