Multi-model optimization with discounted reward and budget constraint

Proceedings of 2018 International Conference on Mathematics and Artificial Intelligence. Pub Date: 2018-04-20. DOI: 10.1145/3208788.3208796

Jixuan Shi, Mei Chen

Citations: 1

Abstract

Multi-armed bandit (MAB) algorithms are widely used in gaming, gambling, policy generation, and artificial intelligence projects, and have recently received growing attention. In this paper, we explore the non-stationary-reward MAB problem with a limited query budget. We propose DUCB-BF, an upper confidence bound (UCB) based algorithm for the discounted, budget-finite MAB problem, which uses the reward-cost ratio instead of arm rewards in the discounted empirical average. To estimate the instantaneous expected reward-cost ratio, the DUCB-BF policy averages past rewards with a discount factor that gives more weight to recent observations. A theoretical regret bound is established, with a proof that the policy outperforms other MAB algorithms. A real application to the refinement of maintenance recovery models is explored. In a comparison with four other MAB algorithms, DUCB-BF yields the lowest regret, as expected.
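The abstract does not give the exact index formula, so the following is only a minimal sketch of the idea it describes: a discounted UCB loop that tracks reward-cost ratios and stops when the budget is spent. The function name `ducb_bf_sketch`, the arm interface `pull(t, rng) -> (reward, cost)`, the exploration bonus `sqrt(xi * log(n) / N_i)`, and the default values of `gamma` and `xi` are all assumptions made for illustration, not the authors' definitions.

```python
import math
import random


def ducb_bf_sketch(arms, budget, gamma=0.95, xi=0.6, seed=0):
    """Illustrative discounted-UCB loop on reward-cost ratios under a budget.

    arms   : list of callables pull(t, rng) -> (reward, cost), with cost > 0
    budget : total cost that may be spent
    gamma  : discount factor in (0, 1); assumed value
    xi     : exploration constant; assumed value
    Returns a list of (arm, reward, cost) tuples, one per pull.
    """
    rng = random.Random(seed)
    k = len(arms)
    disc_count = [0.0] * k  # discounted number of pulls per arm
    disc_ratio = [0.0] * k  # discounted sum of reward/cost ratios per arm
    history = []
    remaining = budget
    t = 0
    while remaining > 0:
        if t < k:
            arm = t  # initialization: pull every arm once
        else:
            total = max(sum(disc_count), 1.0)  # keeps the log non-negative

            def index(i):
                n_i = max(disc_count[i], 1e-12)  # guard against underflow
                mean = disc_ratio[i] / n_i       # discounted average ratio
                bonus = math.sqrt(xi * math.log(total) / n_i)
                return mean + bonus

            arm = max(range(k), key=index)

        reward, cost = arms[arm](t, rng)
        if cost > remaining:
            break  # this pull cannot be paid for; stop
        remaining -= cost

        # Discount all past observations, then record the new one.
        for i in range(k):
            disc_count[i] *= gamma
            disc_ratio[i] *= gamma
        disc_count[arm] += 1.0
        disc_ratio[arm] += reward / cost
        history.append((arm, reward, cost))
        t += 1
    return history


if __name__ == "__main__":
    # Two hypothetical Bernoulli arms whose success rates swap at t = 100.
    def make_arm(p_before, p_after, cost=1.0):
        def pull(t, rng):
            p = p_before if t < 100 else p_after
            return (1.0 if rng.random() < p else 0.0), cost
        return pull

    run = ducb_bf_sketch([make_arm(0.8, 0.2), make_arm(0.2, 0.8)], budget=200.0)
    late = [arm for arm, _, _ in run[150:]]
    print("pulls of arm 1 after the change:", late.count(1), "of", len(late))
```

Because old observations are multiplied by `gamma` at every step, an arm that has not been pulled recently sees its discounted count shrink and its bonus grow, which is what lets a policy of this kind notice a change in the reward distribution.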


