Low-complexity algorithm for restless bandits with imperfect observations

IF 0.9; Zone 4 (Mathematics); JCR Q3 (MATHEMATICS, APPLIED); Mathematical Methods of Operations Research; Pub Date: 2024-09-05; DOI: 10.1007/s00186-024-00868-x
Keqin Liu, Richard Weber, Chengzhong Zhang
Citations: 0

Abstract

We consider a class of restless bandit problems with broad applications in reinforcement learning and stochastic optimization. There are N independent discrete-time Markov processes, each of which has two possible states: 1 and 0 (‘good’ and ‘bad’). Reward accrues only if a process is both in state 1 and observed to be so. The aim is to maximize the expected discounted sum of returns over the infinite horizon, subject to the constraint that only M (< N) processes may be observed at each step. Observation is error-prone: there are known probabilities that state 1 (0) will be observed as 0 (1). From this one knows, at any time t, the probability that process i is in state 1. The resulting system may be modeled as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems with even finite state spaces are PSPACE-hard in general. We propose a novel approach for simplifying the dynamic programming equations of this class of restless bandits and develop a low-complexity algorithm that achieves strong performance and is readily extensible to the general restless bandit model with observation errors. Under certain conditions, we establish the existence of the Whittle index (indexability) and its equivalence to our algorithm. When those conditions do not hold, we show by numerical experiments the near-optimal performance of our algorithm in the general parameter space. Furthermore, we theoretically prove the optimality of our algorithm for homogeneous systems.
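
The probability that a process is in state 1 (its information state, or belief) can be tracked with a standard Bayes update followed by a one-step Markov prediction. The sketch below is illustrative only and is not the authors' algorithm: the names `Arm`, `p11`, `p01`, `miss`, `false_alarm`, `propagate`, and `update_belief` are assumptions introduced here, as is the timing convention (observe first, then let the state transition).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Arm:
    """Two-state Markov process; all field names are illustrative assumptions."""
    p11: float          # P(next state = 1 | current state = 1)
    p01: float          # P(next state = 1 | current state = 0)
    miss: float         # P(observe 0 | state = 1)
    false_alarm: float  # P(observe 1 | state = 0)


def propagate(arm: Arm, belief: float) -> float:
    """Push P(state = 1) one step forward through the Markov transition."""
    return belief * arm.p11 + (1.0 - belief) * arm.p01


def update_belief(arm: Arm, belief: float, observation: Optional[int] = None) -> float:
    """Return P(state = 1) at the next step.

    observation is 1 or 0 if the arm was observed, or None if it was not.
    """
    if observation is None:
        # Unobserved arm: no new information, only the transition applies.
        return propagate(arm, belief)

    if observation == 1:
        likelihood_1 = 1.0 - arm.miss         # P(observe 1 | state = 1)
        likelihood_0 = arm.false_alarm        # P(observe 1 | state = 0)
    else:
        likelihood_1 = arm.miss               # P(observe 0 | state = 1)
        likelihood_0 = 1.0 - arm.false_alarm  # P(observe 0 | state = 0)

    numerator = belief * likelihood_1
    evidence = numerator + (1.0 - belief) * likelihood_0
    if evidence == 0.0:
        # The observation has zero probability under the model; keep the prior.
        return propagate(arm, belief)
    return propagate(arm, numerator / evidence)


if __name__ == "__main__":
    arm = Arm(p11=0.8, p01=0.2, miss=0.1, false_alarm=0.1)
    for obs in (1, 0, None):
        print(obs, update_belief(arm, 0.5, obs))
```

Because the belief can take any value in [0, 1], the information state space is uncountable, which is why exact dynamic programming is impractical here and a low-complexity index-style rule is attractive.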

Source journal
CiteScore: 1.90
Self-citation rate: 0.00%
Articles published: 36
Review time: >12 weeks
Journal description: This peer-reviewed journal publishes original and high-quality articles on important mathematical and computational aspects of operations research, in particular in the areas of continuous and discrete mathematical optimization, stochastics, and game theory. Theoretically oriented papers are supposed to include explicit motivations of assumptions and results, while application-oriented papers need to contain substantial mathematical contributions. Suggestions for algorithms should be accompanied by numerical evidence for their superiority over state-of-the-art methods. Articles must be of interest to a large audience in operations research, written in clear and correct English, and typeset in LaTeX. A special section contains invited tutorial papers on advanced mathematical or computational aspects of operations research, aiming at making such methodologies accessible to a wider audience. All papers are refereed. The emphasis is on originality, quality, and importance.
Latest articles in this journal
Low-complexity algorithm for restless bandits with imperfect observations
Multi-stage distributionally robust convex stochastic optimization with Bayesian-type ambiguity sets
A new value for communication situations
On the relationship between the value function and the efficient frontier of a mixed integer linear optimization problem
An approximation algorithm for multiobjective mixed-integer convex optimization