Approximation Algorithms for Partial-Information Based Stochastic Control with Markovian Rewards

S. Guha, Kamesh Munagala
{"title":"基于部分信息的马尔可夫奖励随机控制的逼近算法","authors":"S. Guha, Kamesh Munagala","doi":"10.1109/FOCS.2007.12","DOIUrl":null,"url":null,"abstract":"We consider a variant of the classic multi-armed bandit problem (MAB), which we call feedback MAB, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process with known parameters. The evolution of the Markov chain happens irrespective of whether the arm is played, and furthermore, the exact state of the Markov chain is only revealed to the player when the arm is played and the reward observed. At most one arm (or in general, M arms) can be played any time step. The goal is to design a policy for playing the arms in order to maximize the infinite horizon time average expected reward. This problem is an instance of a partially observable Markov decision process (POMDP), and a special case of the notoriously intractable \"restless bandit\" problem. Unlike the stochastic MAB problem, the feedback MAB problem does not admit to greedy index-based optimal policies. Vie state of the system at any time step encodes the beliefs about the states of different arms, and the policy decisions change these beliefs - this aspect complicates the design and analysis of simple algorithms. We design a constant factor approximation to the feedback MAB problem by solving and rounding a natural LP relaxation to this problem. As far as we are aware, this is the first approximation algorithm for a POMDP problem.","PeriodicalId":197431,"journal":{"name":"48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2007-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"59","resultStr":"{\"title\":\"Approximation Algorithms for Partial-Information Based Stochastic Control with Markovian Rewards\",\"authors\":\"S. Guha, Kamesh Munagala\",\"doi\":\"10.1109/FOCS.2007.12\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider a variant of the classic multi-armed bandit problem (MAB), which we call feedback MAB, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process with known parameters. The evolution of the Markov chain happens irrespective of whether the arm is played, and furthermore, the exact state of the Markov chain is only revealed to the player when the arm is played and the reward observed. At most one arm (or in general, M arms) can be played any time step. The goal is to design a policy for playing the arms in order to maximize the infinite horizon time average expected reward. This problem is an instance of a partially observable Markov decision process (POMDP), and a special case of the notoriously intractable \\\"restless bandit\\\" problem. Unlike the stochastic MAB problem, the feedback MAB problem does not admit to greedy index-based optimal policies. Vie state of the system at any time step encodes the beliefs about the states of different arms, and the policy decisions change these beliefs - this aspect complicates the design and analysis of simple algorithms. We design a constant factor approximation to the feedback MAB problem by solving and rounding a natural LP relaxation to this problem. 
As far as we are aware, this is the first approximation algorithm for a POMDP problem.\",\"PeriodicalId\":197431,\"journal\":{\"name\":\"48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"59\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FOCS.2007.12\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FOCS.2007.12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 59

Abstract

We consider a variant of the classic multi-armed bandit problem (MAB), which we call feedback MAB, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process with known parameters. The evolution of the Markov chain happens irrespective of whether the arm is played, and furthermore, the exact state of the Markov chain is only revealed to the player when the arm is played and the reward observed. At most one arm (or in general, M arms) can be played at any time step. The goal is to design a policy for playing the arms in order to maximize the infinite-horizon time-average expected reward. This problem is an instance of a partially observable Markov decision process (POMDP), and a special case of the notoriously intractable "restless bandit" problem. Unlike the stochastic MAB problem, the feedback MAB problem does not admit greedy index-based optimal policies. The state of the system at any time step encodes the beliefs about the states of the different arms, and the policy decisions change these beliefs - this aspect complicates the design and analysis of simple algorithms. We design a constant factor approximation to the feedback MAB problem by solving and rounding a natural LP relaxation to this problem. As far as we are aware, this is the first approximation algorithm for a POMDP problem.
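To make the belief dynamics in the abstract concrete, the following is a minimal simulation sketch, not the paper's LP-rounding algorithm. It assumes each arm is a two-state chain with hypothetical parameters alpha = P(off -> on) and beta = P(on -> off), and uses a simple myopic policy (play the arm with the largest belief-weighted reward) purely for illustration; as the abstract notes, such greedy policies are not optimal for feedback MAB. The arm values in the example are made up.

```python
import random

def propagate(v, alpha, beta):
    # Belief that an arm is "on" after one unobserved step of its two-state
    # chain with P(off -> on) = alpha and P(on -> off) = beta.
    return v * (1.0 - beta) + (1.0 - v) * alpha

def simulate_myopic(arms, horizon, seed=0):
    # Simulate a myopic policy: each step, play the one arm with the largest
    # belief-weighted reward. `arms` is a list of (alpha, beta, reward) tuples.
    # Returns the time-average reward collected.
    rng = random.Random(seed)
    stationary = [a / (a + b) for (a, b, _) in arms]
    state = [rng.random() < p for p in stationary]   # hidden on/off states
    belief = list(stationary)                        # player's belief each arm is "on"
    total = 0.0
    for _ in range(horizon):
        i = max(range(len(arms)), key=lambda j: belief[j] * arms[j][2])
        if state[i]:                                 # reward only if the played arm is "on"
            total += arms[i][2]
        belief[i] = 1.0 if state[i] else 0.0         # playing reveals the exact state
        for j, (a, b, _) in enumerate(arms):
            # every underlying chain evolves, whether or not its arm was played
            state[j] = rng.random() < ((1.0 - b) if state[j] else a)
            belief[j] = propagate(belief[j], a, b)   # unobserved beliefs drift
    return total / horizon

if __name__ == "__main__":
    # three hypothetical arms, each given as (alpha, beta, reward)
    print(simulate_myopic([(0.1, 0.2, 1.0), (0.05, 0.05, 0.8), (0.3, 0.4, 1.5)], 100_000))
```

The recursion in `propagate` drives an unobserved arm's belief toward the chain's stationary probability alpha / (alpha + beta), which is why information from an old observation loses value over time and why the system state is naturally described by these beliefs rather than by the true (hidden) arm states.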