Markov Decision Processes with Multiple Long-Run Average Objectives

Logical Methods in Computer Science. Pub Date: 2011-04-18. DOI: 10.2168/LMCS-10(1:13)2014. IF 0.6; Zone 4, Mathematics; JCR Q4 (Computer Science, Theory & Methods).
Tomáš Brázdil, Václav Brožek, K. Chatterjee, Vojtěch Forejt, Antonín Kučera
Citations: 29

Abstract

We study Markov decision processes (MDPs) with multiple limit-average (or mean-payoff) functions. We consider two different objectives, namely expectation and satisfaction objectives. Given an MDP with k limit-average functions, the goal in the expectation objective is to maximize the expected limit-average value, and the goal in the satisfaction objective is to maximize the probability of runs whose limit-average value stays above a given vector. We show that under the expectation objective, in contrast to the case of one limit-average function, both randomization and memory are necessary for strategies, even for ε-approximation, and that finite-memory randomized strategies are sufficient for achieving Pareto optimal values. Under the satisfaction objective, in contrast to the case of one limit-average function, infinite memory is necessary for strategies achieving a specific value (i.e., randomized finite-memory strategies are not sufficient), whereas memoryless randomized strategies are sufficient for ε-approximation, for all ε > 0. We further prove that the decision problems for both expectation and satisfaction objectives can be solved in polynomial time, and that the trade-off curve (Pareto curve) can be ε-approximated in time polynomial in the size of the MDP and 1/ε, and exponential in the number of limit-average functions, for all ε > 0. Our analysis also reveals flaws in previous work on MDPs with multiple mean-payoff functions under the expectation objective, corrects the flaws, and allows us to obtain improved results.
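The expectation objective can be made concrete on a small example. The sketch below is illustrative only: the two-state MDP, the strategy, and the function name `expected_average_reward` are hypothetical and are not taken from the paper. It fixes a memoryless randomized strategy and computes the expected average reward vector over a finite horizon, used here as a numerical stand-in for the expected limit-average value (for a fixed memoryless strategy on a finite MDP, the finite-horizon average approaches that value as the horizon grows).

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]
# Hypothetical encoding: state -> action -> (successor distribution, reward vector)
MDP = Dict[str, Dict[str, Tuple[Dict[str, float], Vector]]]
# Memoryless randomized strategy: state -> action -> probability
Strategy = Dict[str, Dict[str, float]]


def expected_average_reward(mdp: MDP, strategy: Strategy,
                            start: str, steps: int) -> List[float]:
    """Expected average reward vector over the first `steps` steps."""
    k = len(next(iter(mdp[start].values()))[1])  # number of reward functions
    dist = {s: 0.0 for s in mdp}                 # current state distribution
    dist[start] = 1.0
    totals = [0.0] * k
    for _ in range(steps):
        nxt = {s: 0.0 for s in mdp}
        for s, p_s in dist.items():
            if p_s == 0.0:
                continue
            for a, p_a in strategy[s].items():
                succ, reward = mdp[s][a]
                weight = p_s * p_a
                # Accumulate the expected reward collected in this step.
                for i in range(k):
                    totals[i] += weight * reward[i]
                # Push probability mass to the successor states.
                for t, p_t in succ.items():
                    nxt[t] += weight * p_t
        dist = nxt
    return [x / steps for x in totals]


# Toy MDP (an assumption for illustration): staying in "s" pays only the
# first reward, staying in "t" pays only the second, and moving pays nothing.
TOY_MDP: MDP = {
    "s": {"stay": ({"s": 1.0}, (1.0, 0.0)), "go": ({"t": 1.0}, (0.0, 0.0))},
    "t": {"stay": ({"t": 1.0}, (0.0, 1.0)), "back": ({"s": 1.0}, (0.0, 0.0))},
}
TOY_STRATEGY: Strategy = {
    "s": {"stay": 0.9, "go": 0.1},
    "t": {"stay": 0.9, "back": 0.1},
}

result = expected_average_reward(TOY_MDP, TOY_STRATEGY, "s", 20_000)
print(result)
```

In this toy model the randomized strategy earns approximately (0.45, 0.45), whereas the three deterministic memoryless strategies started in "s" earn only (1, 0), (0, 1), or (0, 0). This gives some intuition for the trade-offs between reward components that the abstract refers to.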
Source journal
Logical Methods in Computer Science (Engineering and Technology: Computer Science, Theory & Methods)
CiteScore: 1.80
Self-citation rate: 0.00%
Articles published: 105
Review time: 6-12 weeks
About the journal: Logical Methods in Computer Science is a fully refereed, open access, free, electronic journal. It welcomes papers on theoretical and practical areas in computer science involving logical methods, taken in a broad sense; some particular areas within its scope are listed below. Papers are refereed in the traditional way, with two or more referees per paper. Copyright is retained by the author.
Topics of Logical Methods in Computer Science: Algebraic methods; Automata and logic; Automated deduction; Categorical models and logic; Coalgebraic methods; Computability and logic; Computer-aided verification; Concurrency theory; Constraint programming; Cyber-physical systems; Database theory; Defeasible reasoning; Domain theory; Emerging topics: Computational systems in biology; Emerging topics: Quantum computation and logic; Finite model theory; Formalized mathematics; Functional programming and lambda calculus; Inductive logic and learning; Interactive proof checking; Logic and algorithms; Logic and complexity; Logic and games; Logic and probability; Logic for knowledge representation; Logic programming; Logics of programs; Modal and temporal logics; Program analysis and type checking; Program development and specification; Proof complexity; Real time and hybrid systems; Reasoning about actions and planning; Satisfiability; Security; Semantics of programming languages; Term rewriting and equational logic; Type theory and constructive mathematics.