Ultimately Stationary Policies to Approximate Risk-Sensitive Discounted MDPs

M. UdayKumar, S. Bhat, V. Kavitha, N. Hemachandra
{"title":"Ultimately Stationary Policies to Approximate Risk-Sensitive Discounted MDPs","authors":"M. UdayKumar, S. Bhat, V. Kavitha, N. Hemachandra","doi":"10.1145/3306309.3306320","DOIUrl":null,"url":null,"abstract":"Risk-sensitive Markov Decision Process (RSMDP) models are less studied than linear Markov decision models. Linear models optimize only expected cost whereas RSMDP models optimize a combination of expected cost and higher moments of the cost. On the other hand, optimal policies in RSMDP models are generally non-stationary, and need not even be ultimately stationary. This makes optimal policies more difficult to compute and implement in RSMDP models. We provide an algorithm, called Ultimately Stationary Linear Discounted (USLD), to compute an ∈-optimal ultimately stationary policy whose risk-sensitive cost approximates the optimal risk-sensitive cost within any specified degree of accuracy ∈. The algorithm approximates the tail costs with the optimal linear discounted cost, which is then treated as the terminal cost of a finite-horizon RSMDP. For the sake of comparison, we also consider an alternative method from the literature, which we call Ultimately Stationary Tail Off (USTO). USTO is based on the intuition that all decisions beyond a sufficiently large decision epoch make a negligibly small contribution to the total discounted cost. Accordingly, USTO involves optimizing the finite horizon RSMDP cost obtained by ignoring all stage-wise costs that occur after a sufficiently large decision epoch, and then concatenating the resulting finite-horizon policy with an arbitrarily chosen stationary policy to get a ∈-optimal ultimately stationary policy. We provide proofs of risk-sensitive ∈-optimality of policies yielded by USLD and USTO, and compare the performance of both on an inventory control problem.","PeriodicalId":113198,"journal":{"name":"Proceedings of the 12th EAI International Conference on Performance Evaluation Methodologies and Tools","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th EAI International Conference on Performance Evaluation Methodologies and Tools","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3306309.3306320","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Risk-sensitive Markov Decision Process (RSMDP) models are less studied than linear Markov decision models. Linear models optimize only the expected cost, whereas RSMDP models optimize a combination of the expected cost and higher moments of the cost. On the other hand, optimal policies in RSMDP models are generally non-stationary, and need not even be ultimately stationary. This makes optimal policies in RSMDP models harder to compute and implement. We provide an algorithm, called Ultimately Stationary Linear Discounted (USLD), to compute an ε-optimal ultimately stationary policy, i.e., one whose risk-sensitive cost approximates the optimal risk-sensitive cost within any specified accuracy ε. The algorithm approximates the tail costs with the optimal linear discounted cost, which is then treated as the terminal cost of a finite-horizon RSMDP. For comparison, we also consider an alternative method from the literature, which we call Ultimately Stationary Tail Off (USTO). USTO is based on the intuition that all decisions beyond a sufficiently large decision epoch make a negligibly small contribution to the total discounted cost. Accordingly, USTO optimizes the finite-horizon RSMDP cost obtained by ignoring all stage-wise costs that occur after a sufficiently large decision epoch, and then concatenates the resulting finite-horizon policy with an arbitrarily chosen stationary policy to obtain an ε-optimal ultimately stationary policy. We prove risk-sensitive ε-optimality of the policies yielded by USLD and USTO, and compare their performance on an inventory control problem.
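To make the abstract's description concrete: the paper's own definitions are not reproduced on this page, but a common exponential-utility formulation of the risk-sensitive discounted cost (which matches the abstract's remark that RSMDP models combine expected cost with higher moments) is

\[
  J^{\pi}_{\theta}(x) \;=\; \frac{1}{\theta}\,
  \log \mathbb{E}^{\pi}_{x}\!\left[\exp\!\Big(\theta \sum_{t=0}^{\infty} \beta^{t}\, c(x_t, a_t)\Big)\right],
\]

where θ > 0 is the risk parameter and β ∈ (0, 1) the discount factor. For small θ, a Taylor expansion gives J ≈ E[C] + (θ/2) Var(C) for the total discounted cost C, which is why the criterion penalizes variability and not just the mean; the paper's exact criterion may differ in details.

As a rough illustration of the USTO construction described above (not the authors' implementation), the sketch below truncates the horizon at epoch T, runs multiplicative backward induction on the exponential-utility recursion with W_T ≡ 1 (tail stage costs dropped), and concatenates the resulting time-dependent policy with an arbitrary stationary tail policy. The function name, argument names, and input format are assumptions made for illustration only.

```python
import numpy as np

def usto_policy(P, c, beta, theta, T, tail_policy):
    """Hypothetical sketch of the USTO idea.

    P[a] is an (S, S) transition matrix for action a, c is an (S, A)
    stage-cost array, beta in (0, 1) is the discount factor, theta > 0
    the risk parameter, T the truncation horizon, and tail_policy an
    (S,) array giving an arbitrary stationary policy used after T.
    """
    S, A = c.shape
    # Multiplicative backward induction on
    #   W_t(x) = min_a exp(theta * beta^t * c(x, a)) * sum_x' P(x'|x,a) W_{t+1}(x'),
    # so that (1/theta) * log W_0(x) is the optimal truncated
    # risk-sensitive cost starting from state x.
    W = np.ones(S)                       # W_T ≡ 1: costs beyond T are ignored
    policy = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):
        Q = np.empty((S, A))
        for a in range(A):
            Q[:, a] = np.exp(theta * beta**t * c[:, a]) * (P[a] @ W)
        policy[t] = Q.argmin(axis=1)     # time-dependent (non-stationary) choice
        W = Q.min(axis=1)
    cost = np.log(W) / theta             # truncated optimal cost per state
    # Ultimately stationary policy: follow policy[t] for t < T,
    # then tail_policy forever after.
    return policy, tail_policy, cost

# Tiny usage example with made-up data: 2 states, 2 actions.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # P[a][x] is a distribution over x'
c = rng.uniform(0.0, 1.0, size=(2, 2))
pi, tail, cost = usto_policy(P, c, beta=0.9, theta=0.5, T=40,
                             tail_policy=np.zeros(2, dtype=int))
```

Note how the optimal action can depend on the epoch t (through the factor β^t inside the exponential), which reflects the abstract's point that risk-sensitive optimal policies are generally non-stationary; truncation plus an arbitrary stationary tail is what makes the resulting policy ultimately stationary.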