{"title":"Ultimately Stationary Policies to Approximate Risk-Sensitive Discounted MDPs","authors":"M. UdayKumar, S. Bhat, V. Kavitha, N. Hemachandra","doi":"10.1145/3306309.3306320","DOIUrl":null,"url":null,"abstract":"Risk-sensitive Markov Decision Process (RSMDP) models are less studied than linear Markov decision models. Linear models optimize only expected cost whereas RSMDP models optimize a combination of expected cost and higher moments of the cost. On the other hand, optimal policies in RSMDP models are generally non-stationary, and need not even be ultimately stationary. This makes optimal policies more difficult to compute and implement in RSMDP models. We provide an algorithm, called Ultimately Stationary Linear Discounted (USLD), to compute an ∈-optimal ultimately stationary policy whose risk-sensitive cost approximates the optimal risk-sensitive cost within any specified degree of accuracy ∈. The algorithm approximates the tail costs with the optimal linear discounted cost, which is then treated as the terminal cost of a finite-horizon RSMDP. For the sake of comparison, we also consider an alternative method from the literature, which we call Ultimately Stationary Tail Off (USTO). USTO is based on the intuition that all decisions beyond a sufficiently large decision epoch make a negligibly small contribution to the total discounted cost. Accordingly, USTO involves optimizing the finite horizon RSMDP cost obtained by ignoring all stage-wise costs that occur after a sufficiently large decision epoch, and then concatenating the resulting finite-horizon policy with an arbitrarily chosen stationary policy to get a ∈-optimal ultimately stationary policy. We provide proofs of risk-sensitive ∈-optimality of policies yielded by USLD and USTO, and compare the performance of both on an inventory control problem.","PeriodicalId":113198,"journal":{"name":"Proceedings of the 12th EAI International Conference on Performance Evaluation Methodologies and Tools","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th EAI International Conference on Performance Evaluation Methodologies and Tools","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3306309.3306320","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3
Abstract
Risk-sensitive Markov Decision Process (RSMDP) models are less studied than linear Markov decision models. Linear models optimize only the expected cost, whereas RSMDP models optimize a combination of the expected cost and higher moments of the cost. On the other hand, optimal policies in RSMDP models are generally non-stationary, and need not even be ultimately stationary. This makes optimal policies more difficult to compute and implement in RSMDP models. We provide an algorithm, called Ultimately Stationary Linear Discounted (USLD), to compute an ε-optimal ultimately stationary policy whose risk-sensitive cost approximates the optimal risk-sensitive cost within any specified degree of accuracy ε. The algorithm approximates the tail costs with the optimal linear discounted cost, which is then treated as the terminal cost of a finite-horizon RSMDP. For the sake of comparison, we also consider an alternative method from the literature, which we call Ultimately Stationary Tail Off (USTO). USTO is based on the intuition that all decisions beyond a sufficiently large decision epoch make a negligibly small contribution to the total discounted cost. Accordingly, USTO involves optimizing the finite-horizon RSMDP cost obtained by ignoring all stage-wise costs that occur after a sufficiently large decision epoch, and then concatenating the resulting finite-horizon policy with an arbitrarily chosen stationary policy to obtain an ε-optimal ultimately stationary policy. We provide proofs of risk-sensitive ε-optimality of the policies yielded by USLD and USTO, and compare the performance of both on an inventory control problem.
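To make the construction concrete, the following is a minimal sketch of the USTO-style backward recursion, assuming the standard exponential-utility form of the discounted risk-sensitive criterion, E[exp(γ Σ_t β^t c(x_t, a_t))], on a finite tabular MDP. The names (P, c, gamma, beta, T, usto_prefix_policy) and the setup are our illustrative assumptions, not the paper's implementation. Note that the stage-cost multiplier β^t changes with t, which is exactly why the optimal policy is non-stationary.

```python
import numpy as np

def usto_prefix_policy(P, c, gamma, beta, T):
    """Truncated finite-horizon risk-sensitive DP (USTO-style sketch).

    P     : array (A, S, S); P[a, x, y] = transition probability
    c     : array (S, A); stage cost c(x, a)
    gamma : risk-sensitivity parameter (> 0, risk-averse)
    beta  : discount factor in (0, 1)
    T     : truncation epoch

    Returns the non-stationary prefix policy pi[t][x] for t = 0..T-1;
    for stages t >= T it is concatenated with any stationary policy.
    """
    S, A = c.shape
    V = np.ones(S)  # terminal multiplicative value: tail costs ignored (USTO)
    pi = [None] * T
    for t in reversed(range(T)):
        # Q[x, a] = exp(gamma * beta^t * c(x, a)) * sum_y P(y | x, a) * V(y)
        expected = np.stack([P[a] @ V for a in range(A)], axis=1)  # shape (S, A)
        Q = np.exp(gamma * (beta ** t) * c) * expected
        pi[t] = Q.argmin(axis=1)  # greedy action per state at epoch t
        V = Q.min(axis=1)
    return pi
```

Under the same assumptions, the USLD variant would differ only in the terminal condition: instead of V = np.ones(S), one would set V = np.exp(gamma * beta**T * J_lin), where J_lin is the optimal linear (risk-neutral) discounted value function, so that the tail is approximated by the optimal linear discounted cost rather than ignored.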