Learning and Planning for Time-Varying MDPs Using Maximum Likelihood Estimation.

Journal of Machine Learning Research (JMLR). Published: 2021-01-01. Epub: 2021-02-01.
Melkior Ornik, Ufuk Topcu
Pages: 1-40. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8739185/pdf/

Abstract

This paper proposes a formal approach to online learning and planning for agents operating in a priori unknown, time-varying environments. The proposed method computes the maximally likely model of the environment, given the observations about the environment made by an agent earlier in the system run and assuming knowledge of a bound on the maximal rate of change of system dynamics. Such an approach generalizes the estimation method commonly used in learning algorithms for unknown Markov decision processes with time-invariant transition probabilities, but is also able to quickly and correctly identify the system dynamics following a change. Based on the proposed method, we generalize the exploration bonuses used in learning for time-invariant Markov decision processes by introducing a notion of uncertainty in a learned time-varying model, and develop a control policy for time-varying Markov decision processes based on the exploitation and exploration trade-off. We demonstrate the proposed methods on four numerical examples: a patrolling task with a change in system dynamics, a two-state MDP with periodically changing outcomes of actions, a wind flow estimation task, and a multi-armed bandit problem with periodically changing probabilities of different rewards.

