Uncertainty Quantification and Exploration for Reinforcement Learning

IF 0.7 · CAS Tier 4 (Management Science) · JCR Q3 (Engineering) · Military Operations Research · Pub Date: 2019-10-12 · DOI: 10.1287/opre.2023.2436
Yi Zhu, Jing Dong, H. Lam
Citations: 1

Abstract

Quantify the uncertainty to decide and explore better. In statistical inference, large-sample behavior and confidence interval construction are fundamental in assessing the error and reliability of estimated quantities with respect to data noise. In the paper “Uncertainty Quantification and Exploration for Reinforcement Learning”, Dong, Lam, and Zhu study large-sample behavior in the classic setting of reinforcement learning. They derive appropriate large-sample asymptotic distributions for the state–action value function (Q-value) and optimal value function estimates when data are collected from the underlying Markov chain. This allows one to evaluate how confidently performance can be compared across different decisions. The tight uncertainty quantification also facilitates the development of a pure exploration policy that maximizes the worst-case relative discrepancy among the estimated Q-values (the ratio of the mean squared difference to the variance). This exploration policy aims to collect informative training data so as to maximize the probability of learning the optimal reward-collecting policy, and it achieves good empirical performance.
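The large-sample results described above support interval estimation in the usual way: if a Q-value estimate is asymptotically normal, a confidence interval follows from its estimated asymptotic variance. A minimal sketch of that idea (the function name, interface, and plug-in variance are illustrative assumptions, not the paper's actual construction):

```python
import math

def q_value_ci(q_hat, asym_var, n, z=1.96):
    """Normal-approximation confidence interval for an estimated Q-value.

    Assumes q_hat is asymptotically normal with variance asym_var / n,
    where n is the number of observed transitions from the underlying
    Markov chain; z = 1.96 gives a nominal 95% interval.
    """
    half_width = z * math.sqrt(asym_var / n)
    return q_hat - half_width, q_hat + half_width

# With 400 observed transitions and an estimated asymptotic variance
# of 2.5, the interval around q_hat = 3.4 has half-width of about 0.155.
lo, hi = q_value_ci(q_hat=3.4, asym_var=2.5, n=400)
```

A wider interval signals that more data are needed before the corresponding action's value can be distinguished from its competitors, which is exactly the gap the exploration policy targets.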
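The abstract's exploration criterion, maximizing the worst-case relative discrepancy among estimated Q-values, can be illustrated with a small sketch. The pairwise formula used here, (q_i − q_j)² / (σ_i² + σ_j²), is an assumption standing in for the paper's exact definition, and the function and sampling plans are hypothetical:

```python
def worst_case_relative_discrepancy(q_means, q_vars):
    """Smallest pairwise relative discrepancy among estimated Q-values.

    The relative discrepancy between actions i and j is taken here as
    (q_i - q_j)^2 / (var_i + var_j): squared mean difference scaled by
    estimation variance. The exploration idea is to collect data so that
    the worst (smallest) such ratio is as large as possible, i.e. even
    the hardest-to-distinguish pair of actions is well separated
    relative to the noise in their estimates.
    """
    worst = float("inf")
    n = len(q_means)
    for i in range(n):
        for j in range(i + 1, n):
            disc = (q_means[i] - q_means[j]) ** 2 / (q_vars[i] + q_vars[j])
            worst = min(worst, disc)
    return worst

# Two hypothetical sampling plans for the same three actions: plan B
# spends its samples shrinking the variance of the two closest actions,
# which raises the worst-case discrepancy, so it is more informative.
means = [1.0, 1.2, 2.0]
plan_a = worst_case_relative_discrepancy(means, [0.10, 0.10, 0.10])
plan_b = worst_case_relative_discrepancy(means, [0.02, 0.02, 0.10])
```

Intuitively, the criterion directs samples toward the pair of actions that is currently hardest to rank, which is what drives the probability of identifying the optimal policy.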
Source journal
Military Operations Research
Management Science – Operations Research & Management Science
CiteScore: 1.00
Self-citation rate: 0.00%
Articles published: 0
Review time: >12 weeks
Journal introduction: Military Operations Research is a peer-reviewed journal of high academic quality. The journal publishes articles that describe operations research (OR) methodologies and theories used in key military and national security applications. Of particular interest are papers that:
- Present case studies showing innovative OR applications
- Apply OR to major policy issues
- Introduce interesting new problem areas
- Highlight education issues
- Document the history of military and national security OR
Latest articles from this journal
- Optimal Routing Under Demand Surges: The Value of Future Arrival Rates
- Demand Estimation Under Uncertain Consideration Sets
- Optimal Routing to Parallel Servers in Heavy Traffic
- The When and How of Delegated Search
- A Data-Driven Approach to Beating SAA Out of Sample