Projected state-action balancing weights for offline reinforcement learning

IF 3.7 · Zone 1 Mathematics · Q1 STATISTICS & PROBABILITY · Annals of Statistics · Pub Date: 2023-08-01 · DOI: 10.1214/23-aos2302

Jiayi Wang, Zhengling Qi, Raymond K. W. Wong


Citations: 6

Abstract



Projected state-action balancing weights for offline reinforcement learning

Off-policy evaluation is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights, and show that the proposed value estimator is asymptotically normal under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points at each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the operator that relates to the nonparametric Q-function estimation in the off-policy setting, which characterizes the difficulty of Q-function estimation and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
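The marginal importance sampling idea mentioned in the abstract can be illustrated at the population level in a small tabular MDP. The sketch below is not the paper's estimator, which estimates the weights from data. It uses exact weights, defined as the ratio of the target policy's discounted state-action occupancy to an assumed data distribution, and checks two standard identities: the weighted average reward equals the normalized policy value, and the weights satisfy a state-action balancing equation for an arbitrary test function. All names (`discounted_occupancy`, `policy_value_direct`, `p_b`, `omega`) and the random toy MDP are assumptions made for illustration.

```python
import numpy as np

def discounted_occupancy(P, pi, G, gamma):
    """d(s, a) = (1 - gamma) * sum_t gamma^t Pr(S_t = s, A_t = a) under pi."""
    n_states = pi.shape[0]
    # State-to-state transitions under pi: P_pi[s, s'] = sum_a pi[s, a] P[s, a, s']
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Discounted state occupancy solves d_S = (1 - gamma) G + gamma P_pi^T d_S
    d_state = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, (1 - gamma) * G)
    return d_state[:, None] * pi

def policy_value_direct(P, R, pi, G, gamma):
    """Normalized value (1 - gamma) * E_{S_0 ~ G}[V^pi(S_0)] via the Bellman equation."""
    n_states = pi.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return (1 - gamma) * (G @ V)

# Hypothetical toy MDP: 3 states, 2 actions, random dynamics and rewards.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(size=(n_states, n_actions))                       # mean reward r(s, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # target policy pi[s, a]
G = rng.dirichlet(np.ones(n_states))                              # initial state distribution

# Assumed distribution of (state, action) pairs in the offline data.
p_b = rng.dirichlet(np.ones(n_states * n_actions)).reshape(n_states, n_actions)

# Exact marginal importance weights: occupancy ratio.
d_pi = discounted_occupancy(P, pi, G, gamma)
omega = d_pi / p_b

# Identity 1: weighted average reward under the data distribution equals the value.
value_mis = (p_b * omega * R).sum()
value_direct = policy_value_direct(P, R, pi, G, gamma)

# Identity 2: balancing equation for an arbitrary test function f(s, a).
f = rng.normal(size=(n_states, n_actions))
f_pi = (pi * f).sum(axis=1)   # sum_a pi(a | s) f(s, a)
next_f = P @ f_pi             # E[f_pi(S') | s, a], shape (n_states, n_actions)
balance_lhs = (p_b * omega * (f - gamma * next_f)).sum()
balance_rhs = (1 - gamma) * (G @ f_pi)

print(value_mis, value_direct)
print(balance_lhs, balance_rhs)
```

In practice the occupancy ratio is unknown, which is why a data-driven approach would choose weights so that an empirical version of the balancing equation holds approximately over a chosen class of test functions.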


Source journal

Annals of Statistics (Mathematics: Statistics & Probability)

CiteScore: 9.30
Self-citation rate: 8.90%
Articles published: 119
Review time: 6-12 weeks

Journal description: The Annals of Statistics aims to publish research papers of the highest quality reflecting the many facets of contemporary statistics. Primary emphasis is placed on importance and originality, not on formalism. The journal aims to cover all areas of statistics, especially mathematical statistics and applied and interdisciplinary statistics. Of course, many of the best papers will touch on more than one of these general areas, because the discipline of statistics has deep roots in mathematics and in substantive scientific fields.