Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

SIAM Journal on Mathematics of Data Science (IF 1.9, Q1, Mathematics, Applied). Pub Date: 2020-03-16. DOI: 10.1137/20m1331524

K. Khamaru, A. Pananjady, Feng Ruan, M. Wainwright, Michael I. Jordan

Citations: 39

Abstract

We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.
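To make the setting concrete, here is a minimal sketch of policy evaluation under a generative model: synchronous TD(0) with Polyak-Ruppert averaging, compared against the exact value function in $\ell_\infty$-error. It is an illustration only, not the authors' code and not their variance-reduced method. The function names, the step-size schedule $\alpha_k = k^{-0.75}$, and the deterministic reward vector are all assumptions made for the example.

```python
import numpy as np


def true_value(P, r, gamma):
    """Exact value function of the policy: solves (I - gamma * P) V = r."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)


def td_generative(P, r, gamma, num_iters, rng, step_exponent=0.75):
    """Synchronous TD(0) with Polyak-Ruppert averaging (illustrative sketch).

    At every iteration the generative model supplies one sampled next
    state for each state. The step size alpha_k = k ** (-step_exponent)
    is an assumed schedule, not one prescribed by the paper.
    Returns (last_iterate, averaged_iterate).
    """
    n = P.shape[0]
    cdf = np.cumsum(P, axis=1)        # row-wise CDFs for inverse-CDF sampling
    V = np.zeros(n)                   # current TD iterate
    V_avg = np.zeros(n)               # running Polyak-Ruppert average
    for k in range(1, num_iters + 1):
        # Draw s' ~ P(. | s) for every state s at once.
        u = rng.random(n)
        next_states = np.minimum((cdf < u[:, None]).sum(axis=1), n - 1)
        # TD(0) update using the sampled Bellman target r(s) + gamma * V(s').
        alpha = k ** (-step_exponent)
        V = V + alpha * (r + gamma * V[next_states] - V)
        # Incremental mean of the iterates V_1, ..., V_k.
        V_avg += (V - V_avg) / k
    return V, V_avg


if __name__ == "__main__":
    # Hypothetical 3-state Markov reward process, chosen only for the demo.
    P = np.array([[0.9, 0.1, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])
    r = np.array([1.0, 0.0, 0.5])
    gamma = 0.5
    v_star = true_value(P, r, gamma)
    _, v_avg = td_generative(P, r, gamma, 5000, np.random.default_rng(0))
    print("l_inf error of averaged TD:", np.max(np.abs(v_avg - v_star)))
```

Rerunning the demo with a discount factor closer to 1 or with fewer samples gives a rough feel for how the error of averaged TD depends on the particular instance, which is the kind of dependence the abstract's guarantees aim to capture.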

