Bootstrapped Representations in Reinforcement Learning
Charline Le Lan, Stephen Tu, Mark Rowland, A. Harutyunyan, Rishabh Agarwal, Marc G. Bellemare, Will Dabney
DOI: 10.48550/arXiv.2306.10171
Proceedings of the ... International Conference on Machine Learning, pp. 18686-18713, 2023-06-16
Abstract
In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, auxiliary objectives are often incorporated into the learning process and help shape the learnt state representation. Bootstrapping methods are today's method of choice to make these additional predictions. Yet, it is unclear which features these algorithms capture and how they relate to those from other auxiliary-task-based approaches. In this paper, we address this gap and provide a theoretical characterization of the state representation learnt by temporal difference learning (Sutton, 1988). Surprisingly, we find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting. We describe the efficacy of these representations for policy evaluation, and use our theoretical analysis to design new auxiliary learning rules. We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al., 1999) and Mountain Car (Moore, 1990).
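The abstract contrasts the representations induced by temporal difference (TD) learning with those of Monte Carlo and residual gradient methods in the policy evaluation setting. As a point of reference only, the sketch below (not the authors' code) shows the three update rules for policy evaluation with fixed linear features on a small random MDP; the MDP, the feature matrix Phi, and all variable names are illustrative assumptions. The paper's analysis concerns the learned features themselves, which are held fixed here.

# Minimal illustrative sketch: three policy-evaluation update rules
# (TD(0), Monte Carlo, residual gradient) with fixed linear features.
# The random MDP and all names here are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_features, gamma = 10, 4, 0.9

# Random transition matrix P (rows sum to 1) and reward vector r for a fixed policy.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_states)

# Fixed random features Phi; the value estimate is V(s) = Phi[s] @ w.
Phi = rng.normal(size=(n_states, n_features))

def td_update(w, s, s_next, reward, lr=0.05):
    # TD(0): bootstrap on the current estimate at the next state.
    target = reward + gamma * Phi[s_next] @ w
    return w + lr * (target - Phi[s] @ w) * Phi[s]

def mc_update(w, s, mc_return, lr=0.05):
    # Monte Carlo: regress directly on a sampled return (no bootstrapping).
    return w + lr * (mc_return - Phi[s] @ w) * Phi[s]

def residual_gradient_update(w, s, s_next, reward, lr=0.05):
    # Residual gradient: full gradient of the squared Bellman error,
    # so the next-state features also enter the update.
    delta = reward + gamma * Phi[s_next] @ w - Phi[s] @ w
    return w + lr * delta * (Phi[s] - gamma * Phi[s_next])

# True values for reference: V = (I - gamma * P)^{-1} r.
V_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)

# Demo loop runs TD(0) only; the other two rules are shown for contrast.
w_td = np.zeros(n_features)
s = 0
for _ in range(20000):
    s_next = rng.choice(n_states, p=P[s])
    w_td = td_update(w_td, s, s_next, r[s])
    s = s_next

print("TD value-estimation error:", np.linalg.norm(Phi @ w_td - V_true))

With the features fixed, all three rules estimate value weights; the paper instead asks which features emerge when the representation itself is trained by each rule, and shows that TD's bootstrapped features generally differ from the Monte Carlo and residual gradient ones.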