Curious Explorer: A Provable Exploration Strategy in Policy Learning

Marco Miani, Maurizio Parton, Marco Romito
{"title":"Curious Explorer: A Provable Exploration Strategy in Policy Learning","authors":"Marco Miani;Maurizio Parton;Marco Romito","doi":"10.1109/TPAMI.2024.3460972","DOIUrl":null,"url":null,"abstract":"A coverage assumption is critical with policy gradient methods, because while the objective function is insensitive to updates in unlikely states, the agent may need improvements in those states to reach a nearly optimal payoff. However, this assumption can be unfeasible in certain environments, for instance in online learning, or when restarts are possible only from a fixed initial state. In these cases, classical policy gradient algorithms like REINFORCE can have poor convergence properties and sample efficiency. Curious Explorer is an iterative state space pure exploration strategy improving coverage of any restart distribution \n<inline-formula><tex-math>$\\rho$</tex-math></inline-formula>\n. Using \n<inline-formula><tex-math>$\\rho$</tex-math></inline-formula>\n and intrinsic rewards, Curious Explorer produces a sequence of policies, each one more exploratory than the previous one, and outputs a restart distribution with coverage based on the state visitation distribution of the exploratory policies. This paper main results are a theoretical upper bound on how often an optimal policy visits poorly visited states, and a bound on the error of the return obtained by REINFORCE without any coverage assumption. Finally, we conduct ablation studies with \n<monospace>REINFORCE</monospace>\n and \n<monospace>TRPO</monospace>\n in two hard-exploration tasks, to support the claim that Curious Explorer can improve the performance of very different policy gradient algorithms.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11422-11431"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10680592/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

A coverage assumption is critical with policy gradient methods, because while the objective function is insensitive to updates in unlikely states, the agent may need improvements in those states to reach a nearly optimal payoff. However, this assumption can be infeasible in certain environments, for instance in online learning, or when restarts are possible only from a fixed initial state. In these cases, classical policy gradient algorithms like REINFORCE can have poor convergence properties and sample efficiency. Curious Explorer is an iterative state-space pure-exploration strategy that improves the coverage of any restart distribution $\rho$. Using $\rho$ and intrinsic rewards, Curious Explorer produces a sequence of policies, each one more exploratory than the previous one, and outputs a restart distribution whose coverage is based on the state visitation distribution of the exploratory policies. The main results of this paper are a theoretical upper bound on how often an optimal policy visits poorly visited states, and a bound on the error of the return obtained by REINFORCE without any coverage assumption. Finally, we conduct ablation studies with REINFORCE and TRPO in two hard-exploration tasks, to support the claim that Curious Explorer can improve the performance of very different policy gradient algorithms.
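The abstract sketches an iterative loop: train a policy on intrinsic rewards, record which states it visits, and fold those visitation statistics back into the restart distribution. The snippet below is a minimal illustrative sketch of that loop under stated assumptions, not the authors' algorithm: the toy chain MDP, the count-based novelty bonus, the tabular Q-learning inner loop, and the 50/50 mixing rule are all hypothetical choices made only to keep the example self-contained and runnable.

```python
# Illustrative sketch only: the chain MDP, novelty bonus, Q-learning inner loop,
# and mixing coefficient are assumptions, not the paper's construction.
import numpy as np

N_STATES, N_ACTIONS, HORIZON = 10, 2, 20   # hypothetical 10-state chain MDP


def step(state, action):
    """Toy dynamics: action 1 moves one state right, action 0 resets to state 0."""
    return min(state + 1, N_STATES - 1) if action == 1 else 0


def explore_episode(q_table, restart_dist, visit_counts, rng, eps=0.1):
    """One episode of an eps-greedy policy trained on intrinsic rewards only.
    The count-based novelty bonus pushes the policy toward poorly visited states."""
    state = rng.choice(N_STATES, p=restart_dist)
    for _ in range(HORIZON):
        if rng.random() < eps:
            action = rng.integers(N_ACTIONS)
        else:
            action = int(np.argmax(q_table[state]))
        next_state = step(state, action)
        visit_counts[next_state] += 1
        intrinsic = 1.0 / np.sqrt(visit_counts[next_state])
        # Tabular Q-learning update driven by the intrinsic reward.
        q_table[state, action] += 0.5 * (
            intrinsic + 0.95 * q_table[next_state].max() - q_table[state, action]
        )
        state = next_state


def curious_explorer(rho, n_iterations=5, episodes_per_iter=200, seed=0):
    """Return a restart distribution whose coverage is built from the state
    visitation frequencies of a sequence of increasingly exploratory policies."""
    rng = np.random.default_rng(seed)
    rho = np.asarray(rho, dtype=float)
    restart_dist = rho.copy()
    visit_counts = np.ones(N_STATES)          # start at 1 to avoid division by zero
    for _ in range(n_iterations):
        q_table = np.zeros((N_STATES, N_ACTIONS))
        for _ in range(episodes_per_iter):
            explore_episode(q_table, restart_dist, visit_counts, rng)
        # Mix rho with the visitation distribution of the exploratory policies,
        # so states that rho alone rarely reaches become likely restart points.
        visitation = visit_counts / visit_counts.sum()
        restart_dist = 0.5 * rho + 0.5 * visitation
    return restart_dist


if __name__ == "__main__":
    rho = np.zeros(N_STATES)
    rho[0] = 1.0                              # restarts only from a fixed initial state
    print(curious_explorer(rho))
```

In the setting the paper describes, a policy gradient method such as REINFORCE or TRPO would then be trained with restarts drawn from the returned distribution rather than from $\rho$ alone, which is where the improved coverage pays off.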