Improved Exploration With Demonstrations in Procedurally-Generated Environments

IEEE Transactions on Games · Impact Factor 1.7 · JCR Q3 (Computer Science, Artificial Intelligence) · CAS Quartile 4 (Computer Science) · Publication date: 2023-07-31 · DOI: 10.1109/TG.2023.3299986
Mao Xu;Shuzhi Sam Ge;Dongjie Zhao;Qian Zhao
{"title":"Improved Exploration With Demonstrations in Procedurally-Generated Environments","authors":"Mao Xu;Shuzhi Sam Ge;Dongjie Zhao;Qian Zhao","doi":"10.1109/TG.2023.3299986","DOIUrl":null,"url":null,"abstract":"Exploring sparse reward environments remains a major challenge in model-free deep reinforcement learning (RL). State-of-the-art exploration methods address this challenge by utilizing intrinsic rewards to guide exploration in uncertain environment dynamics or novel states. However, these methods fall short in procedurally-generated environments, where the agent is unlikely to visit a state more than once due to the different environments generated in each episode. Recently, imitation-learning-based exploration methods have been proposed to guide exploration in different kinds of procedurally-generated environments by imitating high-quality exploration episodes. However, these methods have weaker exploration capabilities and lower sample efficiency in complex procedurally-generated environments. Motivated by the fact that demonstrations can guide exploration in sparse reward environments, we propose improved exploration with demonstrations (IEWD), an improved imitation-learning-based exploration method in procedurally-generated environments, which utilizes demonstrations from these environments. IEWD assigns different episode-level exploration scores to each demonstration episode and generated episode. IEWD then ranks these episodes based on their scores and stores highly-scored episodes into a small ranking buffer. IEWD treats these highly-scored episodes as good exploration episodes and makes the deep RL agent imitate exploration behaviors from the ranking buffer to reproduce exploration behaviors from good exploration episodes. Additionally, IEWD adopts the experience replay buffer to store generated positive episodes and demonstrations and employs self-imitating learning to utilize experiences from the experience replay buffer to optimize the policy of the deep RL agent. We evaluate our method IEWD on several procedurally-generated MiniGrid environments and 3-D maze environments from MiniWorld. The results show that IEWD significantly outperforms existing learning from demonstration methods and exploration methods, including state-of-the-art imitation-learning-based exploration methods, in terms of sample efficiency and final performance in complex procedurally-generated environments.","PeriodicalId":55977,"journal":{"name":"IEEE Transactions on Games","volume":null,"pages":null},"PeriodicalIF":1.7000,"publicationDate":"2023-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Games","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10197470/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Exploring sparse-reward environments remains a major challenge in model-free deep reinforcement learning (RL). State-of-the-art exploration methods address this challenge by using intrinsic rewards to guide exploration toward uncertain environment dynamics or novel states. However, these methods fall short in procedurally-generated environments, where the agent is unlikely to visit a state more than once because a different environment is generated in each episode. Recently, imitation-learning-based exploration methods have been proposed that guide exploration in various kinds of procedurally-generated environments by imitating high-quality exploration episodes. However, these methods have weaker exploration capabilities and lower sample efficiency in complex procedurally-generated environments. Motivated by the fact that demonstrations can guide exploration in sparse-reward environments, we propose improved exploration with demonstrations (IEWD), an improved imitation-learning-based exploration method for procedurally-generated environments that utilizes demonstrations from these environments. IEWD assigns an episode-level exploration score to each demonstration episode and each generated episode, ranks these episodes by their scores, and stores the highly-scored episodes in a small ranking buffer. Treating these highly-scored episodes as good exploration episodes, IEWD makes the deep RL agent imitate the exploration behaviors stored in the ranking buffer, thereby reproducing the behaviors of good exploration episodes. In addition, IEWD uses an experience replay buffer to store generated positive episodes and demonstrations, and employs self-imitation learning on this buffer to optimize the policy of the deep RL agent. We evaluate IEWD on several procedurally-generated MiniGrid environments and 3-D maze environments from MiniWorld. The results show that IEWD significantly outperforms existing learning-from-demonstration and exploration methods, including state-of-the-art imitation-learning-based exploration methods, in sample efficiency and final performance in complex procedurally-generated environments.
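The abstract outlines IEWD's main components: episode-level exploration scoring of demonstration and generated episodes, a small ranking buffer holding the highest-scored episodes, imitation of the exploration behaviors in that buffer, and self-imitation learning on an experience replay buffer of positive episodes and demonstrations. The sketch below shows one plausible way these pieces could fit together in a training loop; it is not the paper's implementation, and the helper names (`collect_episode`, `exploration_score`, `rl_loss`, `imitation_loss`, `self_imitation_loss`, `policy_update`), buffer sizes, and loss weights are all illustrative assumptions.

```python
import heapq
import itertools
import random

# Illustrative sketch of an IEWD-style training loop (not the paper's code).
# The caller supplies the environment, policy, demonstrations, and a `helpers`
# namespace providing collect_episode, exploration_score, total_reward,
# rl_loss, imitation_loss, self_imitation_loss, and policy_update.

_counter = itertools.count()  # tie-breaker so the heap never compares episodes


def train_iewd(policy, env, demonstrations, helpers,
               num_iterations=10_000,
               ranking_buffer_size=10,       # small buffer of top episodes (assumed size)
               replay_buffer_size=10_000):   # positive episodes + demonstrations (assumed size)
    ranking_buffer = []  # min-heap of (score, tie, episode)
    replay_buffer = []

    def add_to_ranking(episode):
        """Score an episode and keep it only if it ranks among the best so far."""
        item = (helpers.exploration_score(episode), next(_counter), episode)
        if len(ranking_buffer) < ranking_buffer_size:
            heapq.heappush(ranking_buffer, item)
        elif item[0] > ranking_buffer[0][0]:
            heapq.heapreplace(ranking_buffer, item)

    # Demonstrations are scored and ranked alongside generated episodes,
    # and also seed the experience replay buffer.
    for demo in demonstrations:
        add_to_ranking(demo)
        replay_buffer.append(demo)

    for _ in range(num_iterations):
        episode = helpers.collect_episode(policy, env)  # roll out the current policy
        add_to_ranking(episode)
        if helpers.total_reward(episode) > 0:           # store positive episodes only
            replay_buffer.append(episode)
            del replay_buffer[:-replay_buffer_size]     # drop the oldest entries

        # Combined update: standard RL loss, imitation of a highly-scored
        # exploration episode, and self-imitation on the replay buffer
        # (the 0.1 weights are assumptions, not the paper's values).
        good_episode = random.choice(ranking_buffer)[2]
        batch = random.sample(replay_buffer, min(32, len(replay_buffer)))
        loss = (helpers.rl_loss(policy, episode)
                + 0.1 * helpers.imitation_loss(policy, good_episode)
                + 0.1 * helpers.self_imitation_loss(policy, batch))
        helpers.policy_update(policy, loss)
```

The min-heap with a fixed capacity mirrors the abstract's "small ranking buffer": only episodes whose exploration score beats the current worst entry are retained, so the agent always imitates the best exploration behaviors seen so far.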
Source journal
IEEE Transactions on Games (Engineering: Electrical and Electronic Engineering)
CiteScore: 4.60 · Self-citation rate: 8.70% · Articles published per year: 87