Improved Exploration With Demonstrations in Procedurally-Generated Environments

IEEE Transactions on Games · Published: 2023-07-31 · DOI: 10.1109/TG.2023.3299986
Impact Factor: 1.7 · JCR Q3 (Computer Science, Artificial Intelligence) · CAS Tier 4 (Computer Science)
Volume 16, Issue 3, pp. 530-543 · https://ieeexplore.ieee.org/document/10197470/
Authors: Mao Xu; Shuzhi Sam Ge; Dongjie Zhao; Qian Zhao
Citations: 0

Abstract

Exploring sparse reward environments remains a major challenge in model-free deep reinforcement learning (RL). State-of-the-art exploration methods address this challenge by using intrinsic rewards to guide exploration toward states with uncertain dynamics or high novelty. However, these methods fall short in procedurally-generated environments, where the agent is unlikely to visit a state more than once because a different environment is generated in each episode. Recently, imitation-learning-based exploration methods have been proposed to guide exploration across different kinds of procedurally-generated environments by imitating high-quality exploration episodes. However, these methods exhibit weak exploration capability and low sample efficiency in complex procedurally-generated environments. Motivated by the fact that demonstrations can guide exploration in sparse reward environments, we propose improved exploration with demonstrations (IEWD), an improved imitation-learning-based exploration method for procedurally-generated environments that utilizes demonstrations from these environments. IEWD assigns an episode-level exploration score to each demonstration episode and each generated episode. IEWD then ranks these episodes by score and stores the highly-scored episodes in a small ranking buffer. IEWD treats these highly-scored episodes as good exploration episodes and makes the deep RL agent imitate exploration behaviors stored in the ranking buffer, thereby reproducing the behaviors of good exploration episodes. Additionally, IEWD uses an experience replay buffer to store generated positive episodes and demonstrations, and employs self-imitation learning to exploit experiences from this buffer to optimize the policy of the deep RL agent. We evaluate IEWD on several procedurally-generated MiniGrid environments and 3-D maze environments from MiniWorld. The results show that IEWD significantly outperforms existing learning-from-demonstration and exploration methods, including state-of-the-art imitation-learning-based exploration methods, in terms of sample efficiency and final performance in complex procedurally-generated environments.
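The abstract describes scoring episodes, keeping the highest-scored ones in a small ranking buffer, and imitating behaviors drawn from that buffer. Below is a minimal sketch of that ranking-buffer idea, assuming a fixed-capacity top-K buffer and a placeholder novelty-count score; the names (RankingBuffer, exploration_score) and the scoring rule are illustrative assumptions, since the abstract does not specify the authors' actual exploration score or implementation.

```python
import heapq
import random
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(order=True)
class ScoredEpisode:
    score: float
    # Transitions are excluded from ordering comparisons.
    transitions: List[Tuple] = field(compare=False)

class RankingBuffer:
    """Small buffer that keeps only the highest-scored exploration episodes."""

    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self._heap: List[ScoredEpisode] = []  # min-heap keyed on score

    def add(self, score: float, transitions: List[Tuple]) -> None:
        episode = ScoredEpisode(score, transitions)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, episode)
        elif score > self._heap[0].score:
            # Evict the current lowest-scored episode.
            heapq.heapreplace(self._heap, episode)

    def sample(self) -> ScoredEpisode:
        # Uniformly sample one highly-scored episode for the agent to imitate.
        return random.choice(self._heap)

# Hypothetical episode-level exploration score: the number of distinct
# observations visited in the episode (a stand-in only; the paper's actual
# scoring rule is not given in the abstract).
def exploration_score(transitions: List[Tuple]) -> float:
    return float(len({obs for obs, _, _, _ in transitions}))

# Usage sketch: demonstration episodes and generated episodes are scored
# alike, the best are retained, and imitation targets are drawn from the buffer.
buffer = RankingBuffer(capacity=16)
episode = [("obs_a", 0, 0.0, "obs_b"), ("obs_b", 1, 1.0, "obs_c")]
buffer.add(exploration_score(episode), episode)
best = buffer.sample()
```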
Source journal: IEEE Transactions on Games (Engineering, Electrical and Electronic Engineering)
CiteScore: 4.60
Self-citation rate: 8.70%
Annual publications: 87