Reward Shaping Using Convolutional Neural Network

Inf. Sci. | Pub Date: 2022-10-30 | DOI: 10.48550/arXiv.2210.16956

Hani Sami, H. Otrok, J. Bentahar, A. Mourad, E. Damiani


Citations: 0


Reward Shaping Using Convolutional Neural Network

In this paper, we propose the Value Iteration Network for Reward Shaping (VIN-RS), a potential-based reward shaping mechanism that uses a Convolutional Neural Network (CNN). VIN-RS embeds a CNN trained on labels computed with the message passing mechanism of the Hidden Markov Model. The CNN processes images or graphs of the environment to predict the shaping values. Recent work on reward shaping is still limited in its ability to train on a representation of the Markov Decision Process (MDP) and to build an estimate of the transition matrix. The advantage of VIN-RS is that it constructs an effective potential function from an estimated MDP while automatically inferring the environment transition matrix. VIN-RS estimates the transition matrix through a self-learned convolution filter while extracting environment details from the input frames or sampled graphs. Because of (1) the previous success of message passing for reward shaping and (2) the planning behavior of the CNN, we use these messages to train the CNN of VIN-RS. Experiments are performed on tabular games, Atari 2600, and MuJoCo, covering discrete and continuous action spaces. Our results show promising improvements in learning speed and maximum cumulative reward compared to the state of the art.
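
To make the idea of potential-based reward shaping concrete, here is a minimal sketch using the standard shaping term F(s, s') = γΦ(s') − Φ(s). It is not the authors' VIN-RS implementation: the potential Φ is computed by plain value iteration on a hypothetical 4x4 deterministic grid world, which stands in for the values a trained CNN would predict. The grid, the goal position, the discount factor, and all function names are assumptions made for illustration only.

```python
GAMMA = 0.9                                 # assumed discount factor
SIZE = 4                                    # hypothetical 4x4 grid world
GOAL = (3, 3)                               # hypothetical goal cell
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def neighbors(state):
    """Return the cells reachable in one move; off-grid moves are dropped."""
    r, c = state
    result = []
    for dr, dc in MOVES:
        nr, nc = r + dr, c + dc
        if 0 <= nr < SIZE and 0 <= nc < SIZE:
            result.append((nr, nc))
    return result


def value_iteration(n_iters=50):
    """Compute a potential Phi(s) as the optimal value of each cell.

    The environment pays reward 1 for entering GOAL and 0 otherwise.
    GOAL is terminal, so its value stays at 0.
    """
    phi = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    for _ in range(n_iters):
        new_phi = {}
        for s in phi:
            if s == GOAL:
                new_phi[s] = 0.0
                continue
            new_phi[s] = max(
                (1.0 if n == GOAL else 0.0) + GAMMA * phi[n]
                for n in neighbors(s)
            )
        phi = new_phi
    return phi


def shaped_reward(reward, s, s_next, phi, gamma=GAMMA):
    """Add the potential-based shaping term gamma*Phi(s') - Phi(s)."""
    return reward + gamma * phi[s_next] - phi[s]


if __name__ == "__main__":
    phi = value_iteration()
    # A step toward the goal versus a step away from it.
    print(shaped_reward(0.0, (0, 0), (0, 1), phi))
    print(shaped_reward(0.0, (0, 1), (0, 0), phi))
```

Because Φ here equals the optimal value function of the toy grid, a step along a shortest path receives a shaped reward of about zero, while a step away from the goal receives a negative one. In VIN-RS as described in the abstract, the potential would instead come from the CNN's predictions over input frames or sampled graphs.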


Source journal

Inf. Sci.

Self-citation rate

0.00%

Publication count