Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning

IEEE Transactions on Artificial Intelligence, Pub Date: 2024-06-13, DOI: 10.1109/TAI.2024.3413692

Zijian Gao;Kele Xu;Yuanzhao Zhai;Bo Ding;Dawei Feng;Xinjun Mao;Huaimin Wang

Vol. 5, No. 11, pp. 5530-5539. Journal Article. https://ieeexplore.ieee.org/document/10557253/

Citations: 0

Abstract

In sparse extrinsic reward settings, reinforcement learning remains a challenge despite increasing interest in this field. Existing approaches suggest that intrinsic rewards can alleviate issues caused by reward sparsity. However, many studies overlook the critical role of temporal information, which is essential for human curiosity. This article introduces a novel intrinsic reward mechanism inspired by human learning processes, where curiosity is evaluated by comparing current observations with historical knowledge. Our method involves training a self-supervised prediction model, periodically saving snapshots of the model parameters, and employing the nuclear norm to assess the temporal inconsistency between predictions from different snapshots as intrinsic rewards. Additionally, we propose a variational weighting mechanism to adaptively assign weights to the snapshots, enhancing the model's robustness and performance. Experimental results across various benchmark environments demonstrate the efficacy of our approach, which outperforms other state-of-the-art methods without incurring additional training costs and exhibits higher noise tolerance. Our findings indicate that leveraging temporal information in intrinsic rewards can significantly improve exploration performance, motivating future research to develop more robust and accurate reward systems for reinforcement learning.
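The abstract does not state the exact reward formula, so the following is only a minimal sketch of how a nuclear-norm disagreement score across snapshot predictions could be computed, not the authors' implementation. The function name `temporal_inconsistency_reward`, its parameters, and the centering step are all assumptions made for illustration. The optional fixed `weights` are a plain stand-in for the variational weighting mechanism, which is not implemented here.

```python
import numpy as np


def temporal_inconsistency_reward(snapshot_predictions, weights=None):
    """Illustrative intrinsic reward (hypothetical helper, not from the paper).

    snapshot_predictions: array-like of shape (K, d), one prediction vector
        per saved model snapshot for the same input.
    weights: optional array-like of shape (K,), fixed per-snapshot weights
        (an assumed stand-in for the paper's variational weighting).
    """
    preds = np.asarray(snapshot_predictions, dtype=float)
    if preds.ndim != 2:
        raise ValueError("expected a (K, d) matrix of snapshot predictions")

    # Assumption: center across snapshots so that identical predictions
    # (no temporal inconsistency) yield a reward of exactly zero.
    centered = preds - preds.mean(axis=0, keepdims=True)

    if weights is not None:
        w = np.asarray(weights, dtype=float)
        if w.shape != (preds.shape[0],):
            raise ValueError("weights must have one entry per snapshot")
        centered = centered * w[:, None]

    # Nuclear norm = sum of singular values of the (K, d) matrix.
    return float(np.linalg.norm(centered, ord="nuc"))


if __name__ == "__main__":
    agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    disagree = [[1.0, 0.0], [-1.0, 0.0]]
    print(temporal_inconsistency_reward(agree))
    print(temporal_inconsistency_reward(disagree))
```

In this sketch, snapshots that agree on a prediction produce no reward, while snapshots that disagree produce a positive reward that grows with the spread of their predictions. How the paper normalizes or combines this score with extrinsic rewards is not described in the abstract.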
