Spiking Variational Policy Gradient for Brain Inspired Reinforcement Learning

Zhile Yang, Shangqi Guo, Ying Fang, Zhaofei Yu, Jian K. Liu
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 1975-1990. Published online: 2024-12-09. DOI: 10.1109/TPAMI.2024.3511936. Available at: https://ieeexplore.ieee.org/document/10786920/

Abstract

Recent studies in reinforcement learning have explored brain-inspired function approximators and learning algorithms to simulate brain intelligence and adapt to neuromorphic hardware. Among these approaches, reward-modulated spike-timing-dependent plasticity (R-STDP) is biologically plausible and energy-efficient, but suffers from a gap between its local learning rules and the global learning objective, which limits its performance and applicability. In this paper, we design a recurrent winner-take-all network and propose the spiking variational policy gradient (SVPG), a new R-STDP learning method derived theoretically from the global policy gradient. Specifically, policy inference is derived from an energy-based policy function using mean-field inference, and policy optimization is based on a last-step approximation of the global policy gradient. Together, these bridge the gap between the local learning rules and the global objective. In experiments including a challenging ViZDoom vision-based navigation task and two realistic robot control tasks, SVPG successfully solves all the tasks. In addition, SVPG exhibits better inherent robustness to various kinds of input, network-parameter, and environmental perturbations than the compared methods.
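To make the R-STDP idea the abstract builds on concrete, the following is a minimal sketch of a generic reward-modulated STDP update with an exponentially decaying eligibility trace. All variable names, spike statistics, and the exact trace dynamics here are illustrative assumptions for a textbook-style R-STDP rule, not the paper's SVPG derivation or its winner-take-all network.

```python
import numpy as np

# Hedged sketch of generic reward-modulated STDP (R-STDP):
# an STDP term accumulates into an eligibility trace, and a scalar
# reward gates the actual weight change at the end of the episode.
rng = np.random.default_rng(0)

n_pre, n_post, T = 4, 3, 100
w = rng.normal(0.0, 0.1, size=(n_post, n_pre))  # synaptic weights w_ij
elig = np.zeros_like(w)                         # eligibility trace e_ij

tau = 20.0                                      # trace time constant (steps)
a_plus, a_minus, lr = 1.0, 1.0, 0.01
decay = np.exp(-1.0 / tau)
pre_trace = np.zeros(n_pre)                     # low-pass filtered pre spikes
post_trace = np.zeros(n_post)                   # low-pass filtered post spikes

for t in range(T):
    pre = (rng.random(n_pre) < 0.1).astype(float)    # Poisson-like pre spikes
    post = (rng.random(n_post) < 0.1).astype(float)  # post spikes
    pre_trace = pre_trace * decay + pre
    post_trace = post_trace * decay + post
    # STDP term: potentiate post-after-pre pairs, depress pre-after-post pairs
    stdp = a_plus * np.outer(post, pre_trace) - a_minus * np.outer(post_trace, pre)
    elig = elig * decay + stdp

reward = 1.0                 # scalar reward modulates the accumulated trace
w += lr * reward * elig      # local update, gated by a global reward signal
```

The key property this sketch illustrates is locality: each synapse updates from its own pre/post spike traces plus a single broadcast reward signal, which is exactly where the local-rule versus global-objective gap discussed in the abstract arises.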