Single-Loop Deep Actor-Critic for Constrained Reinforcement Learning With Provable Convergence

IEEE Transactions on Signal Processing, vol. 72, pp. 4871-4887. Publication date: 2024-09-16. DOI: 10.1109/TSP.2024.3461963. Impact Factor 4.6; JCR Q1 (Engineering, Electrical & Electronic); CAS Region 2 (Engineering & Technology). Full text: https://ieeexplore.ieee.org/document/10681174/
Kexuan Wang;An Liu;Baishuo Lin
{"title":"Single-Loop Deep Actor-Critic for Constrained Reinforcement Learning With Provable Convergence","authors":"Kexuan Wang;An Liu;Baishuo Lin","doi":"10.1109/TSP.2024.3461963","DOIUrl":null,"url":null,"abstract":"Deep actor-critic (DAC) algorithms, which combine actor-critic with deep neural network (DNN), have been among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, the existing DAC algorithms are still not mature to solve realistic problems with non-convex stochastic constraints and high cost to interact with the environment. In this paper, we propose a single-loop DAC (SLDAC) algorithmic framework for general constrained reinforcement learning problems. In the actor module, the constrained stochastic successive convex approximation (CSSCA) method is applied to better handle the non-convex stochastic objective and constraints. In the critic module, the critic DNNs are only updated once or a few finite times for each iteration, which simplifies the algorithm to a single-loop framework. Moreover, the variance of the policy gradient estimation is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and computational complexity. Despite the biased policy gradient estimation incurred by the single-loop design and observation reuse, we prove that the SLDAC with a feasible initial point can converge to a Karush-Kuhn-Tuker (KKT) point of the original problem almost surely. Simulations show that the SLDAC algorithm can achieve superior performance with much lower interaction cost.","PeriodicalId":13330,"journal":{"name":"IEEE Transactions on Signal Processing","volume":"72 ","pages":"4871-4887"},"PeriodicalIF":4.6000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10681174/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Deep actor-critic (DAC) algorithms, which combine the actor-critic method with deep neural networks (DNNs), have been among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, existing DAC algorithms are still not mature enough to solve realistic problems with non-convex stochastic constraints and a high cost of interacting with the environment. In this paper, we propose a single-loop DAC (SLDAC) algorithmic framework for general constrained reinforcement learning problems. In the actor module, the constrained stochastic successive convex approximation (CSSCA) method is applied to better handle the non-convex stochastic objective and constraints. In the critic module, the critic DNNs are updated only once, or a finite number of times, per iteration, which simplifies the algorithm to a single-loop framework. Moreover, the variance of the policy gradient estimate is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and the computational complexity. Despite the bias in the policy gradient estimate incurred by the single-loop design and observation reuse, we prove that SLDAC with a feasible initial point converges to a Karush-Kuhn-Tucker (KKT) point of the original problem almost surely. Simulations show that the SLDAC algorithm can achieve superior performance with a much lower interaction cost.
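In the standard form such constrained problems take (our notation for illustration; the paper defines its own objective and constraints via expected cumulative costs), the task is

    \min_{\theta} f_0(\theta) \quad \text{s.t.} \quad f_i(\theta) \le 0, \quad i = 1, \dots, m,

and a KKT point \theta^\star with multipliers \lambda_i^\star satisfies

    \nabla f_0(\theta^\star) + \sum_{i=1}^{m} \lambda_i^\star \nabla f_i(\theta^\star) = 0, \qquad \lambda_i^\star f_i(\theta^\star) = 0, \qquad \lambda_i^\star \ge 0, \qquad f_i(\theta^\star) \le 0.

To make the single-loop structure concrete, the following is a minimal runnable Python sketch on a toy constrained stochastic problem (not an MDP). Everything in it is an illustrative assumption rather than the authors' implementation: the quadratic objective and single linear constraint, the linear least-squares "critic" standing in for a DNN, the step-size exponents, and the closed-form solution of the surrogate subproblem (available here only because the sketch keeps one linearized constraint; the paper's CSSCA surrogates are more general).

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 4                        # toy policy-parameter dimension
    theta = np.zeros(dim)          # "actor" parameters
    w = np.zeros(dim)              # "critic" parameters (linear stand-in for a DNN)
    buffer = []                    # old observations kept for reuse

    # Toy stochastic problem (illustrative, not the paper's RL objective):
    #   minimize   f0(theta) = E_x [ 0.5 * ||theta - x||^2 ],  x ~ N(mu, I)
    #   subject to f1(theta) = a0 . theta - b <= 0  (a0 observed through noisy a)
    mu = np.array([1.0, -1.0, 0.5, 2.0])
    a0 = np.array([1.0, 1.0, 0.0, 1.0])
    b = 1.0

    # CSSCA-style running surrogate state (recursively averaged estimates).
    g0_bar = np.zeros(dim)         # averaged objective-gradient estimate
    g1_bar = np.zeros(dim)         # averaged constraint-gradient estimate
    c1_bar = 0.0                   # averaged constraint-value estimate
    tau = 1.0                      # proximal weight convexifying the surrogate

    for t in range(1, 501):
        rho = 1.0 / t ** 0.6       # surrogate-averaging step size (typical rule)
        gamma = 1.0 / t ** 0.7     # parameter-update step size (typical rule)

        # (i) Interact: draw a few NEW observations, then mix in REUSED old ones.
        #     In SLDAC the reused samples come from old policies, which is what
        #     makes the gradient estimate biased; here reuse is only structural.
        new = [(rng.normal(mu), rng.normal(a0, 0.1)) for _ in range(4)]
        buffer = (buffer + new)[-32:]
        xs = np.array([o[0] for o in buffer])
        as_ = np.array([o[1] for o in buffer])

        # (ii) Critic: ONE gradient step per iteration (the single-loop design).
        #      A linear least-squares fit of the per-sample cost; in SLDAC the
        #      critics' value estimates feed the actor, here it only illustrates
        #      the single update.
        cost = 0.5 * np.sum((theta - xs) ** 2, axis=1)
        w -= 0.01 * (xs @ w - cost) @ xs / len(buffer)

        # (iii) Actor: refresh the CSSCA surrogates by recursive averaging ...
        g0_hat = theta - xs.mean(axis=0)      # objective-gradient estimate
        g1_hat = as_.mean(axis=0)             # constraint-gradient estimate
        c1_hat = g1_hat @ theta - b           # constraint-value estimate
        g0_bar = (1 - rho) * g0_bar + rho * g0_hat
        g1_bar = (1 - rho) * g1_bar + rho * g1_hat
        c1_bar = (1 - rho) * c1_bar + rho * c1_hat

        # ... then solve the convex surrogate subproblem around theta:
        #   min_d  g0_bar . d + (tau/2) ||d||^2   s.t.  c1_bar + g1_bar . d <= 0
        # With one linearized constraint this has a closed-form KKT solution.
        d = -g0_bar / tau
        if c1_bar + g1_bar @ d > 0:           # constraint active: add multiplier
            lam = (tau * c1_bar - g1_bar @ g0_bar) / (g1_bar @ g1_bar)
            d = -(g0_bar + max(lam, 0.0) * g1_bar) / tau

        theta = theta + gamma * d             # smoothed step toward the minimizer

    print("theta:", np.round(theta, 3))
    print("constraint value a0.theta - b:", round(float(a0 @ theta - b), 3))

The smoothed update theta <- theta + gamma_t * d (rather than jumping straight to the surrogate minimizer), together with the recursive averaging of gradient and constraint estimates, is the mechanism CSSCA-type analyses rely on to tolerate noisy and biased estimates; per the abstract, this is what lets SLDAC absorb the bias from the single critic update and the reused old-policy observations.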
Source Journal
IEEE Transactions on Signal Processing (Engineering & Technology: Electrical & Electronic Engineering)
CiteScore: 11.20
Self-citation rate: 9.30%
Articles per year: 310
Review time: 3.0 months
Journal Description: The IEEE Transactions on Signal Processing covers novel theory, algorithms, performance analyses and applications of techniques for the processing, understanding, learning, retrieval, mining, and extraction of information from signals. The term “signal” includes, among others, audio, video, speech, image, communication, geophysical, sonar, radar, medical and musical signals. Examples of topics of interest include, but are not limited to, information processing and the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals.
Latest Articles in This Journal
Enhancing Missing Data Imputation of Non-stationary Oscillatory Signals with Harmonic Decomposition
Arithmetic vs Expected Mean of Probabilistic Asynchronous Affine Inference
Causal Influence in Federated Edge Inference
Penalized Likelihood Approach to Covariance Matrix Estimation from Data with Cell Outliers
Double Sparse Structure-enhanced mmWave NLOS Imaging under Multiangle Relay Surface