Visionary Policy Iteration for Continuous Control

Impact factor: 8.7; Zone 1 (Computer Science); JCR Q1 (AUTOMATION & CONTROL SYSTEMS); IEEE Transactions on Systems Man Cybernetics-Systems; Pub Date: 2025-01-22; DOI: 10.1109/TSMC.2025.3525473

Botao Dong;Longyang Huang;Xiwen Ma;Hongtian Chen;Weidong Zhang

IEEE Transactions on Systems Man Cybernetics-Systems, vol. 55, no. 4, pp. 2707-2720. Journal Article. https://ieeexplore.ieee.org/document/10849991/

Citations: 0

Abstract

In this article, a novel visionary policy iteration (VPI) framework is proposed to address continuous-action reinforcement learning (RL) tasks. In VPI, a visionary Q-function is constructed by incorporating the successor state into the standard Q-function. Because it includes the successor state, the visionary Q-function captures information about state transitions within the Markov decision process (MDP), providing a forward-looking perspective that enables a more accurate and foresighted evaluation of potential action outcomes. The relationship between the visionary Q-function and the standard Q-function is analyzed. Both the policy evaluation and policy improvement rules in VPI are then designed based on the visionary Q-function. A convergence proof for VPI is provided, showing that the iterative policy sequence in VPI converges to the optimal policy. By combining the VPI framework with the twin delayed deep deterministic policy gradient (TD3) algorithm, a visionary TD3 (VTD3) algorithm is developed. VTD3 is evaluated on multiple continuous-action control tasks from the MuJoCo and OpenAI Gym platforms. Comparative experiments demonstrate that VTD3 achieves more competitive performance than other state-of-the-art (SOTA) RL approaches. Additionally, the experimental results indicate that VPI enhances decision-making capability, reduces Q-function estimation bias, and improves sample efficiency, thereby boosting the performance of existing RL algorithms.
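For orientation, the sketch below shows the critic target used by the standard TD3 baseline that VTD3 builds on: target policy smoothing followed by the minimum over two target critics. It is not the VPI or VTD3 update; the abstract does not give the visionary Q-function's definition, so none is assumed here. The function name `td3_target`, the callables `target_actor`, `target_q1`, `target_q2`, and the default hyperparameters are illustrative assumptions.

```python
import numpy as np


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=None):
    """Standard TD3 critic target (baseline only, not the VPI update).

    target_actor(next_state) -> actions of shape (batch, action_dim)
    target_q1 / target_q2 (next_state, action) -> values of shape (batch,)
    All three callables are hypothetical stand-ins for target networks.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Target policy smoothing: perturb the target action with clipped noise.
    next_action = target_actor(next_state)
    noise = rng.normal(0.0, noise_std, size=next_action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    next_action = np.clip(next_action + noise, -max_action, max_action)

    # Clipped double Q-learning: take the smaller of the two target critics
    # to reduce overestimation bias.
    q_min = np.minimum(target_q1(next_state, next_action),
                       target_q2(next_state, next_action))

    # One-step bootstrapped target; terminal transitions do not bootstrap.
    return reward + gamma * (1.0 - done) * q_min
```

Taking the minimum of two critics is how TD3 limits overestimation; the abstract reports that VPI further reduces Q-function estimation bias, but how it does so is described in the full article rather than here.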


Source journal

IEEE Transactions on Systems Man Cybernetics-Systems (AUTOMATION & CONTROL SYSTEMS; COMPUTER SCIENCE, CYBERNETICS)

CiteScore: 18.50

Self-citation rate: 11.50%

Articles published: 812

Review time: 6 months

About the journal: The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.

Latest articles from this journal

Introducing IEEE Collabratec

IEEE Systems, Man, and Cybernetics Society Information

TechRxiv: Share Your Preprint Research With the World!

Reinforcement Learning-Based Optimized Adaptive Secure Control for Constrained Fractional-Order Nonlinear Systems Under FDI Attacks

Learning Multilayer Feature Projection for Homogeneous and Heterogeneous Palmprint Recognition