{"title":"基于深度强化学习的双 Q 网络偏离策略修正算法","authors":"Qingbo Zhang, Manlu Liu, Heng Wang, Weimin Qian, Xinglang Zhang","doi":"10.1049/csy2.12102","DOIUrl":null,"url":null,"abstract":"<p>A deep reinforcement learning (DRL) method based on the deep deterministic policy gradient (DDPG) algorithm is proposed to address the problems of a mismatch between the needed training samples and the actual training samples during the training of intelligence, the overestimation and underestimation of the existence of Q-values, and the insufficient dynamism of the intelligence policy exploration. This method introduces the Actor-Critic Off-Policy Correction (AC-Off-POC) reinforcement learning framework and an improved double Q-value learning method, which enables the value function network in the target task to provide a more accurate evaluation of the policy network and converge to the optimal policy more quickly and stably to obtain higher value returns. The method is applied to multiple MuJoCo tasks on the Open AI Gym simulation platform. The experimental results show that it is better than the DDPG algorithm based solely on the different policy correction framework (AC-Off-POC) and the conventional DRL algorithm. The value of returns and stability of the double-Q-network off-policy correction algorithm for the deep deterministic policy gradient (DCAOP-DDPG) proposed by the authors are significantly higher than those of other DRL algorithms.</p>","PeriodicalId":34110,"journal":{"name":"IET Cybersystems and Robotics","volume":"5 4","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/csy2.12102","citationCount":"0","resultStr":"{\"title\":\"Off-policy correction algorithm for double Q network based on deep reinforcement learning\",\"authors\":\"Qingbo Zhang, Manlu Liu, Heng Wang, Weimin Qian, Xinglang Zhang\",\"doi\":\"10.1049/csy2.12102\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>A deep reinforcement learning (DRL) method based on the deep deterministic policy gradient (DDPG) algorithm is proposed to address the problems of a mismatch between the needed training samples and the actual training samples during the training of intelligence, the overestimation and underestimation of the existence of Q-values, and the insufficient dynamism of the intelligence policy exploration. This method introduces the Actor-Critic Off-Policy Correction (AC-Off-POC) reinforcement learning framework and an improved double Q-value learning method, which enables the value function network in the target task to provide a more accurate evaluation of the policy network and converge to the optimal policy more quickly and stably to obtain higher value returns. The method is applied to multiple MuJoCo tasks on the Open AI Gym simulation platform. The experimental results show that it is better than the DDPG algorithm based solely on the different policy correction framework (AC-Off-POC) and the conventional DRL algorithm. 
The value of returns and stability of the double-Q-network off-policy correction algorithm for the deep deterministic policy gradient (DCAOP-DDPG) proposed by the authors are significantly higher than those of other DRL algorithms.</p>\",\"PeriodicalId\":34110,\"journal\":{\"name\":\"IET Cybersystems and Robotics\",\"volume\":\"5 4\",\"pages\":\"\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2023-12-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1049/csy2.12102\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IET Cybersystems and Robotics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1049/csy2.12102\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Cybersystems and Robotics","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/csy2.12102","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
A deep reinforcement learning (DRL) method based on the deep deterministic policy gradient (DDPG) algorithm is proposed to address three problems in agent training: the mismatch between the training samples the agent needs and those it actually receives, the overestimation and underestimation of Q-values, and insufficient dynamism in the agent's policy exploration. The method introduces the Actor-Critic Off-Policy Correction (AC-Off-POC) reinforcement learning framework together with an improved double Q-value learning method, which allows the value-function network in the target task to evaluate the policy network more accurately and to converge to the optimal policy more quickly and stably, yielding higher returns. The method is applied to multiple MuJoCo tasks on the OpenAI Gym simulation platform. The experimental results show that it outperforms both the DDPG algorithm based solely on the off-policy correction framework (AC-Off-POC) and conventional DRL algorithms: the returns and stability of the proposed double-Q-network off-policy correction algorithm for the deep deterministic policy gradient (DCAOP-DDPG) are significantly higher than those of other DRL algorithms.
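
As a concrete illustration of the double Q-value idea mentioned in the abstract, the sketch below computes a clipped double-Q target for a DDPG-style actor-critic: two target critics are evaluated at the target actor's action and the smaller value is used to bootstrap, the standard way to curb Q-value overestimation. This is a minimal sketch under assumed choices (PyTorch as the framework, hypothetical names such as critic1 and target_actor, placeholder network sizes and batch data); it is not the authors' exact DCAOP-DDPG update, which additionally applies the AC-Off-POC correction.

# Minimal PyTorch sketch of a clipped double-Q critic target for a
# DDPG-style agent. Illustrative assumption only, not the authors'
# DCAOP-DDPG method; all names and sizes here are hypothetical.
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 17, 6, 0.99  # placeholder MuJoCo-like sizes

def make_critic():
    # Q(s, a): takes a concatenated state-action vector, outputs a scalar value.
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
        nn.Linear(256, 1),
    )

critic1, critic2 = make_critic(), make_critic()                # online critics
target_critic1, target_critic2 = make_critic(), make_critic()  # slowly updated copies
target_actor = nn.Sequential(
    nn.Linear(state_dim, 256), nn.ReLU(),
    nn.Linear(256, action_dim), nn.Tanh(),
)

# One transition batch, here filled with random placeholders instead of
# samples from a real replay buffer.
batch = 64
state = torch.randn(batch, state_dim)
action = torch.randn(batch, action_dim)
reward = torch.randn(batch, 1)
next_state = torch.randn(batch, state_dim)
done = torch.zeros(batch, 1)

with torch.no_grad():
    next_action = target_actor(next_state)
    sa_next = torch.cat([next_state, next_action], dim=1)
    # Taking the element-wise minimum of the two target critics limits
    # overestimation of the bootstrapped Q-value.
    target_q = torch.min(target_critic1(sa_next), target_critic2(sa_next))
    y = reward + gamma * (1.0 - done) * target_q

# Both online critics regress toward the same clipped target.
sa = torch.cat([state, action], dim=1)
critic_loss = (nn.functional.mse_loss(critic1(sa), y)
               + nn.functional.mse_loss(critic2(sa), y))

In a full training loop, critic_loss would be minimised with an optimiser, the actor would be updated by ascending critic1's value of its own actions, and the target networks would track the online networks with a soft update.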