Advantage policy update based on proximal policy optimization

Third International Seminar on Artificial Intelligence, Networking, and Information Technology. Pub Date: 2023-02-22. DOI: 10.1117/12.2667235

Zilin Zeng, Junwei Wang, Zhigang Hu, Dongnan Su, Peng Shang

Citations: 0

Abstract

This paper proposes Advantageous Update Policy Proximal Policy Optimization (AUP-PPO), a policy network update approach based on Proximal Policy Optimization (PPO) that aims to alleviate the over-fitting caused by sharing layers between the policy and value functions. AUP-PPO extends the sample-efficient reinforcement learning method PPO in the form that learns the policy and value functions with separate networks, so that their optimization is decoupled. It uses the value function to calculate the advantage and updates the policy with the loss between the current and target advantage functions as a penalty term, in place of the value function. Evaluated on multiple benchmark control tasks in OpenAI Gym, AUP-PPO generalizes better to the environment, converges faster, and is more robust than the original PPO.
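The abstract does not give the loss in closed form, so the following is only a minimal sketch of how such an update could be assembled, not the authors' implementation. It combines the standard PPO clipped surrogate with a penalty on the gap between a current and a target advantage estimate. The mean-squared-error form of the penalty, the `penalty_coef` weight, and all function names are assumptions made for illustration.

```python
from typing import Sequence


def clipped_surrogate(ratios: Sequence[float],
                      advantages: Sequence[float],
                      clip_eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate objective, averaged over samples."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        # Clip the probability ratio to [1 - eps, 1 + eps].
        clipped_r = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the two surrogate terms.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


def advantage_penalty(current_adv: Sequence[float],
                      target_adv: Sequence[float]) -> float:
    """Hypothetical penalty: mean squared error between the current and
    target advantage estimates (the exact form is an assumption)."""
    squared = [(c - t) ** 2 for c, t in zip(current_adv, target_adv)]
    return sum(squared) / len(squared)


def policy_loss(ratios: Sequence[float],
                current_adv: Sequence[float],
                target_adv: Sequence[float],
                clip_eps: float = 0.2,
                penalty_coef: float = 0.5) -> float:
    """Loss to minimize: negative clipped surrogate plus the weighted
    advantage penalty. penalty_coef is an assumed hyperparameter."""
    surrogate = clipped_surrogate(ratios, target_adv, clip_eps)
    return -surrogate + penalty_coef * advantage_penalty(current_adv, target_adv)


if __name__ == "__main__":
    # Two toy samples: one unchanged ratio and one that exceeds the clip range.
    loss = policy_loss(ratios=[1.0, 1.5],
                       current_adv=[1.0, 0.0],
                       target_adv=[1.0, 1.0])
    print(f"policy loss: {loss:.4f}")
```

In this sketch the penalty term occupies the slot that a value-function loss would occupy in a shared-network PPO objective, which is one possible reading of "as a penalty term" in the abstract.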
