Phasic parallel-network policy: a deep reinforcement learning framework based on action correlation

IF 2.8 | Zone 3, Computer Science | Q2, COMPUTER SCIENCE, THEORY & METHODS | Computing | Pub Date: 2024-08-06 | DOI: 10.1007/s00607-024-01329-3

Jiahao Li, Tianhan Gao, Qingwei Mi


Citations: 0

Abstract

Reinforcement learning algorithms vary significantly in performance across environments. Because this instability and unpredictability have consistently hindered their generalization, optimizing reinforcement learning has become a major research task. In this study, we address the issue by optimizing the algorithm itself rather than applying environment-specific optimizations. We start by tackling the uncertainty caused by mutual interference among the original actions, aiming to enhance overall performance. We present the Phasic Parallel-Network Policy (PPP), a deep reinforcement learning framework that diverges from the traditional actor-critic policy method by grouping the action space based on action correlations. PPP incorporates parallel network structures and combines network optimization strategies. With the assistance of the value network, training is divided into distinct stages, namely the Extra-group Policy Phase and the Inter-group Optimization Phase. PPP thus breaks from the traditional unit learning structure. Experimental results indicate that it not only improves training effectiveness but also reduces the number of training steps, enhances sample efficiency, and significantly improves stability and generalization.
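The abstract does not say how action correlations are measured or how groups are formed. As a rough illustration of the general idea only, the sketch below assumes that logged action samples are available, measures the absolute Pearson correlation between action dimensions, and merges dimensions whose correlation reaches a threshold. The function name `group_actions_by_correlation`, the `threshold` parameter, and the connected-components grouping rule are all hypothetical choices made for this example, not the authors' procedure.

```python
import numpy as np


def group_actions_by_correlation(actions, threshold=0.5):
    """Group action dimensions by absolute Pearson correlation.

    Illustrative assumption, not the PPP paper's method.
    actions: array of shape (num_samples, num_action_dims).
    Returns a sorted list of groups, each a list of dimension indices.
    """
    # Pairwise absolute correlation between columns (action dimensions).
    corr = np.abs(np.corrcoef(actions, rowvar=False))
    n = corr.shape[0]

    # Union-find: dimensions linked by a strong correlation share a root.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] >= threshold:
                parent[find(i)] = find(j)

    # Collect dimensions by root to obtain the groups.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Under this assumption, each resulting group could be assigned its own parallel policy network. How PPP actually builds and trains those networks is described in the full paper, not in the abstract.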




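Source journal

Computing (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 8.20

Self-citation rate: 2.70%

Articles published: 107

Review time: 3 months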

Journal description: Computing publishes original papers, short communications, and surveys on all fields of computing. Contributions should be written in English and may be theoretical or applied in nature; the essential criteria are computational relevance and a systematic foundation of results.