Proximal Policy Optimization (PPO) is widely adopted in reinforcement learning but suffers from two fundamental limitations: its static entropy coefficient fails to adapt to evolving exploration–exploitation requirements, and its hard clipping mechanism introduces gradient discontinuities that can destabilize policy updates. This paper proposes Dynamic Proximal Policy Optimization (DPPO), which systematically addresses these challenges through adaptive entropy regulation and smooth policy ratio clipping. DPPO introduces two key innovations: (1) a dynamic entropy coefficient adjustment mechanism that modulates exploration based on training performance, implemented via two strategies—Surrogate Loss-Based Entropy Adjustment (SLEA) for epoch-level stability and Batch-Wise Entropy Adjustment (BWEA) for fine-grained responsiveness; (2) a smooth clipping function combining Taylor expansion with piecewise exponential decay, ensuring continuity and eliminating gradient discontinuities. Extensive experiments on six continuous control tasks in the PyBullet environment demonstrate that DPPO consistently outperforms PPO and three state-of-the-art baselines (PPO-, TrulyPPO, ESPO), achieving higher sample efficiency, faster convergence, and improved stability. SLEA excels in high-dimensional tasks requiring robust exploration strategies, while BWEA achieves faster convergence in lower-complexity environments. Ablation studies confirm the individual contributions of both mechanisms, highlighting DPPO’s potential for diverse reinforcement learning applications.
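The abstract's two mechanisms can be illustrated with a minimal sketch. The exact Taylor-expansion/exponential-decay form of DPPO's smooth clip and the precise SLEA update rule are not given here, so the functions below (`smooth_clip`, `adjust_entropy_coef`, and all their parameters) are hypothetical stand-ins that capture the stated ideas: a C¹-continuous alternative to hard ratio clipping, and an entropy coefficient that responds to training progress.

```python
import math

def smooth_clip(ratio, eps=0.2, tau=0.05):
    """Illustrative C1-continuous alternative to PPO's hard clip.

    Inside [1 - eps, 1 + eps] the ratio passes through unchanged; outside,
    the output decays exponentially toward a finite bound, so the gradient
    shrinks smoothly instead of dropping to zero at the clip boundary.
    (Assumed form; the paper's exact piecewise function is not shown here.)
    """
    lo, hi = 1.0 - eps, 1.0 + eps
    if ratio > hi:
        # matches value hi and slope 1 at the boundary; asymptote hi + tau
        return hi + tau * (1.0 - math.exp(-(ratio - hi) / tau))
    if ratio < lo:
        return lo - tau * (1.0 - math.exp(-(lo - ratio) / tau))
    return ratio

def adjust_entropy_coef(coef, prev_loss, curr_loss,
                        step=1.1, floor=1e-4, ceil=0.05):
    """Hypothetical surrogate-loss-based adjustment in the spirit of SLEA:
    raise the entropy coefficient when the surrogate loss stops improving
    (encourage exploration), anneal it when training is progressing.
    """
    if curr_loss >= prev_loss:          # no improvement -> explore more
        return min(coef * step, ceil)
    return max(coef / step, floor)      # improving -> reduce exploration
```

A batch-wise variant (BWEA-style) would simply call `adjust_entropy_coef` after every minibatch rather than once per epoch, trading stability for responsiveness, as the abstract describes.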
