Vision-driven autonomous river following by Unmanned Aerial Vehicles (UAVs) is critical for applications such as rescue, surveillance, and environmental monitoring, particularly in dense riverine environments where GPS signals are unreliable. These safety-critical navigation tasks must satisfy hard safety constraints while optimizing performance. Moreover, the reward in river following is inherently history-dependent (non-Markovian), since it depends on which river segments have already been visited, making the task challenging for standard safe Reinforcement Learning (SafeRL). To address these gaps, we cast river following as a coverage control problem with a submodular reward that exhibits diminishing returns as more river segments are visited, framing the task as a Submodular Markov Decision Process. Building on the SafeRL paradigm and the First Order Constrained Optimization in Policy Space (FOCOPS) algorithm, we make three contributions.
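For concreteness, one standard way to write such a coverage reward (a generic sketch; the symbols $f$, $V_t$, and $s_t$ are illustrative rather than the paper's exact notation) is to let $f$ be a coverage function over the set $V_{t-1}$ of river segments visited so far and to pay the marginal gain of the newly reached segment $s_t$:
\[
r_t = f\bigl(V_{t-1} \cup \{s_t\}\bigr) - f\bigl(V_{t-1}\bigr),
\qquad
f(A \cup \{s\}) - f(A) \;\ge\; f(B \cup \{s\}) - f(B) \quad \text{for all } A \subseteq B,
\]
where the second inequality is the submodularity (diminishing-returns) property: revisiting already covered segments yields little or no additional reward.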
First, we introduce Marginal Gain Advantage Estimation (MGAE), which refines the reward advantage using a sliding-window baseline computed from historical episodic returns, aligning the advantage estimate with the non-Markovian reward structure. Second, we develop a Semantic Dynamics Model (SDM) based on patchified water semantic masks, offering more interpretable and data-efficient short-term prediction of future observations than latent vision dynamics models. Third, we present the Constrained Actor Dynamics Estimator (CADE) architecture, which integrates the actor, the cost estimator, and the SDM for cost advantage estimation, forming a model-based SafeRL framework capable of solving partially observable Constrained Submodular Markov Decision Processes.
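As a rough illustration of the sliding-window baseline behind MGAE (a minimal sketch with hypothetical names such as `mgae_advantages`; the paper's actual windowing and per-step weighting may differ), the episodic return can be compared against the mean return of recent episodes instead of a learned critic:

```python
from collections import deque
import numpy as np

def mgae_advantages(episode_rewards, return_history, window=20):
    """Sketch of a sliding-window baseline for advantage estimation.

    episode_rewards: list of per-step (marginal-gain) rewards for one episode.
    return_history: deque of episodic returns from previous episodes.
    window: number of past episodes used to form the baseline.
    """
    episodic_return = float(np.sum(episode_rewards))
    # Baseline is the mean return over recent episodes, replacing a learned
    # critic so the estimate does not assume a Markovian reward.
    recent = list(return_history)[-window:]
    baseline = float(np.mean(recent)) if recent else 0.0
    advantage = episodic_return - baseline
    return_history.append(episodic_return)
    # Broadcast the episode-level advantage to every step of the episode.
    return [advantage] * len(episode_rewards)

# Usage: keep one deque across training and call once per finished episode.
history = deque(maxlen=100)
adv = mgae_advantages([0.3, 0.1, 0.0, 0.2], history, window=20)
```

A history-based baseline of this kind avoids fitting a Markovian value function to a reward that depends on which segments have already been visited.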
Simulation results demonstrate that MGAE achieves faster convergence and superior performance compared to critic-based methods such as Generalized Advantage Estimation (GAE). The SDM provides more accurate short-term state predictions, enabling the cost estimator to better anticipate potential constraint violations. Overall, CADE effectively integrates safety regulation into model-based RL: the Lagrangian approach provides a “soft” balance between reward and safety during training, while the safety layer enforces a “hard” action overlay at inference time. Our code is publicly available on GitHub (https://github.com/EdisonPricehan/omnisafe-cade/tree/cade).
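The “soft” trade-off referred to above is the usual Lagrangian relaxation of the constrained objective (shown here in generic form; $J_r$, $J_c$, $d$, and $\lambda$ follow standard SafeRL notation rather than the paper's exact formulation):
\[
\max_{\pi} \; \min_{\lambda \ge 0} \; J_r(\pi) - \lambda \bigl( J_c(\pi) - d \bigr),
\]
where $J_r(\pi)$ and $J_c(\pi)$ are the expected return and expected cost, $d$ is the cost budget, and $\lambda$ is increased when the estimated cost exceeds $d$ and decreased otherwise; the “hard” safety layer instead overrides individual actions at inference when a violation is predicted.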
