Safe Reinforcement Learning with Dual Robustness

Zeyang Li, Chuxiong Hu, Yunan Wang, Yujie Yang, Shengbo Eben Li
{"title":"Safe Reinforcement Learning with Dual Robustness.","authors":"Zeyang Li, Chuxiong Hu, Yunan Wang, Yujie Yang, Shengbo Eben Li","doi":"10.1109/TPAMI.2024.3443916","DOIUrl":null,"url":null,"abstract":"<p><p>Reinforcement learning (RL) agents are vulnerable to adversarial disturbances, which can deteriorate task performance or break down safety specifications. Existing methods either address safety requirements under the assumption of no adversary (e.g., safe RL) or only focus on robustness against performance adversaries (e.g., robust RL). Learning one policy that is both safe and robust under any adversaries remains a challenging open problem. The difficulty is how to tackle two intertwined aspects in the worst cases: feasibility and optimality. The optimality is only valid inside a feasible region (i.e., robust invariant set), while the identification of maximal feasible region must rely on how to learn the optimal policy. To address this issue, we propose a systematic framework to unify safe RL and robust RL, including the problem formulation, iteration scheme, convergence analysis and practical algorithm design. The unification is built upon constrained two-player zero-sum Markov games, in which the objective for protagonist is twofold. For states inside the maximal robust invariant set, the goal is to pursue rewards under the condition of guaranteed safety; for states outside the maximal robust invariant set, the goal is to reduce the extent of constraint violation. A dual policy iteration scheme is proposed, which simultaneously optimizes a task policy and a safety policy. We prove that the iteration scheme converges to the optimal task policy which maximizes the twofold objective in the worst cases, and the optimal safety policy which stays as far away from the safety boundary. The convergence of safety policy is established by exploiting the monotone contraction property of safety self-consistency operators, and that of task policy depends on the transformation of safety constraints into state-dependent action spaces. By adding two adversarial networks (one is for safety guarantee and the other is for task performance), we propose a practical deep RL algorithm for constrained zero-sum Markov games, called dually robust actor-critic (DRAC). The evaluations with safety-critical benchmarks demonstrate that DRAC achieves high performance and persistent safety under all scenarios (no adversary, safety adversary, performance adversary), outperforming all baselines by a large margin.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2024.3443916","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Reinforcement learning (RL) agents are vulnerable to adversarial disturbances, which can degrade task performance or break safety specifications. Existing methods either address safety requirements under the assumption of no adversary (e.g., safe RL) or focus only on robustness against performance adversaries (e.g., robust RL). Learning a single policy that is both safe and robust under any adversary remains a challenging open problem. The difficulty lies in tackling two intertwined aspects in the worst case: feasibility and optimality. Optimality is only valid inside a feasible region (i.e., a robust invariant set), while identifying the maximal feasible region in turn relies on learning the optimal policy. To address this issue, we propose a systematic framework that unifies safe RL and robust RL, covering problem formulation, iteration scheme, convergence analysis, and practical algorithm design. The unification is built upon constrained two-player zero-sum Markov games, in which the objective of the protagonist is twofold. For states inside the maximal robust invariant set, the goal is to pursue rewards under the condition of guaranteed safety; for states outside the maximal robust invariant set, the goal is to reduce the extent of constraint violation. A dual policy iteration scheme is proposed, which simultaneously optimizes a task policy and a safety policy. We prove that the iteration scheme converges to the optimal task policy, which maximizes the twofold objective in the worst case, and to the optimal safety policy, which stays as far away from the safety boundary as possible. The convergence of the safety policy is established by exploiting the monotone contraction property of safety self-consistency operators, while that of the task policy relies on transforming safety constraints into state-dependent action spaces. By adding two adversarial networks (one for safety guarantee and the other for task performance), we propose a practical deep RL algorithm for constrained zero-sum Markov games, called dually robust actor-critic (DRAC). Evaluations on safety-critical benchmarks demonstrate that DRAC achieves high performance and persistent safety under all scenarios (no adversary, safety adversary, performance adversary), outperforming all baselines by a large margin.
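The twofold objective described in the abstract can be summarized schematically as follows. This is a hedged sketch reconstructed from the abstract alone; the notation (feasible set S_r, adversary policy mu, violation measure h) is illustrative and not necessarily the paper's own.

```latex
% Schematic form of the twofold objective (illustrative notation, not the paper's):
% \pi: protagonist policy, \mu: adversary (disturbance) policy,
% S_r: maximal robust invariant set, h(s): constraint-violation measure.
\begin{aligned}
s \in S_r:\quad & \max_{\pi}\,\min_{\mu}\;
  \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t,d_t)\,\Big|\,s_0=s\Big]
  \quad \text{s.t. } s_t \in S_r \;\; \forall t \ge 0,\\[2pt]
s \notin S_r:\quad & \min_{\pi}\,\max_{\mu}\;
  \underbrace{\max_{t \ge 0}\, h(s_t)}_{\text{worst-case extent of constraint violation}}.
\end{aligned}
```

In words: inside the maximal robust invariant set the protagonist maximizes worst-case return while remaining feasible; outside it, the protagonist minimizes the worst-case extent of constraint violation against the adversary.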

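To make the dual-policy structure concrete, below is a minimal structural sketch of an agent with a task policy, a safety policy, and two adversarial networks, together with a switching rule that falls back to the safety policy outside the (approximate) feasible region. This is an assumption-laden illustration of the idea stated in the abstract, not the authors' implementation; all class names, network sizes, and the "safety value <= 0 means feasible" test are hypothetical.

```python
# Hedged structural sketch of a dually-robust actor-critic agent (names illustrative).
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    # Small two-hidden-layer network used for all policies and value functions.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, hidden), nn.Tanh(),
                         nn.Linear(hidden, out_dim))

class DRACAgentSketch(nn.Module):
    def __init__(self, obs_dim, act_dim, dist_dim):
        super().__init__()
        self.task_policy = mlp(obs_dim, act_dim)        # pursues worst-case reward
        self.safety_policy = mlp(obs_dim, act_dim)      # reduces worst-case violation
        self.task_adversary = mlp(obs_dim, dist_dim)    # disturbance attacking performance
        self.safety_adversary = mlp(obs_dim, dist_dim)  # disturbance attacking safety
        self.safety_value = mlp(obs_dim, 1)             # approximates worst-case violation

    def act(self, obs):
        # Inside the approximate robust invariant set, pursue reward safely;
        # outside it, fall back to the safety policy (feasibility test is assumed).
        if self.safety_value(obs).item() <= 0.0:
            return torch.tanh(self.task_policy(obs))
        return torch.tanh(self.safety_policy(obs))

# Usage (toy dimensions):
# agent = DRACAgentSketch(obs_dim=8, act_dim=2, dist_dim=2)
# action = agent.act(torch.zeros(8))
```

The switching rule mirrors the twofold objective: the task policy is only trusted where feasibility can be certified, while the two adversaries would be trained to attack reward and safety respectively during learning.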