{"title":"Concurrent Learning of Control Policy and Unknown Safety Specifications in Reinforcement Learning","authors":"Lunet Yifru;Ali Baheri","doi":"10.1109/OJCSYS.2024.3418306","DOIUrl":null,"url":null,"abstract":"Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet, deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. However, this reliance on predefined safety constraints poses limitations in dynamic and unpredictable real-world settings where such constraints may not be available or sufficiently adaptable. Bridging this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Initializing with a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task, intricately integrating constrained policy optimization, using a Lagrangian-variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization for optimizing parameters for the given pSTL safety specification. Through experimentation in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Furthermore, our findings indicate successful learning of STL safety constraint parameters, exhibiting a high degree of conformity with true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario that possesses complete prior knowledge of safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to those constraints. A Python implementation of the algorithm can be found at \n<uri>https://github.com/SAILRIT/Concurrent-Learning-of-Control-Policy-and-Unknown-Constraints-in-Reinforcement-Learning.git</uri>\n.","PeriodicalId":73299,"journal":{"name":"IEEE open journal of control systems","volume":"3 ","pages":"266-281"},"PeriodicalIF":0.0000,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10569078","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of control systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10569078/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. This reliance on predefined constraints, however, is limiting in dynamic and unpredictable real-world settings, where such constraints may be unavailable or insufficiently adaptable. To bridge this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Initializing with a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task that integrates constrained policy optimization, using a Lagrangian variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization over the parameters of the given pSTL safety specification. Through comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Our findings further indicate successful learning of the STL safety constraint parameters, which conform closely to the true environmental safety constraints. The model's performance closely mirrors that of an ideal scenario with complete prior knowledge of the safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to them. A Python implementation of the algorithm is available at https://github.com/SAILRIT/Concurrent-Learning-of-Control-Policy-and-Unknown-Constraints-in-Reinforcement-Learning.git.
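To make the bilevel structure concrete, below is a minimal sketch of the alternation the abstract describes, written as an illustration rather than taken from the linked repository: an outer loop proposes candidate pSTL parameters, and an inner loop trains a constrained policy under them. All function names and bodies here (train_lagrangian_td3, label_rollouts, specification_fit) are hypothetical stubs, and plain random search stands in for the paper's Bayesian optimization of the pSTL parameters.

```python
import numpy as np

# --- Hypothetical placeholders; the real versions live in the authors' repo. ---

def train_lagrangian_td3(pstl_params, n_steps=10_000):
    """Inner loop: constrained policy optimization.

    Stands in for a Lagrangian variant of TD3 that maximizes return
    subject to the pSTL constraint instantiated by `pstl_params`.
    Returns a dummy policy handle here.
    """
    return {"params": pstl_params, "steps": n_steps}  # placeholder policy

def label_rollouts(policy, n_rollouts=20, rng=None):
    """Collect trajectories under the policy and label them safe/unsafe,
    growing the (initially small) labeled dataset. Dummy labels here."""
    rng = rng or np.random.default_rng(0)
    return rng.random(n_rollouts) > 0.5  # placeholder safety labels

def specification_fit(pstl_params, labels):
    """Outer objective: score how well the pSTL spec, instantiated with
    `pstl_params`, classifies the labeled trajectories. Dummy score here."""
    return -abs(pstl_params.sum() - labels.mean())  # placeholder

# --- Bilevel loop: random search stands in for Bayesian optimization. ---

def concurrent_learning(n_outer=10, param_dim=2, seed=0):
    rng = np.random.default_rng(seed)
    best_params, best_score = None, -np.inf
    for _ in range(n_outer):
        # Outer step: propose candidate pSTL parameters
        # (the paper uses Bayesian optimization; random search here).
        candidate = rng.uniform(-1.0, 1.0, size=param_dim)
        # Inner step: train a constrained policy under the candidate spec.
        policy = train_lagrangian_td3(candidate)
        # Score the candidate spec against newly labeled rollouts.
        labels = label_rollouts(policy, rng=rng)
        score = specification_fit(candidate, labels)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

if __name__ == "__main__":
    params, score = concurrent_learning()
    print("best pSTL parameters:", params, "score:", score)
```

In the method proper, the outer objective rewards parameter settings whose instantiated specification separates safe from unsafe trajectories, and the Bayesian optimizer uses those scores to propose the next candidate; the sketch above only preserves that control flow.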