Concurrent Learning of Control Policy and Unknown Safety Specifications in Reinforcement Learning

Lunet Yifru; Ali Baheri
{"title":"Concurrent Learning of Control Policy and Unknown Safety Specifications in Reinforcement Learning","authors":"Lunet Yifru;Ali Baheri","doi":"10.1109/OJCSYS.2024.3418306","DOIUrl":null,"url":null,"abstract":"Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet, deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. However, this reliance on predefined safety constraints poses limitations in dynamic and unpredictable real-world settings where such constraints may not be available or sufficiently adaptable. Bridging this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Initializing with a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task, intricately integrating constrained policy optimization, using a Lagrangian-variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization for optimizing parameters for the given pSTL safety specification. Through experimentation in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Furthermore, our findings indicate successful learning of STL safety constraint parameters, exhibiting a high degree of conformity with true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario that possesses complete prior knowledge of safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to those constraints. A Python implementation of the algorithm can be found at \n<uri>https://github.com/SAILRIT/Concurrent-Learning-of-Control-Policy-and-Unknown-Constraints-in-Reinforcement-Learning.git</uri>\n.","PeriodicalId":73299,"journal":{"name":"IEEE open journal of control systems","volume":"3 ","pages":"266-281"},"PeriodicalIF":0.0000,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10569078","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of control systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10569078/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet, deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. However, this reliance on predefined safety constraints poses limitations in dynamic and unpredictable real-world settings where such constraints may not be available or sufficiently adaptable. Bridging this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Initializing with a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task that integrates constrained policy optimization, using a Lagrangian variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization of the parameters of the given pSTL safety specification. Through experimentation in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Furthermore, our findings indicate successful learning of STL safety constraint parameters, which exhibit a high degree of conformity with the true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario with complete prior knowledge of the safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to them. A Python implementation of the algorithm can be found at https://github.com/SAILRIT/Concurrent-Learning-of-Control-Policy-and-Unknown-Constraints-in-Reinforcement-Learning.git.
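To make the bilevel structure concrete, below is a minimal, self-contained Python sketch. It is not the authors' implementation (that lives in the linked repository): it assumes a toy pSTL template G(state ≤ θ) with a single unknown threshold θ, uses scikit-optimize's gp_minimize as the Bayesian-optimization outer loop, and replaces Lagrangian TD3 with a scalar primal-dual update that only illustrates the shape of constrained policy optimization. All function and variable names are illustrative placeholders.

```python
"""Toy sketch of the bilevel loop from the abstract (not the authors' code).

Illustrative assumptions: the pSTL template is G(state <= theta) with one
unknown threshold theta; trajectories are 1-D arrays; Lagrangian TD3 is
replaced by a scalar primal-dual update showing only the update shape.
"""
import numpy as np
from skopt import gp_minimize  # Gaussian-process Bayesian optimization

rng = np.random.default_rng(0)

# Small initial labeled dataset: trajectories marked safe (1) / unsafe (0)
# by a ground-truth threshold of 1.2 that the learner never observes.
trajectories = [rng.uniform(0.0, hi, size=50)
                for hi in rng.uniform(0.2, 2.0, size=40)]
labels = np.array([float(t.max() <= 1.2) for t in trajectories])

def robustness(traj, theta):
    """STL robustness of G(state <= theta): positive iff the formula holds."""
    return theta - traj.max()

def label_mismatch(params):
    """Outer objective: fraction of labeled trajectories the candidate theta
    misclassifies; zero means the pSTL instance fits the labeled data."""
    theta = params[0]
    predicted = np.array([float(robustness(t, theta) >= 0.0)
                          for t in trajectories])
    return float(np.mean(predicted != labels))

def train_constrained_policy(theta, iters=200, lr=0.05):
    """Inner-loop stand-in for Lagrangian TD3: primal ascent on a scalar
    'action' a (higher a, higher reward) and dual ascent on the multiplier
    lam for the safety constraint a <= theta."""
    a, lam = 2.0, 0.0
    for _ in range(iters):
        a += lr * (1.0 - lam)                    # primal step: reward minus penalty
        lam = max(0.0, lam + lr * (a - theta))   # dual step: grow lam on violation
    return a, lam

# Outer loop: Bayesian optimization over the unknown pSTL parameter theta.
result = gp_minimize(label_mismatch, [(0.0, 2.0)], n_calls=25, random_state=0)
theta_hat = result.x[0]
action, multiplier = train_constrained_policy(theta_hat)
print(f"learned theta ~ {theta_hat:.2f}; constrained action ~ {action:.2f}")
```

The abstract frames the two levels as running concurrently, with policy rollouts supplying further labeled trajectories that refine θ; this sketch runs the two levels once in sequence purely to keep the example short.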