DSMC Evaluation Stages: Fostering Robust and Safe Behavior in Deep Reinforcement Learning – Extended Version

IF 0.7 | CAS Tier 4, Computer Science | JCR Q4, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | ACM Transactions on Modeling and Computer Simulation | Pub Date: 2023-07-12 | DOI: 10.1145/3607198
Timo P. Gros, D. Höller, Jörg Hoffmann, M. Klauck, Hendrik Meerkamp, Verena Wolf
{"title":"DSMC评估阶段:在深度强化学习中培养稳健和安全的行为-扩展版","authors":"Timo P. Gros, D. Höller, Jörg Hoffmann, M. Klauck, Hendrik Meerkamp, Verena Wolf","doi":"10.1145/3607198","DOIUrl":null,"url":null,"abstract":"Neural networks (NN) are gaining importance in sequential decision-making. Deep reinforcement learning (DRL), in particular, is extremely successful in learning action policies in complex and dynamic environments. Despite this success, however, DRL technology is not without its failures, especially in safety-critical applications: (i) the training objective maximizes average rewards, which may disregard rare but critical situations and hence lack local robustness; (ii) optimization objectives targeting safety typically yield degenerated reward structures which for DRL to work must be replaced with proxy objectives. Here we introduce a methodology that can help to address both deficiencies. We incorporate evaluation stages (ES) into DRL, leveraging recent work on deep statistical model checking (DSMC), which verifies NN policies in Markov decision processes. Our ES apply DSMC at regular intervals to determine state space regions with weak performance. We adapt the subsequent DRL training priorities based on the outcome, (i) focusing DRL on critical situations, and (ii) allowing to foster arbitrary objectives. We run case studies on two benchmarks. One of them is the Racetrack, an abstraction of autonomous driving that requires navigating a map without crashing into a wall. The other is MiniGrid, a widely used benchmark in the AI community. Our results show that DSMC-based ES can significantly improve both (i) and (ii).","PeriodicalId":50943,"journal":{"name":"ACM Transactions on Modeling and Computer Simulation","volume":" ","pages":""},"PeriodicalIF":0.7000,"publicationDate":"2023-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"DSMC Evaluation Stages: Fostering Robust and Safe Behavior in Deep Reinforcement Learning – Extended Version\",\"authors\":\"Timo P. Gros, D. Höller, Jörg Hoffmann, M. Klauck, Hendrik Meerkamp, Verena Wolf\",\"doi\":\"10.1145/3607198\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Neural networks (NN) are gaining importance in sequential decision-making. Deep reinforcement learning (DRL), in particular, is extremely successful in learning action policies in complex and dynamic environments. Despite this success, however, DRL technology is not without its failures, especially in safety-critical applications: (i) the training objective maximizes average rewards, which may disregard rare but critical situations and hence lack local robustness; (ii) optimization objectives targeting safety typically yield degenerated reward structures which for DRL to work must be replaced with proxy objectives. Here we introduce a methodology that can help to address both deficiencies. We incorporate evaluation stages (ES) into DRL, leveraging recent work on deep statistical model checking (DSMC), which verifies NN policies in Markov decision processes. Our ES apply DSMC at regular intervals to determine state space regions with weak performance. We adapt the subsequent DRL training priorities based on the outcome, (i) focusing DRL on critical situations, and (ii) allowing to foster arbitrary objectives. We run case studies on two benchmarks. One of them is the Racetrack, an abstraction of autonomous driving that requires navigating a map without crashing into a wall. 
The other is MiniGrid, a widely used benchmark in the AI community. Our results show that DSMC-based ES can significantly improve both (i) and (ii).\",\"PeriodicalId\":50943,\"journal\":{\"name\":\"ACM Transactions on Modeling and Computer Simulation\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.7000,\"publicationDate\":\"2023-07-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Modeling and Computer Simulation\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3607198\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Modeling and Computer Simulation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3607198","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 6

Abstract

Neural networks (NN) are gaining importance in sequential decision-making. Deep reinforcement learning (DRL), in particular, is extremely successful in learning action policies in complex and dynamic environments. Despite this success, however, DRL technology is not without its failures, especially in safety-critical applications: (i) the training objective maximizes average rewards, which may disregard rare but critical situations and hence lack local robustness; (ii) optimization objectives targeting safety typically yield degenerated reward structures which, for DRL to work, must be replaced with proxy objectives. Here we introduce a methodology that can help to address both deficiencies. We incorporate evaluation stages (ES) into DRL, leveraging recent work on deep statistical model checking (DSMC), which verifies NN policies in Markov decision processes. Our ES apply DSMC at regular intervals to determine state space regions with weak performance. We adapt the subsequent DRL training priorities based on the outcome, (i) focusing DRL on critical situations, and (ii) making it possible to foster arbitrary objectives. We run case studies on two benchmarks. One of them is the Racetrack, an abstraction of autonomous driving that requires navigating a map without crashing into a wall. The other is MiniGrid, a widely used benchmark in the AI community. Our results show that DSMC-based ES can significantly improve both (i) and (ii).
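To make the evaluation-stage idea concrete, the sketch below illustrates the loop described in the abstract: run deep statistical model checking (Monte Carlo rollouts of the current policy) over a set of start states, and turn the estimated goal-reaching probabilities into sampling weights that steer the next round of DRL training toward weakly performing regions. This is a minimal sketch, not the authors' implementation; the `policy.act`, `env.reset(start_state)`, and `env.step` interfaces, the threshold, and the weighting scheme are assumptions made purely for illustration.

```python
def dsmc_estimate(policy, env, start_state, runs=200, horizon=100):
    """Statistical model checking by simulation: estimate the probability that
    the NN policy reaches the goal when started in `start_state`."""
    successes = 0
    for _ in range(runs):
        state = env.reset(start_state)                    # assumed: reset to a chosen start state
        for _ in range(horizon):
            action = policy.act(state)                    # assumed: NN policy returns an action
            state, done, reached_goal = env.step(action)  # assumed step signature
            if done:
                successes += int(reached_goal)
                break
    return successes / runs


def evaluation_stage(policy, env, start_states, threshold=0.9):
    """One evaluation stage: identify start states where the policy performs
    weakly and return sampling weights that prioritize them in training."""
    weights = {}
    for s in start_states:                                # start states assumed hashable
        p_goal = dsmc_estimate(policy, env, s)
        # States far below the desired goal probability get proportionally more weight;
        # the small base weight keeps every start state in the sampling distribution.
        weights[s] = max(threshold - p_goal, 0.0) + 1e-3
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}
```

Between evaluation stages, ordinary DRL training would continue, but with start states (or replay priorities) drawn according to these weights, so that rare but critical situations receive a larger share of the training effort.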
Source Journal
ACM Transactions on Modeling and Computer Simulation (Engineering & Technology – Computer Science: Interdisciplinary Applications)
CiteScore: 2.50
Self-citation rate: 22.20%
Articles published: 29
Review time: >12 weeks
Journal Description: The ACM Transactions on Modeling and Computer Simulation (TOMACS) provides a single archival source for the publication of high-quality research and developmental results referring to all phases of the modeling and simulation life cycle. The subjects of emphasis are discrete event simulation, combined discrete and continuous simulation, as well as Monte Carlo methods. The use of simulation techniques is pervasive, extending to virtually all the sciences. TOMACS serves to enhance the understanding, improve the practice, and increase the utilization of computer simulation. Submissions should contribute to the realization of these objectives, and papers treating applications should stress their contributions vis-à-vis these objectives.
Latest Articles in This Journal
Reproducibility Report for the Paper: A Toolset for Predicting Performance of Legacy Real-Time Software Based on the RAST Approach
Context, Composition, Automation, and Communication - The C2AC Roadmap for Modeling and Simulation
Adaptive Synchronization and Pacing Control for Visual Interactive Simulation
Generating Hidden Markov Models from Process Models Through Nonnegative Tensor Factorization