DSMC Evaluation Stages: Fostering Robust and Safe Behavior in Deep Reinforcement Learning – Extended Version

IF 0.7 · Region 4 (Computer Science) · Q4 (Computer Science, Interdisciplinary Applications) · ACM Transactions on Modeling and Computer Simulation · Pub Date: 2023-07-12 · DOI: 10.1145/3607198

Timo P. Gros, D. Höller, Jörg Hoffmann, M. Klauck, Hendrik Meerkamp, Verena Wolf

Citations: 6

Abstract

Neural networks (NN) are gaining importance in sequential decision-making. Deep reinforcement learning (DRL), in particular, is extremely successful in learning action policies in complex and dynamic environments. Despite this success, however, DRL technology is not without its failures, especially in safety-critical applications: (i) the training objective maximizes average rewards, which may disregard rare but critical situations and hence lack local robustness; (ii) optimization objectives targeting safety typically yield degenerate reward structures, which must be replaced with proxy objectives for DRL to work. Here we introduce a methodology that can help to address both deficiencies. We incorporate evaluation stages (ES) into DRL, leveraging recent work on deep statistical model checking (DSMC), which verifies NN policies in Markov decision processes. Our ES apply DSMC at regular intervals to determine state space regions with weak performance. We adapt the subsequent DRL training priorities based on the outcome, (i) focusing DRL on critical situations, and (ii) making it possible to foster arbitrary objectives. We run case studies on two benchmarks. One of them is the Racetrack, an abstraction of autonomous driving that requires navigating a map without crashing into a wall. The other is MiniGrid, a widely used benchmark in the AI community. Our results show that DSMC-based ES can significantly improve both (i) and (ii).
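The evaluation-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy simulator, the priority formula, and the use of a Hoeffding bound to choose the number of sampled runs are all assumptions made for illustration. The sketch estimates, per start state, how often the current policy reaches its goal, and then turns weak estimates into higher sampling priorities for subsequent training.

```python
import math
import random


def required_runs(epsilon, delta):
    """Runs needed so a Monte Carlo estimate is within epsilon of the true
    probability with confidence 1 - delta (Hoeffding bound; an assumption here)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def evaluation_stage(simulate, start_states, runs, rng):
    """Estimate the success probability of the current policy per start state."""
    estimates = {}
    for state in start_states:
        successes = sum(1 for _ in range(runs) if simulate(state, rng))
        estimates[state] = successes / runs
    return estimates


def training_priorities(estimates, floor=0.05):
    """Map success estimates to a sampling distribution that favours weak states.
    The floor keeps well-performing states from being dropped entirely."""
    weights = {s: max(1.0 - p, floor) for s, p in estimates.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def toy_simulate(state, rng):
    """Hypothetical stand-in for one policy rollout: returns True on success."""
    success_probability = 0.4 if state == "hard" else 0.95
    return rng.random() < success_probability


if __name__ == "__main__":
    rng = random.Random(0)
    runs = required_runs(epsilon=0.05, delta=0.05)
    estimates = evaluation_stage(toy_simulate, ["easy", "hard"], runs, rng)
    priorities = training_priorities(estimates)
    # Draw the start state of the next training episode from the priorities.
    next_start = rng.choices(list(priorities), weights=list(priorities.values()))[0]
    print(runs, estimates, priorities, next_start)
```

In a full training loop, a step like this would run at regular intervals between ordinary DRL updates, with `toy_simulate` replaced by rollouts of the current NN policy in the environment.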

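Source journal

ACM Transactions on Modeling and Computer Simulation (Engineering and Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 2.50

Self-citation rate: 22.20%

Publication volume: (not listed)

Review time: >12 weeks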

About the journal: The ACM Transactions on Modeling and Computer Simulation (TOMACS) provides a single archival source for the publication of high-quality research and developmental results referring to all phases of the modeling and simulation life cycle. The subjects of emphasis are discrete event simulation, combined discrete and continuous simulation, as well as Monte Carlo methods. The use of simulation techniques is pervasive, extending to virtually all the sciences. TOMACS serves to enhance the understanding, improve the practice, and increase the utilization of computer simulation. Submissions should contribute to the realization of these objectives, and papers treating applications should stress their contributions vis-à-vis these objectives.