Reward Machines for Deep RL in Noisy and Uncertain Environments

Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith
{"title":"噪声和不确定环境中的深度 RL 奖励机制","authors":"Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith","doi":"arxiv-2406.00120","DOIUrl":null,"url":null,"abstract":"Reward Machines provide an automata-inspired structure for specifying\ninstructions, safety constraints, and other temporally extended reward-worthy\nbehaviour. By exposing complex reward function structure, they enable\ncounterfactual learning updates that have resulted in impressive sample\nefficiency gains. While Reward Machines have been employed in both tabular and\ndeep RL settings, they have typically relied on a ground-truth interpretation\nof the domain-specific vocabulary that form the building blocks of the reward\nfunction. Such ground-truth interpretations can be elusive in many real-world\nsettings, due in part to partial observability or noisy sensing. In this paper,\nwe explore the use of Reward Machines for Deep RL in noisy and uncertain\nenvironments. We characterize this problem as a POMDP and propose a suite of RL\nalgorithms that leverage task structure under uncertain interpretation of\ndomain-specific vocabulary. Theoretical analysis exposes pitfalls in naive\napproaches to this problem, while experimental results show that our algorithms\nsuccessfully leverage task structure to improve performance under noisy\ninterpretations of the vocabulary. Our results provide a general framework for\nexploiting Reward Machines in partially observable environments.","PeriodicalId":501124,"journal":{"name":"arXiv - CS - Formal Languages and Automata Theory","volume":"6 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Reward Machines for Deep RL in Noisy and Uncertain Environments\",\"authors\":\"Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith\",\"doi\":\"arxiv-2406.00120\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Reward Machines provide an automata-inspired structure for specifying\\ninstructions, safety constraints, and other temporally extended reward-worthy\\nbehaviour. By exposing complex reward function structure, they enable\\ncounterfactual learning updates that have resulted in impressive sample\\nefficiency gains. While Reward Machines have been employed in both tabular and\\ndeep RL settings, they have typically relied on a ground-truth interpretation\\nof the domain-specific vocabulary that form the building blocks of the reward\\nfunction. Such ground-truth interpretations can be elusive in many real-world\\nsettings, due in part to partial observability or noisy sensing. In this paper,\\nwe explore the use of Reward Machines for Deep RL in noisy and uncertain\\nenvironments. We characterize this problem as a POMDP and propose a suite of RL\\nalgorithms that leverage task structure under uncertain interpretation of\\ndomain-specific vocabulary. Theoretical analysis exposes pitfalls in naive\\napproaches to this problem, while experimental results show that our algorithms\\nsuccessfully leverage task structure to improve performance under noisy\\ninterpretations of the vocabulary. 
Our results provide a general framework for\\nexploiting Reward Machines in partially observable environments.\",\"PeriodicalId\":501124,\"journal\":{\"name\":\"arXiv - CS - Formal Languages and Automata Theory\",\"volume\":\"6 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-05-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Formal Languages and Automata Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2406.00120\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Formal Languages and Automata Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.00120","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Reward Machines provide an automata-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing complex reward function structure, they enable counterfactual learning updates that have resulted in impressive sample efficiency gains. While Reward Machines have been employed in both tabular and deep RL settings, they have typically relied on a ground-truth interpretation of the domain-specific vocabulary that forms the building blocks of the reward function. Such ground-truth interpretations can be elusive in many real-world settings, due in part to partial observability or noisy sensing. In this paper, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that leverage task structure under uncertain interpretation of domain-specific vocabulary. Theoretical analysis exposes pitfalls in naive approaches to this problem, while experimental results show that our algorithms successfully leverage task structure to improve performance under noisy interpretations of the vocabulary. Our results provide a general framework for exploiting Reward Machines in partially observable environments.
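To make the ideas in the abstract concrete, here is a minimal, illustrative sketch (not code from the paper) of a Reward Machine: a finite automaton whose transitions are triggered by the propositions observed to hold at each step and emit scalar rewards, together with the kind of counterfactual update that reuses a single environment transition to generate a learning update for every RM state. The names `RewardMachine`, `delta_u`, `delta_r`, and the coffee-delivery task are hypothetical examples.

```python
# Illustrative Reward Machine sketch; assumptions only, not the paper's implementation.
from typing import Dict, FrozenSet, Tuple

class RewardMachine:
    def __init__(self,
                 states: Tuple[str, ...],
                 initial: str,
                 delta_u: Dict[Tuple[str, FrozenSet[str]], str],     # state-transition function
                 delta_r: Dict[Tuple[str, FrozenSet[str]], float]):  # reward function
        self.states = states
        self.initial = initial
        self.delta_u = delta_u
        self.delta_r = delta_r

    def step(self, u: str, true_props: FrozenSet[str]) -> Tuple[str, float]:
        """Advance the RM given the propositions observed to hold this step."""
        key = (u, true_props)
        return self.delta_u.get(key, u), self.delta_r.get(key, 0.0)

# Hypothetical task: "get coffee, then deliver it to the office".
rm = RewardMachine(
    states=("u0", "u1", "u_done"),
    initial="u0",
    delta_u={("u0", frozenset({"coffee"})): "u1",
             ("u1", frozenset({"office"})): "u_done"},
    delta_r={("u1", frozenset({"office"})): 1.0},
)

def counterfactual_updates(rm, s, a, s_next, true_props, q_update):
    """Counterfactual-style learning: one environment transition (s, a, s_next)
    yields an update for every RM state, not just the one currently occupied."""
    for u in rm.states:
        u_next, r = rm.step(u, true_props)
        q_update((s, u), a, r, (s_next, u_next))
```

The sample-efficiency gains mentioned in the abstract come from the last function: because the RM's transition and reward functions are exposed, each experienced transition can be replayed against all RM states.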
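The "noisy and uncertain" setting can be illustrated, again as an assumption-laden sketch rather than the paper's algorithm, by replacing the single RM state with a belief (a probability distribution over RM states) that is updated from a noisy estimate of which propositions hold. Below, `prop_probs` is a hypothetical sensor model giving independent per-proposition probabilities, and `rm` is the Reward Machine sketched above.

```python
from itertools import chain, combinations

def assignments(vocab):
    """All truth assignments, represented as the subset of propositions that hold."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(vocab, k) for k in range(len(vocab) + 1))]

def belief_update(rm, belief, prop_probs):
    """belief: dict mapping RM state -> probability.
    prop_probs: hypothetical noisy sensor output, an independent probability that
    each proposition holds this step. Returns the updated belief and the expected
    reward, marginalising over truth assignments and current RM states."""
    vocab = list(prop_probs)
    new_belief = {u: 0.0 for u in rm.states}
    expected_reward = 0.0
    for props in assignments(vocab):
        # Probability of this truth assignment under the assumed independent sensor model.
        p_props = 1.0
        for v in vocab:
            p_props *= prop_probs[v] if v in props else (1.0 - prop_probs[v])
        for u, p_u in belief.items():
            u_next, r = rm.step(u, props)
            new_belief[u_next] += p_u * p_props
            expected_reward += p_u * p_props * r
    return new_belief, expected_reward

# Example usage with the rm defined above:
# belief = {"u0": 1.0, "u1": 0.0, "u_done": 0.0}
# belief, r_hat = belief_update(rm, belief, {"coffee": 0.9, "office": 0.05})
```

This is only one natural way to handle uncertain interpretations of the vocabulary; the paper analyses several approaches and shows that naive ones have pitfalls, so consult the original text for the algorithms actually proposed.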