Reward Machines for Deep RL in Noisy and Uncertain Environments

Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith
{"title":"Reward Machines for Deep RL in Noisy and Uncertain Environments","authors":"Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith","doi":"arxiv-2406.00120","DOIUrl":null,"url":null,"abstract":"Reward Machines provide an automata-inspired structure for specifying\ninstructions, safety constraints, and other temporally extended reward-worthy\nbehaviour. By exposing complex reward function structure, they enable\ncounterfactual learning updates that have resulted in impressive sample\nefficiency gains. While Reward Machines have been employed in both tabular and\ndeep RL settings, they have typically relied on a ground-truth interpretation\nof the domain-specific vocabulary that form the building blocks of the reward\nfunction. Such ground-truth interpretations can be elusive in many real-world\nsettings, due in part to partial observability or noisy sensing. In this paper,\nwe explore the use of Reward Machines for Deep RL in noisy and uncertain\nenvironments. We characterize this problem as a POMDP and propose a suite of RL\nalgorithms that leverage task structure under uncertain interpretation of\ndomain-specific vocabulary. Theoretical analysis exposes pitfalls in naive\napproaches to this problem, while experimental results show that our algorithms\nsuccessfully leverage task structure to improve performance under noisy\ninterpretations of the vocabulary. Our results provide a general framework for\nexploiting Reward Machines in partially observable environments.","PeriodicalId":501124,"journal":{"name":"arXiv - CS - Formal Languages and Automata Theory","volume":"6 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Formal Languages and Automata Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.00120","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Reward Machines provide an automata-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing complex reward function structure, they enable counterfactual learning updates that have resulted in impressive sample efficiency gains. While Reward Machines have been employed in both tabular and deep RL settings, they have typically relied on a ground-truth interpretation of the domain-specific vocabulary that forms the building blocks of the reward function. Such ground-truth interpretations can be elusive in many real-world settings, due in part to partial observability or noisy sensing. In this paper, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that leverage task structure under uncertain interpretation of domain-specific vocabulary. Theoretical analysis exposes pitfalls in naive approaches to this problem, while experimental results show that our algorithms successfully leverage task structure to improve performance under noisy interpretations of the vocabulary. Our results provide a general framework for exploiting Reward Machines in partially observable environments.
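
To ground the abstract's claims about "exposing complex reward function structure" and "counterfactual learning updates", the sketch below shows a minimal Reward Machine encoding and the per-RM-state experience synthesis that such structure permits. This is an illustrative sketch only: the RewardMachine class, the counterfactual_experiences function, and the got_key / at_goal propositions are assumptions of this example rather than the paper's code, and the sketch presupposes a ground-truth labelling of the vocabulary (true_props), which is precisely the assumption the paper relaxes.

```python
# Minimal, hypothetical sketch of a Reward Machine and the counterfactual
# updates it enables. Names and structure are illustrative assumptions,
# not the paper's implementation.
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Automaton whose transitions fire on truth assignments over a
    domain-specific vocabulary (e.g. {"got_key", "at_goal"})."""
    initial_state: int
    # delta maps (rm_state, frozenset of true propositions) -> (next_rm_state, reward)
    delta: dict = field(default_factory=dict)

    def step(self, u, true_props):
        # Unmatched labels self-loop with zero reward.
        return self.delta.get((u, frozenset(true_props)), (u, 0.0))


def counterfactual_experiences(rm, rm_states, s, a, true_props, s_next):
    """For one environment transition (s, a, s') labelled with true_props,
    synthesize one experience per RM state u, as if the agent had been in
    the product state (s, u)."""
    experiences = []
    for u in rm_states:
        u_next, r = rm.step(u, true_props)
        experiences.append(((s, u), a, r, (s_next, u_next)))
    return experiences


# Example: a two-step task -- pick up a key, then reach the goal.
rm = RewardMachine(
    initial_state=0,
    delta={
        (0, frozenset({"got_key"})): (1, 0.0),
        (1, frozenset({"at_goal"})): (2, 1.0),  # state 2 is terminal
    },
)
batch = counterfactual_experiences(rm, rm_states=[0, 1], s="s3", a="north",
                                   true_props={"got_key"}, s_next="s4")
```

Because each environment transition yields one synthesized experience per Reward Machine state, an off-policy learner can update every product state (s, u) from a single interaction, which is where the sample-efficiency gains come from. Under a noisy interpretation of the vocabulary, true_props is no longer directly observable, and that is the setting the paper's POMDP formulation and algorithms address.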