Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith
{"title":"噪声和不确定环境中的深度 RL 奖励机制","authors":"Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith","doi":"arxiv-2406.00120","DOIUrl":null,"url":null,"abstract":"Reward Machines provide an automata-inspired structure for specifying\ninstructions, safety constraints, and other temporally extended reward-worthy\nbehaviour. By exposing complex reward function structure, they enable\ncounterfactual learning updates that have resulted in impressive sample\nefficiency gains. While Reward Machines have been employed in both tabular and\ndeep RL settings, they have typically relied on a ground-truth interpretation\nof the domain-specific vocabulary that form the building blocks of the reward\nfunction. Such ground-truth interpretations can be elusive in many real-world\nsettings, due in part to partial observability or noisy sensing. In this paper,\nwe explore the use of Reward Machines for Deep RL in noisy and uncertain\nenvironments. We characterize this problem as a POMDP and propose a suite of RL\nalgorithms that leverage task structure under uncertain interpretation of\ndomain-specific vocabulary. Theoretical analysis exposes pitfalls in naive\napproaches to this problem, while experimental results show that our algorithms\nsuccessfully leverage task structure to improve performance under noisy\ninterpretations of the vocabulary. Our results provide a general framework for\nexploiting Reward Machines in partially observable environments.","PeriodicalId":501124,"journal":{"name":"arXiv - CS - Formal Languages and Automata Theory","volume":"6 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Reward Machines for Deep RL in Noisy and Uncertain Environments\",\"authors\":\"Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith\",\"doi\":\"arxiv-2406.00120\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Reward Machines provide an automata-inspired structure for specifying\\ninstructions, safety constraints, and other temporally extended reward-worthy\\nbehaviour. By exposing complex reward function structure, they enable\\ncounterfactual learning updates that have resulted in impressive sample\\nefficiency gains. While Reward Machines have been employed in both tabular and\\ndeep RL settings, they have typically relied on a ground-truth interpretation\\nof the domain-specific vocabulary that form the building blocks of the reward\\nfunction. Such ground-truth interpretations can be elusive in many real-world\\nsettings, due in part to partial observability or noisy sensing. In this paper,\\nwe explore the use of Reward Machines for Deep RL in noisy and uncertain\\nenvironments. We characterize this problem as a POMDP and propose a suite of RL\\nalgorithms that leverage task structure under uncertain interpretation of\\ndomain-specific vocabulary. Theoretical analysis exposes pitfalls in naive\\napproaches to this problem, while experimental results show that our algorithms\\nsuccessfully leverage task structure to improve performance under noisy\\ninterpretations of the vocabulary. 
Our results provide a general framework for\\nexploiting Reward Machines in partially observable environments.\",\"PeriodicalId\":501124,\"journal\":{\"name\":\"arXiv - CS - Formal Languages and Automata Theory\",\"volume\":\"6 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-05-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Formal Languages and Automata Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2406.00120\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Formal Languages and Automata Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.00120","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Reward Machines for Deep RL in Noisy and Uncertain Environments
Reward Machines provide an automata-inspired structure for specifying
instructions, safety constraints, and other temporally extended reward-worthy
behaviour. By exposing complex reward function structure, they enable
counterfactual learning updates that have resulted in impressive sample
efficiency gains. While Reward Machines have been employed in both tabular and
deep RL settings, they have typically relied on a ground-truth interpretation
of the domain-specific vocabulary that forms the building blocks of the reward
function. Such ground-truth interpretations can be elusive in many real-world
settings, due in part to partial observability or noisy sensing. In this paper,
we explore the use of Reward Machines for Deep RL in noisy and uncertain
environments. We characterize this problem as a POMDP and propose a suite of RL
algorithms that leverage task structure under uncertain interpretation of
domain-specific vocabulary. Theoretical analysis exposes pitfalls in naive
approaches to this problem, while experimental results show that our algorithms
successfully leverage task structure to improve performance under noisy
interpretations of the vocabulary. Our results provide a general framework for
exploiting Reward Machines in partially observable environments.
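
To make the abstract's terminology concrete, the following is a minimal, hypothetical sketch of a Reward Machine paired with a noisy labelling function. The propositions ("coffee", "office"), the task, and the noise model are illustrative assumptions and do not reflect the paper's actual algorithms or experiments.

# Illustrative sketch only (not the paper's implementation): a minimal Reward
# Machine together with a noisy, probabilistic interpretation of its vocabulary.

import random

class RewardMachine:
    """Finite-state machine mapping truth assignments over propositions
    to reward-machine state transitions and rewards."""

    def __init__(self, initial_state, transitions):
        # transitions: {(state, frozenset_of_true_props): (next_state, reward)}
        self.initial_state = initial_state
        self.transitions = transitions

    def step(self, state, true_props):
        key = (state, frozenset(true_props))
        # Stay in the same state with zero reward if no transition is defined.
        return self.transitions.get(key, (state, 0.0))


# Hypothetical "deliver coffee" task: u0 --coffee--> u1 --office--> u2 (+1 reward).
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", frozenset({"coffee"})): ("u1", 0.0),
        ("u1", frozenset({"office"})): ("u2", 1.0),
    },
)


def noisy_labelling_fn(env_observation, noise=0.1):
    """Stand-in for an uncertain interpretation of the vocabulary: a sensor or
    classifier that flips each proposition's truth value with some probability."""
    true_props = set()
    for prop, actually_true in env_observation.items():
        believed_true = actually_true if random.random() > noise else not actually_true
        if believed_true:
            true_props.add(prop)
    return true_props


# One (possibly mislabelled) step: the agent believes it observed "coffee"
# and advances the Reward Machine accordingly.
props = noisy_labelling_fn({"coffee": True, "office": False})
state, reward = rm.step("u0", props)
print(state, reward)

In a ground-truth setting the labelling function would report propositions exactly; the noise parameter above is only meant to illustrate why, under partial observability or noisy sensing, the agent must reason about uncertain interpretations of the vocabulary rather than trusting the labels outright.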