Authors: Pranay Thangeda, Melkior Ornik
Venue: Proceedings of the ... American Control Conference, vol. 2020, pp. 1099-1104
Published: 2020-07-01 (Epub 2020-07-27)
DOI: https://doi.org/10.23919/acc45564.2020.9147372
Safety-Guaranteed, Accelerated Learning in MDPs with Local Side Information.
In environments with uncertain dynamics, synthesizing optimal control policies requires exploration. The applicability of classical learning algorithms to real-world problems is often limited by the number of time steps required to learn the environment model. Given some local side information about the differences in transition probabilities of the states, potentially obtained from the agent's onboard sensors, we generalize the idea of indirect sampling for accelerated learning and propose an algorithm that balances exploration and exploitation. We formalize this idea by introducing the notion of the value of information in the context of a Markov decision process with unknown transition probabilities, as a measure of the expected improvement in the agent's current estimate of the transition probabilities from taking a particular action. By exploiting available local side information and maximizing the estimated value of learned information at each time step, we accelerate the learning process and the subsequent synthesis of the optimal control policy. Further, we define the notion of agent safety, a vital consideration for physical systems, in the context of our problem. Under certain assumptions, we provide guarantees on the safety of an agent that explores using our algorithm and exploits local side information. We illustrate agent safety and the improvement in learning speed using numerical experiments in the setting of a Mars rover, with data from onboard sensors acting as the local side information.
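The abstract does not state the formal definition of the value of information, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's algorithm. It assumes a count-based estimate of the transition probabilities with a uniform pseudo-count prior, and it uses a hypothetical proxy for the value of information: the expected L1 change in that estimate after one more sample. Local side information and the safety guarantees are not modeled. The function names and the action labels `drive` and `climb` are illustrative.

```python
def estimate(counts, n_states, prior=1.0):
    """Smoothed estimate of P(next state | s, a) from visit counts.

    Assumption: a uniform pseudo-count prior, so the estimate is defined
    even before any sample has been collected.
    """
    total = sum(counts) + prior * n_states
    return [(c + prior) / total for c in counts]


def expected_estimate_change(counts, n_states, prior=1.0):
    """Hypothetical value-of-information proxy for one state-action pair.

    Returns the expected L1 change in the estimate after one more sample,
    with the next state drawn from the current estimate.
    """
    p = estimate(counts, n_states, prior)
    value = 0.0
    for nxt in range(n_states):
        updated = list(counts)
        updated[nxt] += 1  # pretend the next sample lands in state `nxt`
        q = estimate(updated, n_states, prior)
        value += p[nxt] * sum(abs(qi - pi) for qi, pi in zip(q, p))
    return value


def most_informative_action(counts_by_action, n_states):
    """Greedily pick the action whose next sample should move the estimate most."""
    return max(
        counts_by_action,
        key=lambda a: expected_estimate_change(counts_by_action[a], n_states),
    )


# Observed next-state counts for two actions in one state (3 states in total).
counts_by_action = {
    "drive": [40, 5, 5],  # sampled 50 times, estimate already fairly stable
    "climb": [1, 0, 1],   # sampled twice, estimate still very uncertain
}
choice = most_informative_action(counts_by_action, n_states=3)
print(choice)
```

Under this proxy the rarely sampled action is preferred, since one more observation shifts its estimate far more than it shifts the estimate of a well-sampled action. A purely greedy rule like this one only explores; trading it off against exploitation, as the abstract describes, would require an additional term for expected reward.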