{"title":"从人类产生的奖励中非近视地学习","authors":"W. B. Knox, P. Stone","doi":"10.1145/2449396.2449422","DOIUrl":null,"url":null,"abstract":"Recent research has demonstrated that human-generated reward signals can be effectively used to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic - i.e., conducted in unconnected episodes of activity that often end in either goal or failure states - or continuing - i.e., indefinitely ongoing. Another point of difference is whether the learning agent highly discounts the value of future reward - a myopic agent - or conversely values future reward appreciably. In recent work, we found that previous approaches to learning from human reward all used myopic valuation [7]. This study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic.\n In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity, optimal behavior with respect to a Markov Decision Process, and lack of a failure state in the goal-based task. In the first experiment, we show that converting a simple episodic task to non-episodic (i.e., continuing) task resolves some theoretical issues present in episodic tasks with generally positive reward and - relatedly - enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm in this paper, which we call \"VI-TAMER\", is it the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task. Anticipating the complexity of real-world problems, we perform two subsequent user studies - one with a failure state added - that compare (1) learning when states are updated asynchronously with local bias - i.e., states quickly reachable from the agent's current state are updated more often than other states - to (2) learning with the fully synchronous sweeps across each state in the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct research challenge for future work.","PeriodicalId":87287,"journal":{"name":"IUI. International Conference on Intelligent User Interfaces","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":"{\"title\":\"Learning non-myopically from human-generated reward\",\"authors\":\"W. B. Knox, P. Stone\",\"doi\":\"10.1145/2449396.2449422\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent research has demonstrated that human-generated reward signals can be effectively used to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic - i.e., conducted in unconnected episodes of activity that often end in either goal or failure states - or continuing - i.e., indefinitely ongoing. Another point of difference is whether the learning agent highly discounts the value of future reward - a myopic agent - or conversely values future reward appreciably. In recent work, we found that previous approaches to learning from human reward all used myopic valuation [7]. This study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic.\\n In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity, optimal behavior with respect to a Markov Decision Process, and lack of a failure state in the goal-based task. In the first experiment, we show that converting a simple episodic task to non-episodic (i.e., continuing) task resolves some theoretical issues present in episodic tasks with generally positive reward and - relatedly - enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm in this paper, which we call \\\"VI-TAMER\\\", is it the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task. Anticipating the complexity of real-world problems, we perform two subsequent user studies - one with a failure state added - that compare (1) learning when states are updated asynchronously with local bias - i.e., states quickly reachable from the agent's current state are updated more often than other states - to (2) learning with the fully synchronous sweeps across each state in the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct research challenge for future work.\",\"PeriodicalId\":87287,\"journal\":{\"name\":\"IUI. International Conference on Intelligent User Interfaces\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"29\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IUI. International Conference on Intelligent User Interfaces\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2449396.2449422\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IUI. International Conference on Intelligent User Interfaces","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2449396.2449422","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Learning non-myopically from human-generated reward
Recent research has demonstrated that human-generated reward signals can be effectively used to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic - i.e., conducted in unconnected episodes of activity that often end in either goal or failure states - or continuing - i.e., indefinitely ongoing. Another point of difference is whether the learning agent highly discounts the value of future reward - a myopic agent - or conversely values future reward appreciably. In recent work, we found that previous approaches to learning from human reward all used myopic valuation [7]. This study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic.
In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity, optimal behavior with respect to a Markov Decision Process, and lack of a failure state in the goal-based task. In the first experiment, we show that converting a simple episodic task to non-episodic (i.e., continuing) task resolves some theoretical issues present in episodic tasks with generally positive reward and - relatedly - enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm in this paper, which we call "VI-TAMER", is it the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task. Anticipating the complexity of real-world problems, we perform two subsequent user studies - one with a failure state added - that compare (1) learning when states are updated asynchronously with local bias - i.e., states quickly reachable from the agent's current state are updated more often than other states - to (2) learning with the fully synchronous sweeps across each state in the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct research challenge for future work.