Learning non-myopically from human-generated reward

IUI. International Conference on Intelligent User Interfaces | Pub Date: 2013-03-19 | DOI: 10.1145/2449396.2449422

W. B. Knox, P. Stone


Citations: 29

Abstract




Recent research has demonstrated that human-generated reward signals can be effectively used to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic - i.e., conducted in unconnected episodes of activity that often end in either goal or failure states - or continuing - i.e., indefinitely ongoing. Another point of difference is whether the learning agent highly discounts the value of future reward - a myopic agent - or conversely values future reward appreciably. In recent work, we found that previous approaches to learning from human reward all used myopic valuation [7]. This study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic.

In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity, optimal behavior with respect to a Markov Decision Process, and lack of a failure state in the goal-based task. In the first experiment, we show that converting a simple episodic task to a non-episodic (i.e., continuing) task resolves some theoretical issues present in episodic tasks with generally positive reward and - relatedly - enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm in this paper, which we call "VI-TAMER", is the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task. Anticipating the complexity of real-world problems, we perform two subsequent user studies - one with a failure state added - that compare (1) learning when states are updated asynchronously with local bias - i.e., states quickly reachable from the agent's current state are updated more often than other states - to (2) learning with the fully synchronous sweeps across each state in the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct research challenge for future work.
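The interaction between discounting, episodicity, and generally positive reward can be illustrated with a small sketch. The code below is not the authors' VI-TAMER implementation; it is a toy value iteration with synchronous sweeps over a five-state chain, where `human_reward` is a hard-coded, hypothetical stand-in for a learned model of human reward. The chain layout, the reward values (1.0 for moving toward the goal, 0.5 for moving away), the discount factors, and the reset-to-start rule used for the continuing variant are all assumptions chosen for illustration.

```python
# Toy illustration only: not the authors' VI-TAMER implementation.
N = 5                      # chain of states 0..4 (assumed layout)
START, GOAL = 0, N - 1
LEFT, RIGHT = 0, 1


def human_reward(state, action):
    """Hypothetical stand-in for a learned model of human reward.

    Always positive, and larger for moving toward the goal.
    """
    return 1.0 if action == RIGHT else 0.5


def next_state(state, action, continuing):
    """Deterministic chain dynamics with a wall at the left end."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    if continuing and nxt == GOAL:
        return START       # continuing task: reaching the goal resets the agent
    return nxt


def q_values(V, state, gamma, continuing):
    """One-step lookahead values for both actions under value estimate V."""
    return [
        human_reward(state, a) + gamma * V[next_state(state, a, continuing)]
        for a in (LEFT, RIGHT)
    ]


def greedy_policy(gamma, continuing, sweeps=2000):
    """Run synchronous value-iteration sweeps, then act greedily per state."""
    # V[GOAL] stays 0: terminal when episodic, never occupied when continuing.
    V = [0.0] * N
    for _ in range(sweeps):
        # Every non-goal state is updated from the previous sweep's values.
        V = [max(q_values(V, s, gamma, continuing)) for s in range(GOAL)] + [0.0]
    policy = []
    for s in range(GOAL):
        q = q_values(V, s, gamma, continuing)
        policy.append(RIGHT if q[RIGHT] > q[LEFT] else LEFT)
    return policy


if __name__ == "__main__":
    names = {LEFT: "L", RIGHT: "R"}
    for gamma, continuing in [(0.0, False), (0.99, False), (0.99, True)]:
        policy = greedy_policy(gamma, continuing)
        kind = "continuing" if continuing else "episodic"
        print(f"gamma={gamma}, {kind}:", "".join(names[a] for a in policy))
```

In this toy setup, the myopic agent (discount 0) moves right toward the goal from every state. The non-myopic episodic agent (discount 0.99) turns back at the state next to the goal, because ending the episode cuts off a stream of positive reward. The non-myopic continuing agent moves right from every state, since reaching the goal no longer ends reward accumulation. This is one plausible reading of the issue the abstract alludes to, offered under the stated assumptions.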

