Teaching AI Agents Ethical Values Using Reinforcement Learning and Policy Orchestration

Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush R. Varshney, Murray Campbell, Moninder Singh, F. Rossi
{"title":"Teaching AI Agents Ethical Values Using Reinforcement Learning and Policy Orchestration","authors":"Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush R. Varshney, Murray Campbell, Moninder Singh, F. Rossi","doi":"10.24963/ijcai.2019/891","DOIUrl":null,"url":null,"abstract":"Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in ways aligned with the values of society, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. \nWe detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations and reinforcement learning to learn to maximize environmental rewards. A contextual bandit-based orchestrator then picks between the two policies: constraint-based and environment reward-based. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constrained policy. In addition, the orchestrator is transparent on which policy is being employed at each time step. We test our algorithms using Pac-Man and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.","PeriodicalId":13051,"journal":{"name":"IBM J. Res. Dev.","volume":"16 1","pages":"2:1-2:9"},"PeriodicalIF":0.0000,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"56","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IBM J. Res. Dev.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.24963/ijcai.2019/891","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 56

Abstract

Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in ways aligned with the values of society, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. We detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations and reinforcement learning to learn to maximize environmental rewards. A contextual bandit-based orchestrator then picks between the two policies: constraint-based and environment reward-based. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constrained policy. In addition, the orchestrator is transparent on which policy is being employed at each time step. We test our algorithms using Pac-Man and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.
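To make the orchestration idea concrete, the sketch below shows a contextual bandit choosing, at each time step, between a reward-maximizing policy and a constraint-respecting policy learned from demonstrations. This is only an illustration under assumptions, not the authors' implementation: the LinUCB-style update, the placeholder sub-policies, and the `featurize` function are stand-ins invented here.

```python
"""
Minimal sketch of a contextual-bandit policy orchestrator.
NOT the paper's implementation: the LinUCB update, the stub policies,
and the feature function are assumptions made purely for illustration.
"""
import numpy as np


class LinUCBOrchestrator:
    """Chooses one of two sub-policies per time step from a context vector."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        # One ridge-regression model (A, b) per arm, as in standard LinUCB.
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, context: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # Upper confidence bound on this arm's payoff for the given context.
            ucb = theta @ context + self.alpha * np.sqrt(context @ A_inv @ context)
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, arm: int, context: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context


# --- Stub sub-policies and environment hooks (assumptions, not from the paper) ---
def reward_policy_action(state):
    return "move_toward_food"          # placeholder for the RL (reward-maximizing) policy


def constraint_policy_action(state):
    return "respect_demonstrated_constraint"  # placeholder for the IRL-derived policy


def featurize(state) -> np.ndarray:
    return np.ones(4)                  # placeholder context features


if __name__ == "__main__":
    orchestrator = LinUCBOrchestrator(n_arms=2, dim=4)
    state = {"demo": True}
    for t in range(5):
        ctx = featurize(state)
        arm = orchestrator.select(ctx)  # 0 = reward policy, 1 = constraint policy
        action = (reward_policy_action if arm == 0 else constraint_policy_action)(state)
        reward = np.random.rand()       # stand-in for the environment's feedback
        orchestrator.update(arm, ctx, reward)
        print(f"t={t}: arm={arm}, action={action}, reward={reward:.2f}")
```

LinUCB is used here only because it is a standard contextual-bandit baseline; the same two-arm interface would accommodate Thompson sampling or epsilon-greedy, and the transparency property described in the abstract corresponds to logging which arm was selected at each step.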