Adaptive Agents and Multi-Agent Systems: Latest Literature, Page 8

Learning from Multiple Independent Advisors in Multi-agent Reinforcement Learning

Adaptive Agents and Multi-Agent Systems

Pub Date : 2023-01-26 DOI: 10.48550/arXiv.2301.11153

Sriram Ganapathi Subramanian, Matthew E. Taylor, K. Larson, Mark Crowley

Multi-agent reinforcement learning typically suffers from sample inefficiency: learning suitable policies requires many data samples. Learning from external demonstrators is one way to mitigate this problem. However, most prior approaches in this area assume the presence of a single demonstrator. Leveraging multiple knowledge sources (i.e., advisors) with expertise in distinct aspects of the environment could substantially speed up learning in complex environments. This paper considers the problem of simultaneously learning from multiple independent advisors in multi-agent reinforcement learning. The approach leverages a two-level Q-learning architecture, and extends this framework from single-agent to multi-agent settings. We provide principled algorithms that incorporate a set of advisors by both evaluating the advisors at each state and subsequently using the advisors to guide action selection. We also provide theoretical convergence and sample complexity guarantees. Experimentally, we validate our approach in three different test-beds and show that our algorithms outperform baselines, can effectively integrate the combined expertise of different advisors, and learn to ignore bad advice.
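To make the idea of "evaluating the advisors at each state and then using them to guide action selection" concrete, the following is a minimal single-agent, tabular sketch. It is not the paper's algorithm: the toy one-state task, the two constant advisors, and the names `choose_advisor`, `td_update`, and `run_demo` are all assumptions made for illustration. A high-level table scores advisors, a low-level table scores actions, and both are updated with ordinary Q-learning targets.

```python
import random
from collections import defaultdict


def choose_advisor(high_q, state, n_advisors, eps, rng):
    """Epsilon-greedy choice over advisors using the high-level table."""
    if rng.random() < eps:
        return rng.randrange(n_advisors)
    return max(range(n_advisors), key=lambda k: high_q[(state, k)])


def td_update(table, key, reward, next_values, alpha, gamma):
    """One tabular Q-learning update toward reward + gamma * max(next_values)."""
    target = reward + gamma * max(next_values)
    table[key] += alpha * (target - table[key])


def run_demo(steps=500, eps=0.2, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    actions = [0, 1]
    # Hypothetical advisors: advisor 0 always gives bad advice, advisor 1 good advice.
    advisors = [lambda s: 0, lambda s: 1]
    high_q = defaultdict(float)  # (state, advisor index) -> value of following that advisor
    low_q = defaultdict(float)   # (state, action) -> ordinary action value
    state = 0                    # toy task with a single state
    for _ in range(steps):
        k = choose_advisor(high_q, state, len(advisors), eps, rng)
        action = advisors[k](state)
        reward = 1.0 if action == 1 else 0.0  # only action 1 is rewarded
        next_state = 0
        td_update(high_q, (state, k), reward,
                  [high_q[(next_state, j)] for j in range(len(advisors))],
                  alpha, gamma)
        td_update(low_q, (state, action), reward,
                  [low_q[(next_state, a)] for a in actions],
                  alpha, gamma)
        state = next_state
    return high_q, low_q


if __name__ == "__main__":
    high_q, low_q = run_demo()
    print("advisor values:", high_q[(0, 0)], high_q[(0, 1)])
```

In this toy setting the high-level value of the bad advisor stays below that of the good one, which is one simple way an agent could "learn to ignore bad advice".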

Citations: 0

HoLA Robots: Mitigating Plan-Deviation Attacks in Multi-Robot Systems with Co-Observations and Horizon-Limiting Announcements

Adaptive Agents and Multi-Agent Systems

Pub Date : 2023-01-25 DOI: 10.48550/arXiv.2301.10704

Kacper Wardega, Max von Hippel, Roberto Tron, C. Nita-Rotaru, Wenchao Li

Emerging multi-robot systems rely on cooperation between humans and robots, with robots following automatically generated motion plans to service application-level tasks. Given the safety requirements associated with operating in proximity to humans and expensive infrastructure, it is important to understand and mitigate the security vulnerabilities of such systems caused by compromised robots that diverge from their assigned plans. We focus on centralized systems, where a *central entity* (CE) is responsible for determining and transmitting the motion plans to the robots, which report their location as they move following the plan. The CE checks that robots follow their assigned plans by comparing their expected location to the location they self-report. We show that this self-reporting monitoring mechanism is vulnerable to *plan-deviation attacks*, in which compromised robots do not follow their assigned plans and try to conceal their movement by misreporting their location. We propose a two-pronged mitigation for plan-deviation attacks: (1) an attack detection technique leveraging both the robots' local sensing capabilities to report observations of other robots and *co-observation schedules* generated by the CE, and (2) a prevention technique where the CE issues *horizon-limiting announcements* to the robots, reducing their instantaneous knowledge of forward lookahead steps in the global motion plan. On a large-scale automated warehouse benchmark, we show that our solution provides attack prevention guarantees against a stealthy attacker that has compromised multiple robots.
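The three mechanisms named in this abstract can be illustrated with a small sketch. Everything here is an assumption for illustration only: the grid-cell plan representation, the function names, and the premise that the observing robot is honest. It is not the authors' implementation; it only shows why a self-report check alone can be fooled, how a scheduled co-observation can expose the deviation, and what limiting the announced horizon means.

```python
def self_report_ok(plan, robot, t, reported_pos):
    """Self-report check: the CE compares the reported cell with the planned one."""
    return plan[robot][t] == reported_pos


def co_observation_ok(plan, observed_robot, t, sighting):
    """Co-observation check: an (assumed honest) observer reports where, if anywhere,
    it actually saw `observed_robot` at time t; None means it saw nothing."""
    return sighting == plan[observed_robot][t]


def announce(plan, robot, t, horizon):
    """Horizon-limiting announcement: reveal only the next `horizon` plan steps."""
    return plan[robot][t:t + horizon]


if __name__ == "__main__":
    plan = {"a": [(0, 0), (0, 1), (0, 2)]}
    true_path = [(0, 0), (1, 0), (2, 0)]   # robot "a" deviates from its plan
    lie = plan["a"][1]                     # ...but reports the planned cell
    print(self_report_ok(plan, "a", 1, lie))        # the lie passes the self-report check
    print(co_observation_ok(plan, "a", 1, None))    # observer did not see "a" where planned
    print(announce(plan, "a", 0, 2))
```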

Citations: 0

Asymptotic Convergence and Performance of Multi-Agent Q-Learning Dynamics

Adaptive Agents and Multi-Agent Systems

Pub Date : 2023-01-23 DOI: 10.48550/arXiv.2301.09619

Aamal Hussain, F. Belardinelli, G. Piliouras

Achieving convergence of multiple learning agents in general $N$-player games is imperative for the development of safe and reliable machine learning (ML) algorithms and their application to autonomous systems. Yet it is known that, outside the bounds of simple two-player games, convergence cannot be taken for granted. To make progress in resolving this problem, we study the dynamics of smooth Q-Learning, a popular reinforcement learning algorithm which quantifies the tendency for learning agents to explore their state space or exploit their payoffs. We show a sufficient condition on the rate of exploration such that the Q-Learning dynamics are guaranteed to converge to a unique equilibrium in any game. We connect this result to games for which Q-Learning is known to converge with arbitrary exploration rates, including weighted potential games and weighted zero-sum polymatrix games. Finally, we examine the performance of the Q-Learning dynamics as measured by the Time Averaged Social Welfare, and compare this with the Social Welfare achieved by the equilibrium. We provide a sufficient condition whereby the Q-Learning dynamics will outperform the equilibrium even if the dynamics do not converge.
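As a rough illustration of smooth (Boltzmann) Q-learning in a stateless two-player game, the sketch below iterates an expected, noise-free update in which each player's policy is a softmax of its Q-values and the temperature plays the role of the exploration rate. The discrete-time expected update, the parameter values, and the function names are assumptions for this example; the paper's analysis may use a different formulation.

```python
import math


def softmax(q, temperature):
    """Boltzmann policy: higher temperature means more exploration."""
    m = max(q)
    w = [math.exp((v - m) / temperature) for v in q]
    s = sum(w)
    return [v / s for v in w]


def smooth_q_step(q1, q2, payoff1, payoff2, temperature, alpha):
    """One expected update. payoffK[a][b] is player K's payoff when
    player 1 plays a and player 2 plays b."""
    x1, x2 = softmax(q1, temperature), softmax(q2, temperature)
    u1 = [sum(payoff1[a][b] * x2[b] for b in range(len(x2))) for a in range(len(x1))]
    u2 = [sum(payoff2[a][b] * x1[a] for a in range(len(x1))) for b in range(len(x2))]
    new_q1 = [q + alpha * (u - q) for q, u in zip(q1, u1)]
    new_q2 = [q + alpha * (u - q) for q, u in zip(q2, u2)]
    return new_q1, new_q2


def run(payoff1, payoff2, temperature=1.0, alpha=0.1, steps=2000):
    q1 = [1.0] + [0.0] * (len(payoff1) - 1)   # start away from the fixed point
    q2 = [0.0] * len(payoff1[0])
    for _ in range(steps):
        q1, q2 = smooth_q_step(q1, q2, payoff1, payoff2, temperature, alpha)
    return softmax(q1, temperature), softmax(q2, temperature)


if __name__ == "__main__":
    # Matching pennies, a two-player zero-sum game.
    p1 = [[1.0, -1.0], [-1.0, 1.0]]
    p2 = [[-1.0, 1.0], [1.0, -1.0]]
    print(run(p1, p2))
```

With these settings the policies settle near the uniform strategy in matching pennies; lowering the temperature weakens the damping that exploration provides.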

Citations: 4

Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization

Adaptive Agents and Multi-Agent Systems

Pub Date : 2023-01-18 DOI: 10.48550/arXiv.2301.07784

L. N. Alegre, A. Bazzan, Diederik M. Roijers, Ann Nowé, Bruno C. da Silva

Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $\epsilon$-optimal solution (for a bounded $\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state and action spaces.
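For readers unfamiliar with Generalized Policy Improvement, the standard GPI action rule under a linear preference can be sketched in a few lines: given vector-valued Q-estimates for several already-learned policies, pick the action whose best scalarized value across policies is highest. The data layout and the name `gpi_action` are assumptions for this example, and the paper's prioritization schemes built on top of GPI are not reproduced here.

```python
def gpi_action(q_values, weights):
    """GPI action for one state.

    q_values[i][a] is the vector of per-objective Q-estimates of policy i
    for action a; weights is the preference vector used for scalarization.
    Returns argmax_a max_i (weights . q_values[i][a]).
    """
    def scalarize(vec):
        return sum(w * v for w, v in zip(weights, vec))

    n_actions = len(q_values[0])
    best_per_action = [
        max(scalarize(q_pi[a]) for q_pi in q_values) for a in range(n_actions)
    ]
    return max(range(n_actions), key=best_per_action.__getitem__)


if __name__ == "__main__":
    # Two policies, two actions, two objectives (illustrative numbers only).
    q_values = [
        [[1.0, 0.0], [0.2, 0.2]],   # policy 0, tuned for objective 0
        [[0.1, 0.1], [0.0, 1.0]],   # policy 1, tuned for objective 1
    ]
    print(gpi_action(q_values, [1.0, 0.0]))
    print(gpi_action(q_values, [0.0, 1.0]))
```

The same set of policies yields different actions for different preference vectors, which is what lets a learned policy set be reused for novel preferences.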

Citations: 3

Byzantine Resilience at Swarm Scale: A Decentralized Blocklist Protocol from Inter-robot Accusations

Adaptive Agents and Multi-Agent Systems

Pub Date : 2023-01-17 DOI: 10.48550/arXiv.2301.06977

Kacper Wardega, Max von Hippel, Roberto Tron, C. Nita-Rotaru, Wenchao Li

The Weighted-Mean Subsequence Reduced (W-MSR) algorithm, the state-of-the-art method for Byzantine-resilient design of decentralized multi-robot systems, is based on discarding outliers received over Linear Consensus Protocol (LCP). Although W-MSR provides well-understood theoretical guarantees relating robust network connectivity to the convergence of the underlying consensus, the method comes with several limitations preventing its use at scale: (1) the number of Byzantine robots, F, to tolerate should be known a priori, (2) the requirement that each robot maintains 2F+1 neighbors is impractical for large F, (3) information propagation is hindered by the requirement that F+1 robots independently make local measurements of the consensus property in order for the swarm's decision to change, and (4) W-MSR is specific to LCP and does not generalize to applications not implemented over LCP. In this work, we propose a Decentralized Blocklist Protocol (DBP) based on inter-robot accusations. Accusations are made on the basis of locally-made observations of misbehavior and, once shared by cooperative robots across the network, are used as input to a graph matching algorithm that computes a blocklist. DBP generalizes to applications not implemented via LCP, is adaptive to the number of Byzantine robots, and allows for fast information propagation through the multi-robot system while simultaneously reducing the required network connectivity relative to W-MSR. On LCP-type applications, DBP reduces the worst-case connectivity requirement of W-MSR from (2F+1)-connected to (F+1)-connected and the number of cooperative observers required to propagate new information from F+1 to just 1 observer. We demonstrate empirically that our approach to Byzantine resilience scales to hundreds of robots on cooperative target tracking, time synchronization, and localization case studies.
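The outlier-discarding step of the W-MSR baseline described here can be sketched as follows. This is the textbook W-MSR update with equal weights, shown only to clarify the baseline; it is not the proposed DBP, and the function name `wmsr_step` and the equal-weight average are assumptions for this example.

```python
def wmsr_step(own, neighbor_values, f):
    """One W-MSR update for a single robot.

    Among neighbor values strictly greater than `own`, drop the f largest
    (or all of them if there are fewer than f); do the same for values
    strictly smaller than `own`. Then average what is left with `own`.
    """
    higher = sorted(v for v in neighbor_values if v > own)
    lower = sorted(v for v in neighbor_values if v < own)
    equal = [v for v in neighbor_values if v == own]
    kept_higher = higher[:max(len(higher) - f, 0)]  # drop the f largest
    kept_lower = lower[f:]                          # drop the f smallest
    kept = [own] + equal + kept_lower + kept_higher
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Values 100 and -50 stand in for Byzantine neighbors (illustrative only).
    print(wmsr_step(5.0, [4.0, 5.0, 6.0, 100.0, -50.0], f=1))
```

The need to know `f` in advance, and to have enough neighbors that discarding up to `2 * f` values still leaves useful information, corresponds to limitations (1) and (2) listed in the abstract.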

Citations: 1

PECAN: Leveraging Policy Ensemble for Context-Aware Zero-Shot Human-AI Coordination

Adaptive Agents and Multi-Agent Systems

Pub Date : 2023-01-16 DOI: 10.48550/arXiv.2301.06387

Xingzhou Lou, Jiaxian Guo, Junge Zhang, Jun Wang, Kaiqi Huang, Yali Du

Zero-shot human-AI coordination holds the promise of collaborating with humans without human data. Prevailing methods try to train the ego agent with a population of partners via self-play. However, these methods suffer from two problems: 1) The diversity of a population with finite partners is limited, thereby limiting the capacity of the trained ego agent to collaborate with a novel human; 2) Current methods only provide a common best response for every partner in the population, which may result in poor zero-shot coordination performance with a novel partner or humans. To address these issues, we first propose the policy ensemble method to increase the diversity of partners in the population, and then develop a context-aware method enabling the ego agent to analyze and identify the partner's potential policy primitives so that it can take different actions accordingly. In this way, the ego agent is able to learn more universal cooperative behaviors for collaborating with diverse partners. We conduct experiments in the Overcooked environment, and evaluate the zero-shot human-AI coordination performance of our method with both behavior-cloned human proxies and real humans. The results demonstrate that our method significantly increases the diversity of partners and enables ego agents to learn more diverse behaviors than baselines, thus achieving state-of-the-art performance in all scenarios. We also open-source a human-AI coordination study framework for Overcooked for the convenience of future studies.

Citations: 4

Decentralized model-free reinforcement learning in stochastic games with average-reward objective

Adaptive Agents and Multi-Agent Systems

Pub Date : 2023-01-13 DOI: 10.48550/arXiv.2301.05630

Romain Cravic, Nicolas Gast, B. Gaujal

We propose the first model-free algorithm that achieves low regret for decentralized learning in two-player zero-sum tabular stochastic games with an infinite-horizon average-reward objective. In decentralized learning, the learning agent controls only one player and tries to achieve low regret against an arbitrary opponent. This contrasts with centralized learning, where the agent tries to approximate the Nash equilibrium by controlling both players. In our infinite-horizon undiscounted setting, additional structural assumptions are needed to ensure that the learning process behaves well: here we assume that, for every strategy of the opponent, the agent has a way to go from any state to any other. This assumption is analogous to the "communicating" assumption in the MDP setting. We show that our Decentralized Optimistic Nash Q-Learning (DONQ-learning) algorithm achieves both sublinear high-probability regret of order $T^{3/4}$ and sublinear expected regret of order $T^{2/3}$. Moreover, our algorithm enjoys low computational complexity and a low memory requirement compared to the previous works of (Wei et al. 2017) and (Jafarnia-Jahromi et al. 2021) in the same setting.
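To see why regret of these orders counts as "sublinear", it helps to look at the average regret per step. The short calculation below ignores constants and logarithmic factors, which the abstract does not state, so the numbers are only indicative of the scaling; $R_T$ is notation introduced here for the regret after $T$ steps.

```latex
\frac{R_T}{T} = O\!\left(\frac{T^{3/4}}{T}\right) = O\!\left(T^{-1/4}\right) \xrightarrow[T \to \infty]{} 0
\quad \text{(high probability)},
\qquad
\frac{\mathbb{E}[R_T]}{T} = O\!\left(T^{-1/3}\right) \xrightarrow[T \to \infty]{} 0
\quad \text{(expectation)}.

% Example with T = 10^4 and all constants set to 1:
%   T^{3/4} = 10^{3} = 1000              => average regret per step = 0.1
%   T^{2/3} = 10^{8/3} \approx 464.2     => average regret per step \approx 0.046
```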

Citations: 0

TransfQMix: Transformers for Leveraging the Graph Structure of Multi-Agent Reinforcement Learning Problems

Adaptive Agents and Multi-Agent Systems

Pub Date : 2023-01-13 DOI: 10.48550/arXiv.2301.05334

Matteo Gallici, Mario Martín, I. Masmitja

Coordination is one of the most difficult aspects of multi-agent reinforcement learning (MARL). One reason is that agents normally choose their actions independently of one another. To see coordination strategies emerge from the combination of independent policies, recent research has focused on the use of a centralized function (CF) that learns each agent's contribution to the team reward. However, the structure in which the environment is presented to the agents and to the CF is typically overlooked. We have observed that the features used to describe the coordination problem can be represented as vertex features of a latent graph structure. Here, we present TransfQMix, a new approach that uses transformers to leverage this latent structure and learn better coordination policies. Our transformer agents perform graph reasoning over the state of the observable entities. Our transformer Q-mixer learns a monotonic mixing function from a larger graph that includes the internal and external states of the agents. TransfQMix is designed to be entirely transferable, meaning that the same parameters can be used to control and train larger or smaller teams of agents. This makes it possible to deploy promising approaches for saving training time and deriving general policies in MARL, such as transfer learning, zero-shot transfer, and curriculum learning. We report TransfQMix's performance in the Spread and StarCraft II environments. In both settings, it outperforms state-of-the-art Q-Learning models, and it demonstrates effectiveness in solving problems that other methods cannot solve.
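
The "monotonic mixing function" mentioned in this abstract can be illustrated with a minimal sketch. It is not the TransfQMix mixer: the abstract does not say how the weights are produced, so the function name `monotonic_mix` and the fixed `raw_weights` below are hypothetical stand-ins for whatever the transformer would output. The sketch only shows the constraint itself, using the QMIX-style trick of taking absolute values so that every mixing weight is non-negative.

```python
from typing import Sequence


def monotonic_mix(agent_qs: Sequence[float],
                  raw_weights: Sequence[float],
                  bias: float = 0.0) -> float:
    """Combine per-agent Q-values into one team value (hypothetical helper).

    The result is monotonic in every agent's Q-value because each
    effective weight abs(w) is non-negative.
    """
    if len(agent_qs) != len(raw_weights):
        raise ValueError("need exactly one weight per agent")
    # abs() removes the sign of each raw weight, so raising any single
    # agent's Q-value can never lower the mixed total.
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias


# Illustrative values only; in a learned mixer the weights would be
# produced by a network conditioned on the state.
q_total = monotonic_mix([1.0, 2.0], [0.5, -0.25])
```

Because the weights are non-negative, each agent can greedily maximize its own Q-value without lowering the team value, which is the property a monotonic mixer is meant to preserve.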

Citations: 1

Learning to Perceive in Deep Model-Free Reinforcement Learning

Adaptive Agents and Multi-Agent Systems

Pub Date : 2023-01-10 DOI: 10.48550/arXiv.2301.03730

Gonçalo Querido, Alberto Sardinha, Francisco S. Melo

This work proposes a novel model-free Reinforcement Learning (RL) agent that is able to learn how to complete an unknown task while having access to only part of the input observation. We take inspiration from the concepts of visual attention and active perception that are characteristic of humans and apply them to our agent, creating a hard attention mechanism. In this mechanism, the model first decides which region of the input image it should look at, and only then does it have access to the pixels of that region. Current RL agents do not follow this principle, and we have not seen these mechanisms applied for the same purpose as in this work. In our architecture, we adapt an existing model called the recurrent attention model (RAM) and combine it with the proximal policy optimization (PPO) algorithm. We investigate whether a model with these characteristics is capable of achieving performance similar to that of state-of-the-art model-free RL agents that access the full input observation. This analysis is carried out in two Atari games, Pong and SpaceInvaders, which have a discrete action space, and in CarRacing, which has a continuous action space. Besides assessing its performance, we also analyze the movement of our model's attention and compare it with an example of human behavior. Even with such a visual limitation, we show that our model matches the performance of PPO+LSTM in two of the three games tested.
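
The hard attention step described in this abstract (choose a region first, then see only its pixels) can be sketched as a simple crop. This is an illustration under stated assumptions, not the paper's architecture: the helper name `extract_glimpse`, the square patch shape, and the zero padding outside the frame are all choices made here for the example, and the policy that picks `center` is left out entirely.

```python
from typing import List, Tuple

Image = List[List[int]]


def extract_glimpse(image: Image, center: Tuple[int, int], size: int) -> Image:
    """Return the size x size patch around center = (row, col).

    Hypothetical helper; size is assumed to be odd so the patch is
    centred exactly. Pixels that fall outside the image read as 0.
    """
    half = size // 2
    row0, col0 = center[0] - half, center[1] - half
    height, width = len(image), len(image[0])
    patch = []
    for r in range(row0, row0 + size):
        row = []
        for c in range(col0, col0 + size):
            inside = 0 <= r < height and 0 <= c < width
            row.append(image[r][c] if inside else 0)
        patch.append(row)
    return patch


# A toy 5x5 "frame" whose pixel value encodes its position (10*row + col).
frame = [[10 * r + c for c in range(5)] for r in range(5)]
glimpse = extract_glimpse(frame, center=(2, 2), size=3)
```

In a full agent, only `glimpse` (not `frame`) would be passed to the network, and the next `center` would be one of the agent's own outputs.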

Citations: 0

Asynchronous Multi-Agent Reinforcement Learning for Efficient Real-Time Multi-Robot Cooperative Exploration

Adaptive Agents and Multi-Agent Systems

Pub Date : 2023-01-09 DOI: 10.48550/arXiv.2301.03398

Chao Yu, Xinyi Yang, Jiaxuan Gao, Jiayu Chen, Yunfei Li, Jijia Liu, Yunfei Xiang, Rui Huang, Huazhong Yang, Yi Wu, Yu Wang

We consider the problem of cooperative exploration, where multiple robots need to cooperatively explore an unknown region as fast as possible. Multi-agent reinforcement learning (MARL) has recently become a trending paradigm for solving this challenge. However, existing MARL-based methods adopt action-making steps as the metric for exploration efficiency by assuming that all the agents act in a fully synchronous manner: i.e., every single agent produces an action simultaneously and every single action is executed instantaneously at each time step. Despite its mathematical simplicity, such a synchronous MARL formulation can be problematic for real-world robotic applications. It is typical for different robots to take slightly different wall-clock times to accomplish an atomic action, or even to be periodically lost due to hardware issues. Simply waiting for every robot to be ready for the next action can be particularly time-inefficient. Therefore, we propose an asynchronous MARL solution, Asynchronous Coordination Explorer (ACE), to tackle this real-world challenge. We first extend a classical MARL algorithm, multi-agent PPO (MAPPO), to the asynchronous setting and additionally apply action-delay randomization so that the learned policy generalizes better to varying action delays in the real world. Moreover, each navigation agent is represented as a team-size-invariant CNN-based policy, which greatly benefits real-robot deployment by handling possible robot loss and allows bandwidth-efficient intra-agent communication through low-dimensional CNN features. We first validate our approach in a grid-based scenario. Both simulation and real-robot results show that ACE reduces actual exploration time by over 10% compared with classical approaches. We also apply our framework to a high-fidelity visual-based environment, Habitat, achieving a 28% improvement in exploration efficiency.
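
The cost of waiting for the slowest robot, which motivates the asynchronous formulation in this abstract, can be shown with a small timing sketch. It is a toy model rather than ACE itself: the function names, the fixed per-action durations, and the uniform jitter used for delay randomization are all assumptions made for illustration, and the model ignores any coordination between robots.

```python
import random
from typing import List


def synchronous_time(durations: List[List[float]]) -> float:
    """durations[robot][step]: each step lasts as long as the slowest robot."""
    return sum(max(step) for step in zip(*durations))


def asynchronous_time(durations: List[List[float]]) -> float:
    """Each robot starts its next action as soon as its previous one ends."""
    return max(sum(robot) for robot in durations)


def randomized_delays(base: float, jitter: float, steps: int,
                      rng: random.Random) -> List[float]:
    """Sample per-action durations in [base, base + jitter] (assumed uniform)."""
    return [base + rng.uniform(0.0, jitter) for _ in range(steps)]


# Two robots, three actions each, with hand-picked durations in seconds.
durations = [[1.0, 3.0, 1.0], [2.0, 1.0, 2.0]]
sync_total = synchronous_time(durations)    # 2.0 + 3.0 + 2.0
async_total = asynchronous_time(durations)  # max(5.0, 5.0)

# Delay randomization: a seeded generator keeps training runs reproducible.
sampled = randomized_delays(base=1.0, jitter=0.5, steps=4, rng=random.Random(0))
```

In this toy model the asynchronous total can never exceed the synchronous one, since the sum of per-step maxima is at least each robot's own sum.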

Citations: 6