
Latest publications in Adaptive Agents and Multi-Agent Systems

Temporally Layered Architecture for Adaptive, Distributed and Continuous Control
Pub Date : 2022-12-25 DOI: 10.48550/arXiv.2301.00723
Devdhar Patel, Joshua Russell, Frances Walsh, T. Rahman, Terrance Sejnowski, H. Siegelmann
We present temporally layered architecture (TLA), a biologically inspired system for temporally adaptive distributed control. TLA layers a fast and a slow controller together to achieve temporal abstraction that allows each layer to focus on a different time-scale. Our design is biologically inspired and draws on the architecture of the human brain, which executes actions at different timescales depending on the environment's demands. Such distributed control design is widespread across biological systems because it increases survivability and accuracy in certain and uncertain environments. We demonstrate that TLA can provide many advantages over existing approaches, including persistent exploration, adaptive control, explainable temporal behavior, compute efficiency and distributed control. We present two different algorithms for training TLA: (a) closed-loop control, where the fast controller is trained over a pre-trained slow controller, allowing better exploration for the fast controller and closed-loop control in which the fast controller decides whether to "act-or-not" at each timestep; and (b) partially open-loop control, where the slow controller is trained over a pre-trained fast controller, allowing for open-loop control in which the slow controller picks a temporally extended action or defers the next n actions to the fast controller. We evaluate our method on a suite of continuous control tasks and demonstrate the advantages of TLA over several strong baselines.
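As a rough illustration of the act-or-not idea, the sketch below layers a slow, open-loop action under a fast controller that decides per timestep whether to intervene. The gate, policies, and values are our own toy choices for illustration, not the paper's trained networks.

```python
def tla_step(state, slow_action, fast_policy, gate):
    """One step of a two-layer, act-or-not controller (illustrative sketch).

    The slow controller has proposed `slow_action`; the fast controller
    decides at each timestep whether to override it ("act") or let it
    persist ("not act").
    """
    if gate(state):              # fast layer chooses to act
        return fast_policy(state)
    return slow_action           # defer to the slow layer's action

# Toy example: the fast layer intervenes only when the state is large.
gate = lambda s: abs(s) > 1.0
fast_policy = lambda s: -0.5 * s   # corrective feedback action
slow_action = 0.1                  # open-loop action from the slow layer

print(tla_step(0.2, slow_action, fast_policy, gate))  # -> 0.1 (defer)
print(tla_step(2.0, slow_action, fast_policy, gate))  # -> -1.0 (act)
```

In the closed-loop variant described above, the gate itself would be learned over the pre-trained slow controller rather than hand-coded as here.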
Citations: 0
A Comparison of New Swarm Task Allocation Algorithms in Unknown Environments with Varying Task Density
Pub Date : 2022-12-01 DOI: 10.48550/arXiv.2212.00844
Grace Cai, Noble Harasha, N. Lynch
Task allocation is an important problem for robot swarms to solve, allowing agents to reduce task completion time by performing tasks in a distributed fashion. Existing task allocation algorithms often assume prior knowledge of task location and demand or fail to consider the effects of the geometric distribution of tasks on the completion time and communication cost of the algorithms. In this paper, we examine an environment where agents must explore and discover tasks with positive demand and successfully assign themselves to complete all such tasks. We first provide a new discrete general model for modeling swarms. Operating within this theoretical framework, we propose two new task allocation algorithms for initially unknown environments -- one based on N-site selection and the other on virtual pheromones. We analyze each algorithm separately and also evaluate the effectiveness of the two algorithms in dense vs. sparse task distributions. Compared to the Levy walk, which has been theorized to be optimal for foraging, our virtual pheromone inspired algorithm is much faster in sparse to medium task densities but is communication and agent intensive. Our site selection inspired algorithm also outperforms Levy walk in sparse task densities and is a less resource-intensive option than our virtual pheromone algorithm for this case. Because the performance of both algorithms relative to random walk is dependent on task density, our results shed light on how task density is important in choosing a task allocation algorithm in initially unknown environments.
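The two mechanisms in play can each be sketched in a few lines. The power-law step sampler and the deposit/decay bookkeeping below are generic illustrations of a Levy walk and of virtual pheromones respectively; parameter names and values are our own choices, not the paper's.

```python
import random

def levy_step(alpha=1.5, min_step=1.0):
    """Sample a heavy-tailed step length (inverse-CDF of a Pareto law),
    the kind of step distribution a Levy-walk forager uses."""
    u = random.random()  # u in [0, 1)
    return min_step * (1.0 - u) ** (-1.0 / alpha)

def deposit(pheromone, cell, amount=1.0):
    """Add virtual pheromone to a grid cell."""
    pheromone[cell] = pheromone.get(cell, 0.0) + amount

def decay(pheromone, rate=0.1):
    """Evaporate all pheromone by a fixed rate each round."""
    for cell in pheromone:
        pheromone[cell] *= (1.0 - rate)

pheromone = {}
deposit(pheromone, (0, 0))
deposit(pheromone, (0, 0))
decay(pheromone)
print(pheromone[(0, 0)])  # -> 1.8
```

The pheromone trail is what makes the approach communication- and agent-intensive: agents must read and write shared cell values, whereas a Levy walker needs no shared state at all.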
Citations: 1
On Regret-optimal Cooperative Nonstochastic Multi-armed Bandits
Pub Date : 2022-11-30 DOI: 10.48550/arXiv.2211.17154
Jialin Yi, M. Vojnović
We consider the nonstochastic multi-agent multi-armed bandit problem with agents collaborating via a communication network with delays. We show a lower bound on the individual regret of all agents. We show that with suitable regularizers and communication protocols, a collaborative multi-agent follow-the-regularized-leader (FTRL) algorithm has an individual regret upper bound that matches the lower bound up to a constant factor when the number of arms is large enough relative to the degrees of agents in the communication graph. We also show that an FTRL algorithm with a suitable regularizer is regret optimal with respect to the scaling with the edge-delay parameter. We present numerical experiments validating our theoretical results and demonstrate cases in which our algorithms outperform previously proposed algorithms.
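For intuition, FTRL with an entropic regularizer reduces to exponential weights over cumulative losses. The snippet below shows that generic single-agent update only, not the paper's delay-aware cooperative variant; the function name and learning rate are ours.

```python
import math

def ftrl_exp_weights(cum_losses, eta):
    """FTRL with an entropic regularizer: play arm i with probability
    proportional to exp(-eta * L_i), where L_i is arm i's cumulative loss."""
    w = [math.exp(-eta * L) for L in cum_losses]
    z = sum(w)
    return [x / z for x in w]

p = ftrl_exp_weights([0.0, 1.0, 2.0], eta=1.0)
# the arm with the smallest cumulative loss gets the largest probability
print([round(x, 3) for x in p])
```

In the cooperative setting, each agent would form `cum_losses` from its own observations plus delayed loss estimates relayed by its neighbors in the communication graph.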
Citations: 2
Welfare and Fairness in Multi-objective Reinforcement Learning
Pub Date : 2022-11-30 DOI: 10.48550/arXiv.2212.01382
Zimeng Fan, Nianli Peng, Muhang Tian, Brandon Fain
We study fair multi-objective reinforcement learning, in which an agent must learn a policy that simultaneously achieves high reward on multiple dimensions of a vector-valued reward. Motivated by the fair resource allocation literature, we model this as an expected welfare maximization problem for some non-linear fair welfare function of the vector of long-term cumulative rewards. One canonical example of such a function is the Nash Social Welfare, or geometric mean, whose log transform is also known as the Proportional Fairness objective. We show that optimizing the expected Nash Social Welfare even approximately is computationally intractable, even in the tabular case. Nevertheless, we provide a novel adaptation of Q-learning that combines non-linear scalarized learning updates and non-stationary action selection to learn effective policies for optimizing nonlinear welfare functions. We show that our algorithm is provably convergent, and we demonstrate experimentally that our approach outperforms techniques based on linear scalarization, mixtures of optimal linear scalarizations, or stationary action selection for the Nash Social Welfare objective.
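The welfare objectives named above are easy to state concretely. The functions below compute the Nash Social Welfare (geometric mean) and its log transform, the Proportional Fairness objective, for a toy reward vector; the function names are ours.

```python
import math

def nash_social_welfare(rewards):
    """Geometric mean of a non-negative cumulative-reward vector."""
    n = len(rewards)
    return math.prod(rewards) ** (1.0 / n)

def proportional_fairness(rewards):
    """Log transform of the NSW objective: the sum of log rewards."""
    return sum(math.log(r) for r in rewards)

print(nash_social_welfare([1.0, 4.0]))  # -> 2.0
```

Note the non-linearity: unlike a linear scalarization, NSW rewards balance across dimensions, since any single near-zero component drags the whole product toward zero.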
Citations: 3
Artificial prediction markets present a novel opportunity for human-AI collaboration
Pub Date : 2022-11-29 DOI: 10.48550/arXiv.2211.16590
T. Chakravorti, Vaibhav Singh, Sarah Rajmajer, Michael Mclaughlin, Robert Fraleigh, Christopher Griffin, A. Kwasnica, David M. Pennock, C. L. Giles
Despite high-profile successes in the field of Artificial Intelligence, machine-driven technologies still suffer important limitations, particularly for complex tasks where creativity, planning, common sense, intuition, or learning from limited data is required. These limitations motivate effective methods for human-machine collaboration. Our work makes two primary contributions. We thoroughly experiment with an artificial prediction market model to understand the effects of market parameters on model performance for benchmark classification tasks. We then demonstrate, through simulation, the impact of exogenous agents in the market, where these exogenous agents represent primitive human behaviors. This work lays the foundation for a novel set of hybrid human-AI machine learning algorithms.
Citations: 2
Fair Allocation of Two Types of Chores
Pub Date : 2022-11-02 DOI: 10.48550/arXiv.2211.00879
H. Aziz, Jeremy Lindsay, Angus Ritossa, Mashbat Suzuki
We consider the problem of fair allocation of indivisible chores under additive valuations. We assume that the chores are divided into two types, and under this scenario we present several results. Our first result is a new characterization of Pareto optimal allocations in our setting, and a polynomial-time algorithm to compute an allocation that is envy-free up to one item (EF1) and Pareto optimal. We then turn to the question of whether we can achieve a stronger fairness concept called envy-free up to any item (EFX). We present a polynomial-time algorithm that returns an EFX allocation. Finally, we show that for our setting, it can be checked in polynomial time whether an envy-free allocation exists.
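The EF1 criterion for chores can be checked directly: for every pair of agents, removing some single chore from the envious agent's own bundle should eliminate the envy. The checker below is a generic illustration under additive costs; the data layout and names are our own, and it does not implement the paper's two-type algorithm.

```python
def is_ef1_chores(bundles, cost):
    """Check EF1 for chores under additive costs.

    `bundles[i]` is agent i's list of chores; `cost[i][c]` is agent i's
    (non-negative) cost for chore c.  Allocation is EF1 if, for all i, j,
    dropping agent i's single worst chore removes i's envy toward j.
    """
    n = len(bundles)
    for i in range(n):
        my_cost = sum(cost[i][c] for c in bundles[i])
        worst = max((cost[i][c] for c in bundles[i]), default=0.0)
        for j in range(n):
            if i == j:
                continue
            their_cost = sum(cost[i][c] for c in bundles[j])
            if my_cost - worst > their_cost:
                return False
    return True

# Two agents with identical additive costs over three chores.
cost = [{0: 2.0, 1: 1.0, 2: 1.0}] * 2
print(is_ef1_chores([[0], [1, 2]], cost))      # -> True
print(is_ef1_chores([[0, 1, 2], []], cost))    # -> False
```

For EFX with chores the test tightens: removing *any* chore from the envious agent's bundle, not just the worst one, must eliminate the envy.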
Citations: 10
Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits
Pub Date : 2022-10-31 DOI: 10.48550/arXiv.2211.00112
Abheek Ghosh, Dheeraj M. Nagaraj, Manish Jain, Milind Tambe
We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions. This is a popular model for multi-agent systems with applications like multi-channel communication, monitoring and machine maintenance tasks, and healthcare. Whittle index policies, which are based on Lagrangian relaxations, are widely used in these settings due to their simplicity and near-optimality under certain conditions. In this work, we first show that Whittle index policies can fail in simple and practically relevant RMAB settings, even when the RMABs are indexable. We discuss why the optimality guarantees fail and why asymptotic optimality may not translate well to practically relevant planning horizons. We then propose an alternate planning algorithm based on the mean-field method, which can provably and efficiently obtain near-optimal policies with a large number of arms, without the stringent structural assumptions required by the Whittle index policies. This borrows ideas from existing research with some improvements: our approach is hyper-parameter free, and we provide an improved non-asymptotic analysis which has: (a) no requirement for exogenous hyper-parameters and tighter polynomial dependence on known problem parameters; (b) high probability bounds which show that the reward of the policy is reliable; and (c) matching sub-optimality lower bounds for this algorithm with respect to the number of arms, thus demonstrating the tightness of our bounds. Our extensive experimental analysis shows that the mean-field approach matches or outperforms other baselines.
Citations: 3
Imitating Opponent to Win: Adversarial Policy Imitation Learning in Two-player Competitive Games
Pub Date : 2022-10-30 DOI: 10.48550/arXiv.2210.16915
Viet The Bui, Tien Mai, T. Nguyen
Recent research on vulnerabilities of deep reinforcement learning (RL) has shown that adversarial policies adopted by an adversary agent can influence a target RL agent (the victim agent) to perform poorly in a multi-agent environment. In existing studies, adversarial policies are trained directly on experiences of interacting with the victim agent. This approach has a key shortcoming: knowledge derived from historical interactions may not generalize properly to unexplored policy regions of the victim agent, making the trained adversarial policy significantly less effective. In this work, we design a new, effective adversarial policy learning algorithm that overcomes this shortcoming. The core idea of our new algorithm is to create an imitator that mimics the victim agent's policy, while the adversarial policy is trained not only on interactions with the victim agent but also on feedback from the imitator that forecasts the victim's intention. By doing so, we can leverage the capability of imitation learning to capture the underlying characteristics of the victim policy based only on sample trajectories of the victim. Our victim imitation learning model differs from prior models in that the environment's dynamics are driven by the adversary's policy and keep changing during adversarial policy training. We provide a provable bound that guarantees a desired imitating policy when the adversary's policy becomes stable. We further strengthen our adversarial policy learning by making our imitator a stronger version of the victim. Finally, our extensive experiments using four competitive MuJoCo game environments show that our proposed adversarial policy learning algorithm outperforms state-of-the-art algorithms.
Citations: 1
Worst-Case Adaptive Submodular Cover
Pub Date : 2022-10-25 DOI: 10.48550/arXiv.2210.13694
Jing Yuan, Shaojie Tang
In this paper, we study the adaptive submodular cover problem under the worst-case setting. This problem generalizes many previously studied problems, namely, pool-based active learning and the stochastic submodular set cover. The input of our problem is a set of items (e.g., medical tests), and each item has a random state (e.g., the outcome of a medical test), whose realization is initially unknown. One must select an item at a fixed cost in order to observe its realization. There is a utility function which maps a subset of items and their states to a non-negative real number. We aim to sequentially select a group of items to achieve a ``target value'' while minimizing the maximum cost across realizations (a.k.a. worst-case cost). To facilitate our study, we assume that the utility function is worst-case submodular, a property that is commonly found in many machine learning applications. With this assumption, we develop a tight $(\log(Q/\eta)+1)$-approximation policy, where $Q$ is the ``target value'' and $\eta$ is the smallest difference between $Q$ and any achievable utility value $\hat{Q}<Q$. We also study a worst-case maximum-coverage problem, a dual problem of the minimum-cost-cover problem, whose goal is to select a group of items to maximize its worst-case utility subject to a budget constraint. To solve this problem, we develop a $(1-1/e)/2$-approximation solution.
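To make the cover objective concrete, the sketch below runs a plain greedy cost-benefit loop toward a target value Q. It deliberately drops the paper's random item states and adaptivity, keeping only the deterministic coverage/cost trade-off; all names and data are illustrative.

```python
def greedy_cover(items, cost, utility, Q):
    """Pick items by best marginal-utility-per-cost until utility reaches Q."""
    chosen, total_cost = [], 0.0
    remaining = sorted(items)  # deterministic tie-breaking
    while utility(chosen) < Q and remaining:
        best = max(remaining,
                   key=lambda x: (utility(chosen + [x]) - utility(chosen)) / cost[x])
        chosen.append(best)
        remaining.remove(best)
        total_cost += cost[best]
    return chosen, total_cost

# Coverage utility: each item covers a subset of the ground set {1, 2, 3, 4}.
covers = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
cost = {"a": 1.0, "b": 1.0, "c": 1.0}
utility = lambda sel: len(set().union(*(covers[x] for x in sel))) if sel else 0
chosen, spent = greedy_cover(covers, cost, utility, Q=4)
print(chosen, spent)  # -> ['a', 'b', 'c'] 3.0
```

The adaptive version replaces this fixed selection with a policy: each pick reveals an item's random state, and the next pick conditions on everything observed so far.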
本文研究了最坏情况下的自适应子模覆盖问题。这个问题推广了许多先前研究过的问题,即基于池的主动学习和随机子模集覆盖。我们问题的输入是一组项目(例如,医学测试),每个项目都有一个随机状态(例如,医学测试的结果),其实现最初是未知的。一个人必须以固定的成本选择一个项目,以便观察它的实现。有一个实用函数,它将项目的子集及其状态映射为一个非负实数。我们的目标是依次选择一组项目来实现“目标值”,同时最小化跨实现的最大成本(即最坏情况成本)。为了方便我们的研究,我们假设效用函数是emph{最坏情况子模},这是许多机器学习应用中常见的属性。有了这个假设,我们开发了一个严格的$(log (Q/eta)+1)$ -近似策略,其中$Q$是“目标值”,$eta$是$Q$与任何可实现的效用值$hat{Q}
{"title":"Worst-Case Adaptive Submodular Cover","authors":"Jing Yuan, Shaojie Tang","doi":"10.48550/arXiv.2210.13694","DOIUrl":"https://doi.org/10.48550/arXiv.2210.13694","url":null,"abstract":"In this paper, we study the adaptive submodular cover problem under the worst-case setting. This problem generalizes many previously studied problems, namely, the pool-based active learning and the stochastic submodular set cover. The input of our problem is a set of items (e.g., medical tests) and each item has a random state (e.g., the outcome of a medical test), whose realization is initially unknown. One must select an item at a fixed cost in order to observe its realization. There is an utility function which maps a subset of items and their states to a non-negative real number. We aim to sequentially select a group of items to achieve a ``target value'' while minimizing the maximum cost across realizations (a.k.a. worst-case cost). To facilitate our study, we assume that the utility function is emph{worst-case submodular}, a property that is commonly found in many machine learning applications. With this assumption, we develop a tight $(log (Q/eta)+1)$-approximation policy, where $Q$ is the ``target value'' and $eta$ is the smallest difference between $Q$ and any achievable utility value $hat{Q}<Q$. We also study a worst-case maximum-coverage problem, a dual problem of the minimum-cost-cover problem, whose goal is to select a group of items to maximize its worst-case utility subject to a budget constraint. 
To solve this problem, we develop a $(1-1/e)/2$-approximation solution.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115119396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
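The greedy principle behind cost-efficient cover policies of this kind can be sketched in code: repeatedly pick the item with the best worst-case utility gain per unit cost until the target value $Q$ is reached. This is an illustrative toy, not the paper's exact policy; all names (`greedy_worst_case_cover`, `costs`, `states`) are hypothetical, and observing a realization is simulated here by assuming the worst state materializes.

```python
def greedy_worst_case_cover(items, costs, states, utility, Q):
    """Toy greedy policy for a worst-case adaptive cover problem.

    items:   list of selectable item ids
    costs:   dict item -> positive selection cost
    states:  dict item -> list of possible state realizations
    utility: function(frozenset of (item, state) pairs) -> float,
             assumed monotone and worst-case submodular
    Q:       target utility value
    """
    selected = []              # (item, observed state) pairs, in order
    remaining = set(items)
    while remaining:
        realized = frozenset(selected)
        if utility(realized) >= Q:
            break
        # Score each item by its worst-case marginal gain per unit cost.
        def score(i):
            worst_gain = min(
                utility(realized | {(i, s)}) - utility(realized)
                for s in states[i]
            )
            return worst_gain / costs[i]
        best = max(sorted(remaining), key=score)
        remaining.discard(best)
        # Pessimistic simulation: the worst state is the one observed.
        worst_state = min(states[best],
                          key=lambda s: utility(realized | {(best, s)}))
        selected.append((best, worst_state))
    return selected
```

With a deterministic coverage utility (each item covers a fixed set of elements), the policy reduces to the classic greedy set-cover heuristic, which is where the logarithmic approximation factor in the abstract comes from.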
Online matching with delays and stochastic arrival times
Pub Date : 2022-10-13 DOI: 10.48550/arXiv.2210.07018
Mathieu Mari, M. Pawlowski, Runtian Ren, P. Sankowski
This paper presents a new research direction for Min-cost Perfect Matching with Delays (MPMD), a problem introduced by Emek et al. (STOC'16). In the original version of this problem, we are given an $n$-point metric space, where requests arrive in an online fashion. The goal is to minimise the matching cost for an even number of requests. However, contrary to traditional online matching problems, a request does not have to be paired immediately upon its arrival. Instead, the decision of whether to match a request can be postponed for time $t$ at a delay cost of $t$. For this reason, the goal of MPMD is to minimise the overall sum of distance and delay costs. Interestingly, for adversarially generated requests, no online algorithm can achieve a competitive ratio better than $O(\log n / \log\log n)$ (Ashlagi et al., APPROX/RANDOM'17). Here, we consider a stochastic version of the MPMD problem in which the input requests follow a Poisson arrival process. For this setting, we show that the above lower bound can be beaten by presenting two deterministic online algorithms that are, in expectation, constant-competitive. The first is a simple greedy algorithm that matches any two requests once the sum of their delay costs exceeds their connection cost, i.e., the distance between them. The second algorithm builds on the tools used to analyse the first in order to obtain even better performance guarantees. This result is rather surprising, as the greedy approach for the adversarial model achieves a competitive ratio of $\Omega(m^{\log \frac{3}{2}+\varepsilon})$, where $m$ denotes the number of requests served (Azar et al., TOCS'20). Finally, we prove that similar results can be obtained for the general case, where the delay cost follows an arbitrary positive and non-decreasing function, as well as for the MPMD variant with penalties to clear pending requests.
Citations: 0
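The first algorithm described in the abstract (match two pending requests as soon as the sum of their accumulated delays exceeds the distance between them) is simple enough to sketch as a discrete-time simulation. This is a minimal illustration of that matching rule under a linear delay cost, not the paper's implementation; the function and parameter names are made up for this sketch.

```python
import itertools

def greedy_mpmd(arrivals, dist, horizon):
    """Discrete-time simulation of the greedy MPMD matching rule.

    arrivals: dict request_id -> integer arrival time
    dist:     function(r1, r2) -> nonnegative connection cost
    horizon:  last time step to simulate
    Returns the list of matched pairs and the total (distance + delay) cost.
    """
    pending = set()
    matches = []
    total_cost = 0.0
    for t in range(horizon + 1):
        # New requests become pending at their arrival time.
        for r, at in arrivals.items():
            if at == t:
                pending.add(r)
        # Match any pair whose combined delay has caught up with its distance.
        matched = True
        while matched:
            matched = False
            for r1, r2 in itertools.combinations(sorted(pending), 2):
                delay = (t - arrivals[r1]) + (t - arrivals[r2])
                if delay >= dist(r1, r2):
                    pending -= {r1, r2}
                    matches.append((r1, r2))
                    total_cost += dist(r1, r2) + delay
                    matched = True
                    break  # pending changed; rescan remaining pairs
    return matches, total_cost
```

For example, two simultaneous requests at distance 2 wait one time step each (combined delay 2) before being matched, for a total cost of 4: the rule deliberately trades bounded extra delay for a better partner, which is what the constant-competitive analysis under Poisson arrivals formalizes.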
Journal: Adaptive Agents and Multi-Agent Systems