
Latest articles in Adaptive Agents and Multi-Agent Systems

Toward Risk-based Optimistic Exploration for Cooperative Multi-Agent Reinforcement Learning
Pub Date : 2023-03-03 DOI: 10.48550/arXiv.2303.01768
Ji-Yun Oh, Joonkee Kim, Minchan Jeong, Se-Young Yun
The multi-agent setting is intricate and unpredictable since the behaviors of multiple agents influence one another. To address this environmental uncertainty, distributional reinforcement learning algorithms that incorporate uncertainty via distributional output have been integrated with multi-agent reinforcement learning (MARL) methods, achieving state-of-the-art performance. However, distributional MARL algorithms still rely on traditional $\epsilon$-greedy exploration, which does not take cooperative strategy into account. In this paper, we present a risk-based exploration that leads to collaboratively optimistic behavior by shifting the sampling region of the distribution. Initially, we take expectations over the upper quantiles of the state-action value distribution, yielding optimistic actions for exploration, and gradually shift the sampling region of quantiles to the full distribution for exploitation. By ensuring that each agent is exposed to the same level of risk, we can force them to take cooperatively optimistic actions. Our method, built on quantile regression with an appropriately controlled level of risk, shows remarkable performance in multi-agent settings requiring cooperative exploration.
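As a rough sketch of the sampling-region idea (hypothetical helper names, not the authors' code): exploration draws quantile fractions only from the upper tail $[1 - \text{risk}, 1]$ of the learned return distribution, and annealing `risk_level` toward 1 recovers the full distribution for exploitation:

```python
import random

def sample_quantile_fractions(n, risk_level):
    """Sample n quantile fractions from the upper tail [1 - risk_level, 1].

    risk_level = 1.0 recovers the full distribution (risk-neutral),
    while a small risk_level restricts sampling to optimistic upper quantiles.
    """
    lo = 1.0 - risk_level
    return [lo + random.random() * risk_level for _ in range(n)]

def optimistic_value(quantile_fn, n_samples, risk_level):
    """Estimate an (optimistic) action value as the mean over sampled
    upper-tail quantiles of the return distribution quantile_fn."""
    taus = sample_quantile_fractions(n_samples, risk_level)
    return sum(quantile_fn(tau) for tau in taus) / n_samples
```

Because every agent uses the same `risk_level`, all agents are exposed to the same degree of optimism, which is the mechanism the abstract credits for cooperative exploration.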
{"title":"Toward Risk-based Optimistic Exploration for Cooperative Multi-Agent Reinforcement Learning","authors":"Ji-Yun Oh, Joonkee Kim, Minchan Jeong, Se-Young Yun","doi":"10.48550/arXiv.2303.01768","DOIUrl":"https://doi.org/10.48550/arXiv.2303.01768","url":null,"abstract":"The multi-agent setting is intricate and unpredictable since the behaviors of multiple agents influence one another. To address this environmental uncertainty, distributional reinforcement learning algorithms that incorporate uncertainty via distributional output have been integrated with multi-agent reinforcement learning (MARL) methods, achieving state-of-the-art performance. However, distributional MARL algorithms still rely on the traditional $epsilon$-greedy, which does not take cooperative strategy into account. In this paper, we present a risk-based exploration that leads to collaboratively optimistic behavior by shifting the sampling region of distribution. Initially, we take expectations from the upper quantiles of state-action values for exploration, which are optimistic actions, and gradually shift the sampling region of quantiles to the full distribution for exploitation. By ensuring that each agent is exposed to the same level of risk, we can force them to take cooperatively optimistic actions. 
Our method shows remarkable performance in multi-agent settings requiring cooperative exploration based on quantile regression appropriately controlling the level of risk.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116436789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Variational Approach to Mutual Information-Based Coordination for Multi-Agent Reinforcement Learning
Pub Date : 2023-03-01 DOI: 10.48550/arXiv.2303.00451
Woojun Kim, Whiyoung Jung, Myungsik Cho, Young-Jin Sung
In this paper, we propose a new mutual information framework for multi-agent reinforcement learning to enable multiple agents to learn coordinated behaviors by regularizing the accumulated return with the simultaneous mutual information between multi-agent actions. By introducing a latent variable to induce nonzero mutual information between multi-agent actions and applying a variational bound, we derive a tractable lower bound on the considered maximum-mutual-information (MMI)-regularized objective function. The derived tractable objective can be interpreted as maximum entropy reinforcement learning combined with uncertainty reduction of other agents' actions. Applying policy iteration to maximize the derived lower bound, we propose a practical algorithm named variational maximum mutual information multi-agent actor-critic (VM3-AC), which follows centralized learning with decentralized execution. We evaluated VM3-AC on several games requiring coordination, and numerical results show that VM3-AC outperforms other MARL algorithms in multi-agent tasks requiring high-quality coordination.
{"title":"A Variational Approach to Mutual Information-Based Coordination for Multi-Agent Reinforcement Learning","authors":"Woojun Kim, Whiyoung Jung, Myungsik Cho, Young-Jin Sung","doi":"10.48550/arXiv.2303.00451","DOIUrl":"https://doi.org/10.48550/arXiv.2303.00451","url":null,"abstract":"In this paper, we propose a new mutual information framework for multi-agent reinforcement learning to enable multiple agents to learn coordinated behaviors by regularizing the accumulated return with the simultaneous mutual information between multi-agent actions. By introducing a latent variable to induce nonzero mutual information between multi-agent actions and applying a variational bound, we derive a tractable lower bound on the considered MMI-regularized objective function. The derived tractable objective can be interpreted as maximum entropy reinforcement learning combined with uncertainty reduction of other agents actions. Applying policy iteration to maximize the derived lower bound, we propose a practical algorithm named variational maximum mutual information multi-agent actor-critic, which follows centralized learning with decentralized execution. We evaluated VM3-AC for several games requiring coordination, and numerical results show that VM3-AC outperforms other MARL algorithms in multi-agent tasks requiring high-quality coordination.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"236 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122865703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Mitigating Skewed Bidding for Conference Paper Assignment
Pub Date : 2023-03-01 DOI: 10.48550/arXiv.2303.00435
Inbal Rozencweig, R. Meir, Nick Mattei, Ofra Amir
The explosion of conference paper submissions in AI and related fields has underscored the need to improve many aspects of the peer-review process, especially the matching of papers and reviewers. Recent work argues that the key to improving this matching is to modify aspects of the bidding phase itself, to ensure that the set of bids over papers is balanced, and in particular to avoid "orphan" papers, i.e., those papers that receive no bids. In an attempt to understand and mitigate this problem, we have developed a flexible bidding platform to test adaptations to the bidding process. Using this platform, we performed a field experiment during the bidding phase of a medium-size international workshop that compared two bidding methods. We further examined, via controlled experiments on Amazon Mechanical Turk, various factors that affect bidding, in particular the order in which papers are presented [cabanac2013capitalizing, fiez2020super] and information on paper demand [meir2021market]. Our results suggest that several simple adaptations, which can be added to any existing platform, may significantly reduce the skew in bids, thereby improving the allocation for both reviewers and conference organizers.
{"title":"Mitigating Skewed Bidding for Conference Paper Assignment","authors":"Inbal Rozencweig, R. Meir, Nick Mattei, Ofra Amir","doi":"10.48550/arXiv.2303.00435","DOIUrl":"https://doi.org/10.48550/arXiv.2303.00435","url":null,"abstract":"The explosion of conference paper submissions in AI and related fields, has underscored the need to improve many aspects of the peer review process, especially the matching of papers and reviewers. Recent work argues that the key to improve this matching is to modify aspects of the emph{bidding phase} itself, to ensure that the set of bids over papers is balanced, and in particular to avoid emph{orphan papers}, i.e., those papers that receive no bids. In an attempt to understand and mitigate this problem, we have developed a flexible bidding platform to test adaptations to the bidding process. Using this platform, we performed a field experiment during the bidding phase of a medium-size international workshop that compared two bidding methods. We further examined via controlled experiments on Amazon Mechanical Turk various factors that affect bidding, in particular the order in which papers are presented cite{cabanac2013capitalizing,fiez2020super}; and information on paper demand cite{meir2021market}. 
Our results suggest that several simple adaptations, that can be added to any existing platform, may significantly reduce the skew in bids, thereby improving the allocation for both reviewers and conference organizers.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116309607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Coordination of Multiple Robots along Given Paths with Bounded Junction Complexity
Pub Date : 2023-03-01 DOI: 10.48550/arXiv.2303.00745
Mikkel Abrahamsen, Tzvika Geft, D. Halperin, Barak Ugav
We study a fundamental NP-hard motion coordination problem for multi-robot/multi-agent systems: We are given a graph $G$ and a set of agents, where each agent has a given directed path in $G$. Each agent is initially located on the first vertex of its path. At each time step an agent can move to the next vertex on its path, provided that the vertex is not occupied by another agent. The goal is to find a sequence of such moves along the given paths so that each agent reaches its target, or to report that no such sequence exists. The problem models guidepath-based transport systems, which is a pertinent abstraction for traffic in a variety of contemporary applications, ranging from train networks or Automated Guided Vehicles (AGVs) in factories, through computer game animations, to qubit transport in quantum computing. It also arises as a sub-problem in the more general multi-robot motion-planning problem. We provide a fine-grained tractability analysis of the problem by considering new assumptions and identifying minimal values of key parameters for which the problem remains NP-hard. Our analysis identifies a critical parameter called vertex multiplicity (VM), defined as the maximum number of paths passing through the same vertex. We show that a prevalent variant of the problem, which is equivalent to Sequential Resource Allocation (concerning deadlock prevention for concurrent processes), is NP-hard even when VM is 3. On the positive side, for VM $\le$ 2 we give an efficient algorithm that iteratively resolves cycles of blocking relations among agents. We also present a variant that is NP-hard when VM is 2, even when $G$ is a 2D grid and each path lies in a single grid row or column.
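The movement model can be sketched with a greedy simulator (hypothetical helper, not the paper's algorithm, which additionally resolves blocking cycles): each turn, every agent advances along its fixed path whenever the next vertex is free, and a turn with no movement signals a blocking cycle:

```python
def coordinate_along_paths(paths):
    """Greedily move agents along their given paths; each agent starts at
    paths[i][0] (start vertices assumed distinct). Returns the list of
    agents moved per time step, or None when no agent can advance
    (a blocking cycle under this greedy rule)."""
    pos = [0] * len(paths)            # index of each agent on its path
    occupied = {p[0] for p in paths}  # currently occupied vertices
    schedule = []
    while any(pos[i] < len(paths[i]) - 1 for i in range(len(paths))):
        moved = []
        for i, path in enumerate(paths):
            if pos[i] == len(path) - 1:
                continue              # agent already at its target
            nxt = path[pos[i] + 1]
            if nxt not in occupied:   # move only onto a free vertex
                occupied.discard(path[pos[i]])
                occupied.add(nxt)
                pos[i] += 1
                moved.append(i)
        if not moved:
            return None               # nobody could move: blocked
        schedule.append(moved)
    return schedule
```

For example, two agents with paths a→b→c and c→d succeed, while the swap instance a→b versus b→a immediately blocks; the paper's contribution is precisely characterizing when such blocking can and cannot be resolved.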
{"title":"Coordination of Multiple Robots along Given Paths with Bounded Junction Complexity","authors":"Mikkel Abrahamsen, Tzvika Geft, D. Halperin, Barak Ugav","doi":"10.48550/arXiv.2303.00745","DOIUrl":"https://doi.org/10.48550/arXiv.2303.00745","url":null,"abstract":"We study a fundamental NP-hard motion coordination problem for multi-robot/multi-agent systems: We are given a graph $G$ and set of agents, where each agent has a given directed path in $G$. Each agent is initially located on the first vertex of its path. At each time step an agent can move to the next vertex on its path, provided that the vertex is not occupied by another agent. The goal is to find a sequence of such moves along the given paths so that each reaches its target, or to report that no such sequence exists. The problem models guidepath-based transport systems, which is a pertinent abstraction for traffic in a variety of contemporary applications, ranging from train networks or Automated Guided Vehicles (AGVs) in factories, through computer game animations, to qubit transport in quantum computing. It also arises as a sub-problem in the more general multi-robot motion-planning problem. We provide a fine-grained tractability analysis of the problem by considering new assumptions and identifying minimal values of key parameters for which the problem remains NP-hard. Our analysis identifies a critical parameter called vertex multiplicity (VM), defined as the maximum number of paths passing through the same vertex. We show that a prevalent variant of the problem, which is equivalent to Sequential Resource Allocation (concerning deadlock prevention for concurrent processes), is NP-hard even when VM is 3. On the positive side, for VM $le$ 2 we give an efficient algorithm that iteratively resolves cycles of blocking relations among agents. We also present a variant that is NP-hard when the VM is 2 even when $G$ is a 2D grid and each path lies in a single grid row or column. 
By studying highly distilled yet NP-hard variants, we deepen the understanding of what makes the problem intractable and thereby guide the search for efficient solutions under practical assumptions.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125677242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
IQ-Flow: Mechanism Design for Inducing Cooperative Behavior to Self-Interested Agents in Sequential Social Dilemmas
Pub Date : 2023-02-28 DOI: 10.48550/arXiv.2302.14604
Bengisu Guresti, Abdullah Vanlioglu, N. K. Ure
Achieving and maintaining cooperation between agents to accomplish a common objective is one of the central goals of Multi-Agent Reinforcement Learning (MARL). Nevertheless, in many real-world scenarios, separately trained and specialized agents are deployed into a shared environment, or the environment requires multiple objectives to be achieved by different coexisting parties. These variations among specialties and objectives are likely to cause mixed motives that eventually result in a social dilemma where all the parties are at a loss. In order to resolve this issue, we propose the Incentive Q-Flow (IQ-Flow) algorithm, which modifies the system's reward setup with an incentive regulator agent such that the cooperative policy also corresponds to the self-interested policy for the agents. Unlike existing methods that learn to incentivize self-interested agents, IQ-Flow does not make any assumptions about agents' policies or learning algorithms, which enables the generalization of the developed framework to a wider array of applications. IQ-Flow performs an offline evaluation of the optimality of the learned policies using the data provided by other agents to determine cooperative and self-interested policies. Next, IQ-Flow uses meta-gradient learning to estimate how policy evaluation changes according to given incentives, and modifies the incentive such that the greedy policies for the cooperative objective and the self-interested objective yield the same actions. We present the operational characteristics of IQ-Flow in Iterated Matrix Games. We demonstrate that IQ-Flow outperforms the state-of-the-art incentive design algorithm in Escape Room and 2-Player Cleanup environments. We further demonstrate that the pretrained IQ-Flow mechanism significantly outperforms the shared reward setup in the 2-Player Cleanup environment.
{"title":"IQ-Flow: Mechanism Design for Inducing Cooperative Behavior to Self-Interested Agents in Sequential Social Dilemmas","authors":"Bengisu Guresti, Abdullah Vanlioglu, N. K. Ure","doi":"10.48550/arXiv.2302.14604","DOIUrl":"https://doi.org/10.48550/arXiv.2302.14604","url":null,"abstract":"Achieving and maintaining cooperation between agents to accomplish a common objective is one of the central goals of Multi-Agent Reinforcement Learning (MARL). Nevertheless in many real-world scenarios, separately trained and specialized agents are deployed into a shared environment, or the environment requires multiple objectives to be achieved by different coexisting parties. These variations among specialties and objectives are likely to cause mixed motives that eventually result in a social dilemma where all the parties are at a loss. In order to resolve this issue, we propose the Incentive Q-Flow (IQ-Flow) algorithm, which modifies the system's reward setup with an incentive regulator agent such that the cooperative policy also corresponds to the self-interested policy for the agents. Unlike the existing methods that learn to incentivize self-interested agents, IQ-Flow does not make any assumptions about agents' policies or learning algorithms, which enables the generalization of the developed framework to a wider array of applications. IQ-Flow performs an offline evaluation of the optimality of the learned policies using the data provided by other agents to determine cooperative and self-interested policies. Next, IQ-Flow uses meta-gradient learning to estimate how policy evaluation changes according to given incentives and modifies the incentive such that the greedy policy for cooperative objective and self-interested objective yield the same actions. We present the operational characteristics of IQ-Flow in Iterated Matrix Games. We demonstrate that IQ-Flow outperforms the state-of-the-art incentive design algorithm in Escape Room and 2-Player Cleanup environments. 
We further demonstrate that the pretrained IQ-Flow mechanism significantly outperforms the performance of the shared reward setup in the 2-Player Cleanup environment.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131477169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Reinforcement Learning with Depreciating Assets
Pub Date : 2023-02-27 DOI: 10.48550/arXiv.2302.14176
Taylor Dohmen, Ashutosh Trivedi
A basic assumption of traditional reinforcement learning is that the value of a reward does not change once it is received by an agent. The present work forgoes this assumption and considers the situation where the value of a reward decays proportionally to the time elapsed since it was obtained. Emphasizing the inflection point occurring at the time of payment, we use the term asset to refer to a reward that is currently in the possession of an agent. Adopting this language, we initiate the study of depreciating assets within the framework of infinite-horizon quantitative optimization. In particular, we propose a notion of asset depreciation, inspired by classical exponential discounting, where the value of an asset is scaled by a fixed discount factor at each time step after it is obtained by the agent. We formulate a Bellman-style equational characterization of optimality in this context and develop a model-free reinforcement learning approach to obtain optimal policies.
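A minimal illustration of the depreciation scheme as we read it from the abstract (an assumed formulation, not the authors' code): a reward obtained at step $t$ is scaled by the discount factor once per subsequent step it is held, so at the final step $T-1$ it contributes $r \cdot \gamma^{T-1-t}$, the mirror image of classical discounting, which weights the same reward by $\gamma^t$:

```python
def depreciated_return(rewards, discount):
    """Value of the accumulated assets at the end of the episode, when each
    reward, once received, decays by `discount` every subsequent step.
    Early rewards depreciate longest; the last reward keeps full value."""
    T = len(rewards)
    return sum(r * discount ** (T - 1 - t) for t, r in enumerate(rewards))

def classical_return(rewards, discount):
    """Standard discounted return for comparison: later rewards count less."""
    return sum(r * discount ** t for t, r in enumerate(rewards))
```

The contrast shows up on asymmetric reward streams: the sequence [2, 0, 1] with discount 0.5 yields a depreciated value of 1.5 but a classical discounted return of 2.25.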
{"title":"Reinforcement Learning with Depreciating Assets","authors":"Taylor Dohmen, Ashutosh Trivedi","doi":"10.48550/arXiv.2302.14176","DOIUrl":"https://doi.org/10.48550/arXiv.2302.14176","url":null,"abstract":"A basic assumption of traditional reinforcement learning is that the value of a reward does not change once it is received by an agent. The present work forgoes this assumption and considers the situation where the value of a reward decays proportionally to the time elapsed since it was obtained. Emphasizing the inflection point occurring at the time of payment, we use the term asset to refer to a reward that is currently in the possession of an agent. Adopting this language, we initiate the study of depreciating assets within the framework of infinite-horizon quantitative optimization. In particular, we propose a notion of asset depreciation, inspired by classical exponential discounting, where the value of an asset is scaled by a fixed discount factor at each time step after it is obtained by the agent. We formulate a Bellman-style equational characterization of optimality in this context and develop a model-free reinforcement learning approach to obtain optimal policies.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134448818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Equilibrium Bandits: Learning Optimal Equilibria of Unknown Dynamics
Pub Date : 2023-02-27 DOI: 10.48550/arXiv.2302.13653
Siddharth Chandak, Ilai Bistritz, N. Bambos
Consider a decision-maker that can pick one out of $K$ actions to control an unknown system, for $T$ turns. The actions are interpreted as different configurations or policies. Holding the same action fixed, the system asymptotically converges to a unique equilibrium, as a function of this action. The dynamics of the system are unknown to the decision-maker, which can only observe a noisy reward at the end of every turn. The decision-maker wants to maximize its accumulated reward over the $T$ turns. Learning which equilibria are better results in higher rewards, but waiting for the system to converge to equilibrium costs valuable time. Existing bandit algorithms, either stochastic or adversarial, achieve linear (trivial) regret for this problem. We present a novel algorithm, termed Upper Equilibrium Concentration Bound (UECB), that knows to switch actions quickly when it is not worth waiting until the equilibrium is reached. This is enabled by employing convergence bounds to determine how far the system is from equilibrium. We prove that UECB achieves a regret of $\mathcal{O}(\log(T)+\tau_c\log(\tau_c)+\tau_c\log\log(T))$ for this equilibrium bandit problem, where $\tau_c$ is the worst-case approximate convergence time to equilibrium. We then show that both epidemic control and game control are special cases of equilibrium bandits, where $\tau_c\log\tau_c$ typically dominates the regret. We then test UECB numerically for both of these applications.
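The core tension, waiting for convergence versus switching actions, can be sketched with a toy noiseless environment (hypothetical names, and geometric convergence is our assumption for illustration): while an action is held, the state converges geometrically toward that action's equilibrium, and the per-turn reward is the current state:

```python
def play(action_equilibria, policy, T, rate=0.5):
    """Simulate a noiseless equilibrium-bandit environment for T turns.

    action_equilibria: equilibrium reward of each action.
    policy: maps the turn index to the action held that turn.
    Each turn the state moves a `rate` fraction of the way toward the
    held action's equilibrium, and the turn's reward is the new state.
    Returns the accumulated reward.
    """
    state = 0.0
    total = 0.0
    for t in range(T):
        a = policy(t)
        state += rate * (action_equilibria[a] - state)  # geometric convergence
        total += state
    return total
```

Holding a single action with equilibrium 1.0 for three turns yields states 0.5, 0.75, 0.875; switching actions restarts this transient, which is why naive exploration pays a convergence cost on every switch.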
{"title":"Equilibrium Bandits: Learning Optimal Equilibria of Unknown Dynamics","authors":"Siddharth Chandak, Ilai Bistritz, N. Bambos","doi":"10.48550/arXiv.2302.13653","DOIUrl":"https://doi.org/10.48550/arXiv.2302.13653","url":null,"abstract":"Consider a decision-maker that can pick one out of $K$ actions to control an unknown system, for $T$ turns. The actions are interpreted as different configurations or policies. Holding the same action fixed, the system asymptotically converges to a unique equilibrium, as a function of this action. The dynamics of the system are unknown to the decision-maker, which can only observe a noisy reward at the end of every turn. The decision-maker wants to maximize its accumulated reward over the $T$ turns. Learning what equilibria are better results in higher rewards, but waiting for the system to converge to equilibrium costs valuable time. Existing bandit algorithms, either stochastic or adversarial, achieve linear (trivial) regret for this problem. We present a novel algorithm, termed Upper Equilibrium Concentration Bound (UECB), that knows to switch an action quickly if it is not worth it to wait until the equilibrium is reached. This is enabled by employing convergence bounds to determine how far the system is from equilibrium. We prove that UECB achieves a regret of $mathcal{O}(log(T)+tau_clog(tau_c)+tau_cloglog(T))$ for this equilibrium bandit problem where $tau_c$ is the worst case approximate convergence time to equilibrium. We then show that both epidemic control and game control are special cases of equilibrium bandits, where $tau_clog tau_c$ typically dominates the regret. 
We then test UECB numerically for both of these applications.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129931787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Strategic (Timed) Computation Tree Logic
Pub Date : 2023-02-26 DOI: 10.48550/arXiv.2302.13405
Jaime Arias, W. Jamroga, W. Penczek, L. Petrucci, Teofil Sidoruk
We define extensions of CTL and TCTL with strategic operators, called Strategic CTL (SCTL) and Strategic TCTL (STCTL), respectively. For each of the above logics we give a synchronous and an asynchronous semantics, i.e., STCTL is interpreted over networks of extended Timed Automata (TA) that either make synchronous moves or synchronise via joint actions. We consider several semantics regarding information, imperfect (i) and perfect (I), and recall, imperfect (r) and perfect (R). We prove that SCTL is more expressive than ATL for all semantics, and this holds for the timed versions as well. Moreover, the model checking problem for SCTL[ir] is of the same complexity as for ATL[ir], the model checking problem for STCTL[ir] is of the same complexity as for TCTL, while for STCTL[iR] it is undecidable, as for ATL[iR]. The above results suggest using SCTL[ir] and STCTL[ir] in practical applications. Therefore, we use the tool IMITATOR to support model checking of STCTL[ir].
{"title":"Strategic (Timed) Computation Tree Logic","authors":"Jaime Arias, W. Jamroga, W. Penczek, L. Petrucci, Teofil Sidoruk","doi":"10.48550/arXiv.2302.13405","DOIUrl":"https://doi.org/10.48550/arXiv.2302.13405","url":null,"abstract":"We define extensions of CTL and TCTL with strategic operators, called Strategic CTL (SCTL) and Strategic TCTL (STCTL), respectively. For each of the above logics we give a synchronous and asynchronous semantics, i.e., STCTL is interpreted over networks of extended Timed Automata (TA) that either make synchronous moves or synchronise via joint actions. We consider several semantics regarding information: imperfect (i) and perfect (I), and recall: imperfect (r) and perfect (R). We prove that SCTL is more expressive than ATL for all semantics, and this holds for the timed versions as well. Moreover, the model checking problem for SCTL[ir] is of the same complexity as for ATL[ir], the model checking problem for STCTL[ir] is of the same complexity as for TCTL, while for STCTL[iR] it is undecidable as for ATL[iR]. The above results suggest to use SCTL[ir] and STCTL[ir] in practical applications. Therefore, we use the tool IMITATOR to support model checking of STCTL[ir].","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125975278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
MMS Allocations of Chores with Connectivity Constraints: New Methods and New Results 具有连通性约束的多任务分配:新方法和新结果
Pub Date : 2023-02-26 DOI: 10.48550/arXiv.2302.13224
Mingyu Xiao, Guoliang Qiu, Sen Huang
We study the problem of allocating indivisible chores to agents under the Maximin share (MMS) fairness notion. The chores are embedded on a graph and each bundle of chores assigned to an agent should be connected. Although there is a simple algorithm for MMS allocations of goods on trees, it remains open whether MMS allocations of chores on trees always exist or not, which is a simple but annoying problem in chores allocation. In this paper, we introduce a new method for chores allocation with connectivity constraints, called the group-satisfied method, that can show the existence of MMS allocations of chores on several subclasses of trees. Even these subcases are non-trivial and our results can be considered as a significant step to the open problem. We also consider MMS allocations of chores on cycles where we get the tight approximation ratio for three agents. Our result was obtained via the linear programming (LP) method, which enables us not only to compute approximate MMS allocations but also to construct tight examples of the nonexistence of MMS allocations without complicated combinatorial analysis. These two proposed methods, the group-satisfied method and the LP method, have the potential to solve more related problems.
本文研究了在最大最小份额(Maximin share, MMS)公平性概念下将不可分割的杂务分配给智能体的问题。杂务嵌入在一个图上,分配给每个智能体的每束杂务都应当是连通的。虽然树上物品的MMS分配存在一个简单算法,但树上杂务的MMS分配是否总是存在仍是一个悬而未决的问题,这是杂务分配中一个简单却恼人的问题。本文提出了一种新的带连通性约束的杂务分配方法,称为群满足(group-satisfied)方法,可以证明在树的若干子类上杂务的MMS分配总是存在。即使这些子情形也并非平凡,我们的结果可以被视为迈向该开放问题的重要一步。我们还考虑了环(cycle)上杂务的MMS分配,并对三个智能体的情形得到了紧的近似比。该结果是通过线性规划(LP)方法得到的,它不仅使我们能够计算近似MMS分配,还能在不进行复杂组合分析的情况下构造MMS分配不存在的紧例子。所提出的群满足方法和LP方法这两种方法有潜力解决更多相关问题。
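On a path, connected bundles are exactly contiguous intervals, so an agent's MMS value for chores can be computed by brute force over cut points. This is a minimal sketch of the fairness notion itself, not the paper's group-satisfied or LP method; the cost vector below is invented for illustration:

```python
from itertools import combinations

def mms_value_on_path(costs, n_agents):
    """Maximin-share value for chores on a path: the agent partitions the
    path into n_agents connected (contiguous) bundles and is guaranteed the
    worst (most costly) bundle, so she minimises the maximum bundle cost."""
    m = len(costs)
    best = float("inf")
    # choose n_agents - 1 cut points between consecutive chores
    for cuts in combinations(range(1, m), n_agents - 1):
        bounds = (0,) + cuts + (m,)
        worst = max(sum(costs[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, worst)
    return best
```

For costs `[4, 1, 3, 2, 5]` and two agents, the best contiguous split is `[4, 1, 3] | [2, 5]`, giving an MMS value of 8. On general trees or cycles the set of connected partitions is richer, which is where the paper's methods come in.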
{"title":"MMS Allocations of Chores with Connectivity Constraints: New Methods and New Results","authors":"Mingyu Xiao, Guoliang Qiu, Sen Huang","doi":"10.48550/arXiv.2302.13224","DOIUrl":"https://doi.org/10.48550/arXiv.2302.13224","abstract":"We study the problem of allocating indivisible chores to agents under the Maximin share (MMS) fairness notion. The chores are embedded on a graph and each bundle of chores assigned to an agent should be connected. Although there is a simple algorithm for MMS allocations of goods on trees, it remains open whether MMS allocations of chores on trees always exist or not, which is a simple but annoying problem in chores allocation. In this paper, we introduce a new method for chores allocation with connectivity constraints, called the group-satisfied method, that can show the existence of MMS allocations of chores on several subclasses of trees. Even these subcases are non-trivial and our results can be considered as a significant step to the open problem. We also consider MMS allocations of chores on cycles where we get the tight approximation ratio for three agents. Our result was obtained via the linear programming (LP) method, which enables us not only to compute approximate MMS allocations but also to construct tight examples of the nonexistence of MMS allocations without complicated combinatorial analysis. These two proposed methods, the group-satisfied method and the LP method, have the potential to solve more related problems.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122305444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Data Structures for Deviation Payoffs 偏差收益的数据结构
Pub Date : 2023-02-26 DOI: 10.48550/arXiv.2302.13232
Bryce Wiedenbeck, Erik Brinkman
We present new data structures for representing symmetric normal-form games. These data structures are optimized for efficiently computing the expected utility of each unilateral pure-strategy deviation from a symmetric mixed-strategy profile. The cumulative effect of numerous incremental innovations is a dramatic speedup in the computation of symmetric mixed-strategy Nash equilibria, making it practical to represent and solve games with dozens to hundreds of players. These data structures naturally extend to role-symmetric and action-graph games with similar benefits.
我们提出了用于表示对称正则形式(normal-form)博弈的新数据结构。这些数据结构经过优化,能够高效地计算从对称混合策略组合出发的每个单边纯策略偏离的期望效用。众多渐进式创新的累积效应使对称混合策略纳什均衡的计算速度显著提升,从而使表示和求解拥有数十到数百个参与者的博弈变得切实可行。这些数据结构可以自然地扩展到角色对称博弈和动作图博弈,并带来类似的好处。
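For context, the quantity being sped up here can be computed naively: with n players and a symmetric mixed profile sigma, the expected payoff of deviating to pure action a is a sum over multisets of the n-1 opponents' strategies, weighted by multinomial probabilities. The sketch below is that baseline, not the paper's optimized data structures; the `payoff(a, counts)` callback is a hypothetical interface standing in for a stored symmetric payoff table:

```python
from itertools import combinations_with_replacement
from math import comb
from collections import Counter

def deviation_payoffs(n_players, payoff, sigma):
    """Expected utility of each unilateral pure-strategy deviation from a
    symmetric mixed profile sigma.  payoff(a, counts) gives the deviator's
    utility for playing action a when the n-1 opponents split into
    counts[s] players of each strategy s."""
    n_strats = len(sigma)
    devs = [0.0] * n_strats
    # enumerate opponent configurations (multisets of n-1 strategies)
    for config in combinations_with_replacement(range(n_strats), n_players - 1):
        counts = Counter(config)
        # multinomial probability of this opponent configuration under sigma
        p, remaining = 1.0, n_players - 1
        for s, c in counts.items():
            p *= comb(remaining, c) * sigma[s] ** c
            remaining -= c
        for a in range(n_strats):
            devs[a] += p * payoff(a, counts)
    return devs
```

For instance, in a 3-player, 2-strategy game where the deviator's payoff is the number of opponents playing the other action, sigma = [0.3, 0.7] yields deviation payoffs [1.4, 0.6] by linearity. The enumeration has size C(n+S-2, S-1), which is exactly the kind of cost the paper's incremental data structures are designed to amortize.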
{"title":"Data Structures for Deviation Payoffs","authors":"Bryce Wiedenbeck, Erik Brinkman","doi":"10.48550/arXiv.2302.13232","DOIUrl":"https://doi.org/10.48550/arXiv.2302.13232","url":null,"abstract":"We present new data structures for representing symmetric normal-form games. These data structures are optimized for efficiently computing the expected utility of each unilateral pure-strategy deviation from a symmetric mixed-strategy profile. The cumulative effect of numerous incremental innovations is a dramatic speedup in the computation of symmetric mixed-strategy Nash equilibria, making it practical to represent and solve games with dozens to hundreds of players. These data structures naturally extend to role-symmetric and action-graph games with similar benefits.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131718768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1