Conference on Learning for Dynamics & Control: Latest Papers

Multi-Agent Reinforcement Learning with Reward Delays

Conference on Learning for Dynamics & Control

Pub Date : 2022-12-02 DOI: 10.48550/arXiv.2212.01441

Yuyang Zhang, Runyu Zhang, Gen Li, Yu Gu, N. Li

This paper considers multi-agent reinforcement learning (MARL) where the rewards are received after delays and the delay time varies across agents and across time steps. Based on the V-learning framework, this paper proposes MARL algorithms that efficiently deal with reward delays. When the delays are finite, our algorithm reaches a coarse correlated equilibrium (CCE) with rate $\tilde{\mathcal{O}}\left(\frac{H^3\sqrt{S\mathcal{T}_K}}{K}+\frac{H^3\sqrt{SA}}{\sqrt{K}}\right)$, where $K$ is the number of episodes, $H$ is the planning horizon, $S$ is the size of the state space, $A$ is the size of the largest action space, and $\mathcal{T}_K$ is the measure of total delay formally defined in the paper. Moreover, our algorithm is extended to cases with infinite delays through a reward skipping scheme. It achieves a convergence rate similar to that of the finite delay case.

Citations: 3

Multi-Task Imitation Learning for Linear Dynamical Systems

Conference on Learning for Dynamics & Control

Pub Date : 2022-12-01 DOI: 10.48550/arXiv.2212.00186

Thomas Zhang, Katie Kang, Bruce Lee, C. Tomlin, S. Levine, Stephen Tu, N. Matni

We study representation learning for efficient imitation learning over linear systems. In particular, we consider a setting where learning is split into two phases: (a) a pre-training step where a shared $k$-dimensional representation is learned from $H$ source policies, and (b) a target policy fine-tuning step where the learned representation is used to parameterize the policy class. We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left( \frac{k n_x}{H N_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}}\right)$, where $n_x>k$ is the state dimension, $n_u$ is the input dimension, $N_{\mathrm{shared}}$ denotes the total amount of data collected for each policy during representation learning, and $N_{\mathrm{target}}$ is the amount of target task data. This result formalizes the intuition that aggregating data across related tasks to learn a representation can significantly improve the sample efficiency of learning a target task. The trends suggested by this bound are corroborated in simulation.
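
To make the trend in the bound above concrete, the following minimal sketch evaluates the two terms of the rate while ignoring constants and logarithmic factors (both are hidden by the $\tilde{O}$ notation). The function name, argument names, and the numerical values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def imitation_gap_rate(k, n_x, n_u, num_sources, n_shared, n_target):
    """Evaluate k*n_x/(H*N_shared) + k*n_u/N_target without constants or log factors.

    num_sources plays the role of H (number of source policies) in the abstract.
    """
    shared_term = k * n_x / (num_sources * n_shared)  # shrinks as more source tasks are pooled
    target_term = k * n_u / n_target                  # depends only on target-task data
    return shared_term + target_term


# Hypothetical problem sizes: same target data, 1 versus 10 source policies.
one_source = imitation_gap_rate(k=2, n_x=10, n_u=3, num_sources=1, n_shared=100, n_target=50)
ten_sources = imitation_gap_rate(k=2, n_x=10, n_u=3, num_sources=10, n_shared=100, n_target=50)
print(one_source, ten_sources)
```

With these made-up numbers the representation-learning term drops by a factor of ten when ten source policies are pooled, while the target-task term is unchanged, which is the qualitative behavior the abstract describes.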

Citations: 8

Top-k data selection via distributed sample quantile inference

Conference on Learning for Dynamics & Control

Pub Date : 2022-12-01 DOI: 10.48550/arXiv.2212.00230

Xu Zhang, M. Vasconcelos

We consider the problem of determining the top-$k$ largest measurements from a dataset distributed among a network of $n$ agents with noisy communication links. We show that this scenario can be cast as a distributed convex optimization problem called sample quantile inference, which we solve using a two-time-scale stochastic approximation algorithm. Herein, we prove the algorithm's convergence in the almost sure sense to an optimal solution. Moreover, our algorithm handles noise and empirically converges to the correct answer within a small number of iterations.
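
The link between top-$k$ selection and convex optimization can be illustrated with a small sketch. It is a centralized, noise-free, single-time-scale simplification and not the distributed two-time-scale algorithm of the paper: it runs plain subgradient descent on the standard pinball (quantile) loss, whose minimizer is a sample quantile. The quantile level `tau`, the step size `1/sqrt(t)`, the function names, and the data are assumptions made for this illustration.

```python
def pinball_subgradient(theta, data, tau):
    """Subgradient in theta of sum_i rho_tau(z_i - theta), the pinball loss."""
    below = sum(1 for z in data if z < theta)
    above = sum(1 for z in data if z > theta)
    return (1.0 - tau) * below - tau * above


def estimate_kth_largest(data, k, iterations=2000):
    """Approximate the k-th largest value by subgradient descent on the pinball loss."""
    n = len(data)
    # Assumed quantile level: for distinct values, this makes the k-th largest
    # value the unique minimizer of the pinball objective.
    tau = (n - k + 0.5) / n
    theta = 0.0
    for t in range(1, iterations + 1):
        theta -= pinball_subgradient(theta, data, tau) / t ** 0.5  # diminishing step
    return theta


data = [3.0, 9.0, 1.0, 7.0, 5.0, 8.0, 2.0]
theta = estimate_kth_largest(data, k=3)
# The exact minimizer is a data point, so snap the estimate to the nearest one.
threshold = min(data, key=lambda z: abs(z - theta))
top_k = sorted(z for z in data if z >= threshold)
print(theta, top_k)
```

Once the threshold is known, each data holder only needs to compare its own measurement against it, which is why a quantile estimate is enough to identify the top-$k$ set.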

Citations: 0

CatlNet: Learning Communication and Coordination Policies from CaTL+ Specifications

Conference on Learning for Dynamics & Control

Pub Date : 2022-11-30 DOI: 10.48550/arXiv.2212.11792

Wenliang Liu, Kevin J. Leahy, Zachary T. Serlin, C. Belta

In this paper, we propose a learning-based framework to simultaneously learn the communication and distributed control policies for a heterogeneous multi-agent system (MAS) under complex mission requirements from Capability Temporal Logic plus (CaTL+) specifications. Both policies are trained, implemented, and deployed using a novel neural network model called CatlNet. Taking advantage of the robustness measure of CaTL+, we train CatlNet centrally to maximize it, with network parameters shared among all agents, allowing CatlNet to scale to large teams easily. CatlNet can then be deployed in a distributed manner. A plan repair algorithm is also introduced to guide CatlNet's training and improve both training efficiency and the overall performance of CatlNet. The CatlNet approach is tested in simulation, and results show that, after training, CatlNet can steer the decentralized MAS online to satisfy a CaTL+ specification with a high success rate.

Citations: 1

Lie Group Forced Variational Integrator Networks for Learning and Control of Robot Systems

Conference on Learning for Dynamics & Control

Pub Date : 2022-11-29 DOI: 10.48550/arXiv.2211.16006

Valentin Duruisseaux, T. Duong, M. Leok, Nikolay A. Atanasov

Incorporating prior knowledge of physics laws and structural properties of dynamical systems into the design of deep learning architectures has proven to be a powerful technique for improving their computational efficiency and generalization capacity. Learning accurate models of robot dynamics is critical for safe and stable control. Autonomous mobile robots, including wheeled, aerial, and underwater vehicles, can be modeled as controlled Lagrangian or Hamiltonian rigid-body systems evolving on matrix Lie groups. In this paper, we introduce a new structure-preserving deep learning architecture, the Lie group Forced Variational Integrator Network (LieFVIN), capable of learning controlled Lagrangian or Hamiltonian dynamics on Lie groups, either from position-velocity or position-only data. By design, LieFVINs preserve both the Lie group structure on which the dynamics evolve and the symplectic structure underlying the Hamiltonian or Lagrangian systems of interest. The proposed architecture learns surrogate discrete-time flow maps allowing accurate and fast prediction without numerical-integrator, neural-ODE, or adjoint techniques, which are needed for vector fields. Furthermore, the learnt discrete-time dynamics can be utilized with computationally scalable discrete-time (optimal) control strategies.

Citations: 4

Provably Efficient Model-free RL in Leader-Follower MDP with Linear Function Approximation

Conference on Learning for Dynamics & Control

Pub Date : 2022-11-28 DOI: 10.48550/arXiv.2211.15792

A. Ghosh

We consider a multi-agent episodic MDP setup where an agent (leader) takes an action at each step of the episode, followed by another agent (follower). The state evolution and rewards depend on the joint action pair of the leader and the follower. Such interactions find applications in many domains such as smart grids, mechanism design, security, and policymaking. We are interested in how to learn policies for both players with provable performance guarantees under a bandit feedback setting. We focus on a setup where both the leader and the follower are non-myopic, i.e., they both seek to maximize their rewards over the entire episode, and consider a linear MDP, which can model the continuous state spaces that are common in many RL applications. We propose a model-free RL algorithm and show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bounds can be achieved for both the leader and the follower, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps under the bandit feedback information setup. Thus, our result holds even when the number of states becomes infinite. The algorithm relies on a novel adaptation of the LSVI-UCB algorithm. Specifically, we replace the standard greedy policy (as the best response) with the soft-max policy for both the leader and the follower. This turns out to be key in establishing a uniform concentration bound for the value functions. To the best of our knowledge, this is the first sub-linear regret bound guarantee for Markov games with non-myopic followers and function approximation.
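
The difference between a greedy and a soft-max response can be sketched in a few lines. The abstract does not give the exact soft-max parametrization, so the temperature parameter, the function names, and the Q-values below are assumptions for illustration only. The sketch shows one property that plausibly matters for concentration arguments: a tiny change in the value estimates can flip the greedy policy entirely, while the soft-max policy changes only slightly.

```python
import math


def greedy_policy(q_values):
    """Put all probability mass on the action with the largest estimated value."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def softmax_policy(q_values, temperature=1.0):
    """Probabilities proportional to exp(Q / temperature); temperature is an assumed knob."""
    shift = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - shift) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Two hypothetical value estimates that differ by a small perturbation.
q_a = [1.00, 1.01]
q_b = [1.01, 1.00]

greedy_change = max(abs(p - q) for p, q in zip(greedy_policy(q_a), greedy_policy(q_b)))
softmax_change = max(abs(p - q) for p, q in zip(softmax_policy(q_a), softmax_policy(q_b)))
print(greedy_change, softmax_change)
```

Here the greedy policy switches from one action to the other, whereas the soft-max probabilities move by well under one percent.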

Citations: 0

CLAS: Coordinating Multi-Robot Manipulation with Central Latent Action Spaces

Conference on Learning for Dynamics & Control

Pub Date : 2022-11-28 DOI: 10.48550/arXiv.2211.15824

Elie Aljalbout, Maximilian Karl, Patrick van der Smagt

Multi-robot manipulation tasks involve various control entities that can be separated into dynamically independent parts. A typical example of such real-world tasks is dual-arm manipulation. Naively solving such tasks with reinforcement learning is often infeasible because the sample complexity and exploration requirements grow with the dimensionality of the action and state spaces. Instead, we would like to handle such environments as multi-agent systems and have several agents control parts of the whole. However, decentralizing the generation of actions requires coordination across agents through a channel limited to information central to the task. This paper proposes an approach to coordinating multi-robot manipulation through learned latent action spaces that are shared across different agents. We validate our method in simulated multi-robot manipulation tasks and demonstrate improvement over previous baselines in terms of sample efficiency and learning performance.

Citations: 1

Rectified Pessimistic-Optimistic Learning for Stochastic Continuum-armed Bandit with Constraints

Conference on Learning for Dynamics & Control

Pub Date : 2022-11-27 DOI: 10.48550/arXiv.2211.14720

Heng Guo, Qi Zhu, Xin Liu

This paper studies the problem of stochastic continuum-armed bandit with constraints (SCBwC), where we optimize a black-box reward function $f(x)$ subject to a black-box constraint function $g(x)\leq 0$ over a continuous space $\mathcal{X}$. We model reward and constraint functions via Gaussian processes (GPs) and propose a Rectified Pessimistic-Optimistic Learning framework (RPOL), a penalty-based method incorporating optimistic and pessimistic GP bandit learning for reward and constraint functions, respectively. We consider the metric of cumulative constraint violation $\sum_{t=1}^T (g(x_t))^{+}$, which is strictly stronger than the traditional long-term constraint violation $\sum_{t=1}^T g(x_t)$. The rectified design for the penalty update and the pessimistic learning for the constraint function in RPOL guarantee that the cumulative constraint violation is minimal. RPOL can achieve sublinear regret and cumulative constraint violation for SCBwC and its variants (e.g., under delayed feedback and non-stationary environments). These theoretical results match their unconstrained counterparts. Our experiments show that RPOL outperforms several existing baseline algorithms.
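
The gap between the two violation metrics in the abstract above can be checked with a short numerical sketch. The function names and the sequence of constraint values are hypothetical; only the two formulas come from the abstract. Because $(g)^{+} = \max(g, 0) \geq g$, the cumulative violation always upper-bounds the long-term violation, and it does not let feasible rounds cancel out infeasible ones.

```python
def long_term_violation(g_values):
    """sum_t g(x_t): negative (feasible) rounds can offset positive (violating) rounds."""
    return sum(g_values)


def cumulative_violation(g_values):
    """sum_t max(g(x_t), 0): every violating round is counted, with no cancellation."""
    return sum(max(g, 0.0) for g in g_values)


# Hypothetical constraint values g(x_t) over four rounds; the constraint is g(x) <= 0.
g_values = [0.5, -0.5, 0.25, -0.25]

print(long_term_violation(g_values))   # violations are masked by feasible rounds
print(cumulative_violation(g_values))  # violations remain visible
```

In this example the long-term metric reports zero violation even though the constraint was broken in two of the four rounds, while the cumulative metric records both breaches.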

Citations: 3

DLKoopman: A deep learning software package for Koopman theory

Conference on Learning for Dynamics & Control

Pub Date : 2022-11-15 DOI: 10.48550/arXiv.2211.08992

Sourya Dey, Eric K. Davis

We present DLKoopman -- a software package for Koopman theory that uses deep learning to learn an encoding of a nonlinear dynamical system into a linear space, while simultaneously learning the linear dynamics. While several previous efforts have either restricted the ability to learn encodings, or been bespoke efforts designed for specific systems, DLKoopman is a generalized tool that can be applied to data-driven learning and optimization of any dynamical system. It can either be trained on data from individual states (snapshots) of a system and used to predict its unknown states, or trained on data from trajectories of a system and used to predict unknown trajectories for new initial states. DLKoopman is available on the Python Package Index (PyPI) as 'dlkoopman', and includes extensive documentation and tutorials. Additional contributions of the package include a novel metric called Average Normalized Absolute Error for evaluating performance, and a ready-to-use hyperparameter search module for improving performance.

Citations: 0

Competing Bandits in Time Varying Matching Markets

Conference on Learning for Dynamics & Control

Pub Date : 2022-10-21 DOI: 10.48550/arXiv.2210.11692

Deepan Muthirayan, C. Maheshwari, P. Khargonekar, S. Sastry

We study the problem of online learning in two-sided non-stationary matching markets, where the objective is to converge to a stable match. In particular, we consider the setting where one side of the market, the arms, has a fixed, known set of preferences over the other side, the players. While this problem has been studied when the players have fixed but unknown preferences, in this work we study the problem of how to learn when the preferences of the players are time varying and unknown. Our contribution is a methodology that can handle any type of preference structure and variation scenario. We show that, with the proposed algorithm, each player receives a uniform sub-linear regret of $\widetilde{\mathcal{O}}(L_T^{1/2}T^{1/2})$ up to the number of changes in the underlying preferences of the agents, $L_T$. Therefore, we show that the optimal rates for single-agent learning can be achieved in spite of the competition, up to a constant factor. We also discuss extensions of this algorithm to the case where the number of changes need not be known a priori.

Citations: 0
