Improving Gradient Computation for Differentiable Physics Simulation with Contacts
Pub Date: 2023-04-28 | DOI: 10.48550/arXiv.2305.00092
Yaofeng Desmond Zhong, Jiequn Han, Biswadip Dey, Georgia Olympia Brikis
Differentiable simulation enables gradients to be back-propagated through physics simulations. In this way, one can learn the dynamics and properties of a physics system by gradient-based optimization, or embed the whole differentiable simulation as a layer in a deep learning model for downstream tasks such as planning and control. However, differentiable simulation is not yet perfect and may provide incorrect gradients that degrade its performance in learning tasks. In this paper, we study differentiable rigid-body simulation with contacts. We find that existing differentiable simulation methods provide inaccurate gradients when the contact normal direction is not fixed, a common situation when contacts occur between two moving objects. We propose to improve gradient computation by using continuous collision detection and leveraging the time-of-impact (TOI) to calculate post-collision velocities. We demonstrate the proposed method, referred to as TOI-Velocity, on two optimal control problems. We show that with TOI-Velocity we are able to learn an optimal control sequence that matches the analytical solution, while without it, existing differentiable simulation methods fail to do so.
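To make the use of time-of-impact concrete, here is a minimal, hedged sketch (not the authors' code: the 1D bouncing-ball setup, semi-implicit Euler integrator, and function names are all assumptions) contrasting a step that resolves the contact only at the end of the time step with one that rewinds to the impact time before applying the restitution impulse:

```python
def step_naive(x, v, dt, g=-9.8, e=0.9):
    """One step of a 1D ball above a floor at x = 0; the collision impulse
    is applied only after the full step, at a penetrated state."""
    v_new = v + g * dt
    x_new = x + v_new * dt
    if x_new < 0.0:
        v_new = -e * v_new   # impulse applied at the wrong (penetrated) state
        x_new = 0.0          # position clamped back to the contact surface
    return x_new, v_new


def step_toi(x, v, dt, g=-9.8, e=0.9):
    """Same step, but continuous collision detection finds the time of
    impact (TOI) and the impulse is applied at the impact state."""
    v_new = v + g * dt
    x_new = x + v_new * dt
    if x_new >= 0.0:
        return x_new, v_new
    toi = x / (-v_new)               # time within the step at which x reaches 0
    v_post = -e * v_new              # restitution applied at the impact state
    x_post = v_post * (dt - toi)     # finish the remainder of the step from x = 0
    return x_post, v_post
```

The sketch only illustrates the mechanics of using TOI inside a single step; the paper's analysis concerns when this matters for gradient quality, in particular contacts whose normal direction changes because both objects move.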
Roll-Drop: accounting for observation noise with a single parameter
Pub Date: 2023-04-25 | DOI: 10.48550/arXiv.2304.13150
Luigi Campanaro, D. Martini, Siddhant Gangapurwala, W. Merkt, I. Havoutis
This paper proposes a simple strategy for sim-to-real transfer in Deep Reinforcement Learning (DRL) -- called Roll-Drop -- that uses dropout during simulation to account for observation noise during deployment, without explicitly modelling the noise distribution for each state. DRL is a promising approach to controlling robots for highly dynamic and feedback-based manoeuvres, and accurate simulators are crucial for providing cheap and abundant data to learn the desired behaviour. Nevertheless, the simulated data are noiseless and generally show a distributional shift that challenges deployment on real machines, where sensor readings are affected by noise. The standard solution is to model the noise and inject it during training; while this requires thorough system identification, Roll-Drop improves robustness to sensor noise by tuning only a single parameter. We demonstrate an 80% success rate when up to 25% noise is injected into the observations, with twice the robustness of the baselines. We deploy the controller trained in simulation on a Unitree A1 platform and assess the improved robustness on the physical system.
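A minimal sketch of the mechanism, under the assumption that Roll-Drop amounts to applying dropout to the observation vector during simulated rollouts (the function name and exact placement are illustrative, not taken from the paper):

```python
import numpy as np

def roll_drop(obs, p_drop, rng):
    """Zero each observation dimension independently with probability
    p_drop -- the single tunable parameter -- so the policy learns not to
    rely too heavily on any individual, possibly noisy, sensor reading."""
    mask = (rng.random(obs.shape) >= p_drop).astype(obs.dtype)
    return obs * mask

# Illustrative use inside a simulated training rollout:
rng = np.random.default_rng(0)
obs = np.array([0.12, -0.34, 0.71, 0.05])
obs_for_policy = roll_drop(obs, p_drop=0.05, rng=rng)
```

At deployment dropout is switched off; the single parameter `p_drop` stands in for the explicit noise model (and the system identification behind it) that the standard approach injects during training.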
Filter-Aware Model-Predictive Control
Pub Date: 2023-04-20 | DOI: 10.48550/arXiv.2304.10246
Baris Kayalibay, Atanas Mirchev, Ahmed Agha, Patrick van der Smagt, Justin Bayer
Partially observable problems pose a trade-off between reducing costs and gathering information. They can be solved optimally by planning in belief space, but that is often prohibitively expensive. Model-predictive control (MPC) takes the alternative approach of using a state estimator to form a belief over the state and then planning in state space. This ignores potential future observations during planning and, as a result, cannot actively increase or preserve the certainty of its own state estimate. We find a middle ground between planning in belief space and completely ignoring the belief dynamics by reasoning only about the future accuracy of the state estimate. Our approach, filter-aware MPC, penalises the loss of information through what we call "trackability", the expected error of the state estimator. We show that model-based simulation allows condensing trackability into a neural network, which enables fast planning. In experiments involving visual navigation, realistic everyday environments, and a two-link robot arm, we show that filter-aware MPC vastly improves over regular MPC.
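A hedged sketch of the idea as a random-shooting planner (the planner type, the scalar weight `lam`, and all function names are assumptions; the paper's trackability network is represented here by an arbitrary callable):

```python
import numpy as np

def filter_aware_mpc(x0, dynamics, task_cost, trackability, horizon=20,
                     n_samples=256, action_dim=2, lam=1.0, rng=None):
    """Pick the first action of the candidate action sequence that minimises
    task cost plus a learned penalty on expected state-estimator error
    ("trackability") along the rollout."""
    rng = rng or np.random.default_rng()
    candidates = rng.normal(size=(n_samples, horizon, action_dim))
    best_cost, best_plan = np.inf, None
    for seq in candidates:
        x, cost = x0, 0.0
        for u in seq:
            cost += task_cost(x, u) + lam * trackability(x)
            x = dynamics(x, u)
        if cost < best_cost:
            best_cost, best_plan = cost, seq
    return best_plan[0]   # receding horizon: apply only the first action
```

The key difference from regular MPC is the `lam * trackability(x)` term, which discourages plans that would degrade the filter's estimate even if they look cheap under the task cost alone.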
Continuous Versatile Jumping Using Learned Action Residuals
Pub Date: 2023-04-17 | DOI: 10.48550/arXiv.2304.08663
Yuxiang Yang, Xiang Meng, Wenhao Yu, Tingnan Zhang, Jie Tan, Byron Boots
Jumping is essential for legged robots to traverse difficult terrain. In this work, we propose a hierarchical framework that combines optimal control and reinforcement learning to learn continuous jumping motions for quadrupedal robots. The core of our framework is a stance controller, which combines a manually designed acceleration controller with a learned residual policy. The acceleration controller warm-starts the policy for efficient training, while the trained policy overcomes the limitations of the acceleration controller and improves jumping stability. In addition, a low-level whole-body controller converts the body pose commands from the stance controller into motor commands. After training in simulation, our framework can be deployed directly to the real robot and perform versatile, continuous jumping motions, including omnidirectional jumps of up to 50 cm in height and 60 cm forward, and jump-turns of up to 90 degrees. Please visit our website for more results: https://sites.google.com/view/learning-to-jump.
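The core residual composition, as a hedged sketch (function names and the residual scaling are assumptions, not the authors' interface):

```python
def stance_command(state, accel_controller, residual_policy, scale=1.0):
    """The hand-designed acceleration controller provides a nominal command
    that warm-starts training; the learned policy adds a residual on top,
    which is what lets the combined controller exceed the nominal one."""
    nominal = accel_controller(state)
    residual = residual_policy(state)
    return nominal + scale * residual
```

A low-level whole-body controller (not sketched here) would then turn the resulting body-pose command into motor commands.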
Full Gradient Deep Reinforcement Learning for Average-Reward Criterion
Pub Date: 2023-04-07 | DOI: 10.48550/arXiv.2304.03729
Tejas Pagare, V. Borkar, Konstantin Avrachenkov
We extend the provably convergent Full Gradient DQN algorithm for discounted-reward Markov decision processes from Avrachenkov et al. (2021) to average-reward problems. We experimentally compare the widely used RVI Q-Learning and the recently proposed Differential Q-Learning in the neural function approximation setting with Full Gradient DQN and DQN. We also extend this approach to learn Whittle indices for Markovian restless multi-armed bandits. We observe a better convergence rate for the proposed Full Gradient variant across different tasks.
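A hedged PyTorch sketch of what an average-reward, full-gradient DQN loss can look like (the RVI-style reference-state offset and the absence of a `detach()` on the bootstrapped target are the two ingredients being illustrated; the exact details in the paper may differ):

```python
import torch

def full_gradient_avg_reward_loss(q_net, batch, ref_state, ref_action):
    """Average-reward TD loss in the spirit of RVI Q-learning. Unlike
    standard (semi-gradient) DQN, the target is not detached, so the
    gradient flows through Q(s, a), the reference offset, and
    max_a' Q(s', a')."""
    s, a, r, s_next = batch                               # states, actions, rewards, next states
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    rho = q_net(ref_state)[0, ref_action]                 # RVI-style average-reward offset
    target = r - rho + q_net(s_next).max(dim=1).values    # note: no .detach()
    return ((q_sa - target) ** 2).mean()
```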
Learning Stability Attention in Vision-based End-to-end Driving Policies
Pub Date: 2023-04-05 | DOI: 10.48550/arXiv.2304.02733
Tsun-Hsuan Wang, Wei Xiao, Makram Chahine, Alexander Amini, Ramin M. Hasani, Daniela Rus
Modern end-to-end learning systems can learn to explicitly infer control from perception. However, it is difficult to guarantee stability and robustness for these systems since they are often exposed to unstructured, high-dimensional, and complex observation spaces (e.g., autonomous driving from a stream of pixel inputs). We propose to leverage control Lyapunov functions (CLFs) to equip end-to-end vision-based policies with stability properties and introduce stability attention in CLFs (att-CLFs) to tackle environmental changes and improve learning flexibility. We also present an uncertainty propagation technique that is tightly integrated into att-CLFs. We demonstrate the effectiveness of att-CLFs via comparison with classical CLFs, model predictive control, and vanilla end-to-end learning in a photo-realistic simulator and on a real full-scale autonomous vehicle.
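For context, here is a hedged sketch of the vanilla CLF decrease condition used as a differentiable training penalty; it illustrates only the plain CLF part, while the paper's stability-attention mechanism (att-CLF) and its uncertainty propagation are not reproduced:

```python
import torch

def clf_violation(V, closed_loop_dynamics, x, alpha=1.0):
    """Penalise violations of the CLF condition dV/dt + alpha * V(x) <= 0
    along the closed-loop dynamics, pushing the learned policy toward
    stabilising behaviour."""
    x = x.detach().clone().requires_grad_(True)
    v = V(x).squeeze(-1)                                        # (B,)
    grad_v = torch.autograd.grad(v.sum(), x, create_graph=True)[0]
    v_dot = (grad_v * closed_loop_dynamics(x)).sum(dim=-1)      # (B,)
    return torch.relu(v_dot + alpha * v).mean()
```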
Accelerating Trajectory Generation for Quadrotors Using Transformers
Pub Date: 2023-03-27 | DOI: 10.48550/arXiv.2303.15606
Srinath Tankasala, M. Pryor
In this work, we address the problem of computation time for trajectory generation for quadrotors. Most trajectory generation methods for waypoint navigation of quadrotors, for example minimum-snap/jerk and minimum-time methods, are structured as bi-level optimizations. The first level allocates time across all input waypoints, and the second minimizes the snap/jerk of the trajectory under that time allocation. Such an optimization can be computationally expensive to solve. In our approach, we treat trajectory generation as a supervised learning problem over sequential inputs and outputs. We adapt a transformer model to learn the optimal time allocations for a given set of input waypoints, turning trajectory generation into a single-step optimization. We demonstrate the performance of the transformer model by training it to predict the time allocations for a minimum-snap trajectory generator. The trained transformer predicts accurate time allocations with fewer data samples and a smaller model size than a feedforward network (FFN), demonstrating that it captures the sequential nature of the waypoint navigation problem.
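A hedged architectural sketch (layer sizes, the softplus output head, and the input format are assumptions, not the paper's exact model) of mapping a waypoint sequence to per-segment time allocations, which can then be fed to a standard minimum-snap solver:

```python
import torch
import torch.nn as nn

class TimeAllocationTransformer(nn.Module):
    """Map N waypoints (x, y, z) to N-1 positive segment durations,
    replacing the outer time-allocation level of the bi-level
    minimum-snap optimization with a single forward pass."""
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, waypoints):                      # (B, N, 3)
        h = self.encoder(self.embed(waypoints))        # (B, N, d_model)
        dt = nn.functional.softplus(self.head(h))      # positive durations
        return dt.squeeze(-1)[:, 1:]                   # (B, N-1), one per segment
```

Training pairs of waypoints and optimal time allocations would come from solving the original bi-level problem offline, matching the supervised-learning framing in the abstract.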
Compositional Neural Certificates for Networked Dynamical Systems
Pub Date: 2023-03-25 | DOI: 10.48550/arXiv.2303.14564
Songyuan Zhang, Yumeng Xiu, Guannan Qu, Chuchu Fan
Developing stable controllers for large-scale networked dynamical systems is crucial but has long been challenging due to two key obstacles: certifiability and scalability. In this paper, we present a general framework to address these challenges using compositional neural certificates based on ISS (Input-to-State Stability) Lyapunov functions. Specifically, we treat a large networked dynamical system as an interconnection of smaller subsystems and develop methods that find, for each subsystem, a decentralized controller and an ISS Lyapunov function; the latter can be composed collectively to prove global stability of the system. To ensure the scalability of our approach, we develop generalizable and robust ISS Lyapunov functions, where a single function can be used across different subsystems and the certificates produced for small systems can be generalized to large systems with similar structures. We encode both the ISS Lyapunov functions and the controllers as neural networks and propose a novel training methodology to handle the logic in the ISS Lyapunov conditions that encodes the interconnection with neighboring subsystems. We demonstrate our approach on systems including platooning, drone formation control, and power systems. Experimental results show that our framework can reduce tracking error by up to 75% compared with RL algorithms when applied to large-scale networked systems.
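A hedged sketch of one common compositional, ISS-style training penalty for a single subsystem (the gain `gamma`, the gating form, and the interfaces are assumptions; the exact conditions and training logic in the paper may differ):

```python
import torch

def iss_violation(V_i, f_i, controller_i, x_i, x_nbr, v_nbr_max,
                  gamma=0.5, alpha=1.0):
    """Whenever the local certificate dominates the neighbours' certificates
    (V_i(x_i) >= gamma * max_j V_j(x_j)), require V_i to decrease along the
    closed-loop dynamics of subsystem i."""
    x_i = x_i.detach().clone().requires_grad_(True)
    v_i = V_i(x_i).squeeze(-1)                                  # (B,)
    grad_v = torch.autograd.grad(v_i.sum(), x_i, create_graph=True)[0]
    u_i = controller_i(x_i, x_nbr)
    v_dot = (grad_v * f_i(x_i, u_i, x_nbr)).sum(dim=-1)         # (B,)
    gate = (v_i >= gamma * v_nbr_max).float()                   # small-gain-style condition
    return (gate * torch.relu(v_dot + alpha * v_i)).mean()
```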
Policy Evaluation in Distributional LQR
Pub Date: 2023-03-23 | DOI: 10.48550/arXiv.2303.13657
Zifan Wang, Yulong Gao, Si Wang, M. Zavlanos, A. Abate, K. Johansson
Distributional reinforcement learning (DRL) enhances our understanding of the effects of randomness in the environment by letting agents learn the distribution of the random return, rather than its expected value as in standard RL. At the same time, a main challenge is that policy evaluation in DRL typically relies on a representation of the return distribution, which needs to be carefully designed. In this paper, we address this challenge for a special class of DRL problems that rely on the linear quadratic regulator (LQR) for control, advocating a new distributional approach to LQR, which we call distributional LQR. Specifically, we provide a closed-form expression for the distribution of the random return which, remarkably, is applicable to all exogenous disturbances on the dynamics, as long as they are independent and identically distributed (i.i.d.). While the exact return distribution consists of infinitely many random variables, we show that it can be approximated by a finite number of random variables, and the associated approximation error can be analytically bounded under mild assumptions. Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR using the Conditional Value at Risk (CVaR) as the measure of risk. Numerical experiments are provided to illustrate our theoretical results.
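A hedged sketch of a zeroth-order, CVaR-based policy gradient step for a static feedback gain K (the two-point estimator, sampling sizes, and cost interface are assumptions; the paper's estimator built on the approximate return distribution may differ):

```python
import numpy as np

def cvar(costs, alpha=0.1):
    """CVaR_alpha of a cost sample: the mean of the worst alpha-fraction."""
    k = max(1, int(np.ceil(alpha * len(costs))))
    return float(np.mean(np.sort(costs)[-k:]))

def cvar_zeroth_order_grad(K, sample_cost, sigma=0.05, n_dirs=8,
                           n_samples=64, alpha=0.1, rng=None):
    """Two-point finite-difference estimate of d CVaR / d K, where
    sample_cost(K) returns one noisy rollout cost of the closed loop
    u = -K x under i.i.d. disturbances."""
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(K)
    for _ in range(n_dirs):
        U = rng.standard_normal(K.shape)
        c_plus = cvar([sample_cost(K + sigma * U) for _ in range(n_samples)], alpha)
        c_minus = cvar([sample_cost(K - sigma * U) for _ in range(n_samples)], alpha)
        grad += (c_plus - c_minus) / (2.0 * sigma) * U
    return grad / n_dirs
```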
Hybrid Systems Neural Control with Region-of-Attraction Planner
Pub Date: 2023-03-18 | DOI: 10.48550/arXiv.2303.10327
Yue Meng, Chuchu Fan
Hybrid systems are prevalent in robotics. However, ensuring the stability of hybrid systems is challenging due to their sophisticated continuous and discrete dynamics; a hybrid system can be unstable even when every individual mode is stable. Hence, special treatment is required at mode switches to stabilize the system. In this work, we propose a hierarchical, neural network (NN)-based method to control general hybrid systems. For each system mode, we first learn an NN Lyapunov function and an NN controller to ensure that states within the region of attraction (RoA) can be stabilized. Then an RoA NN estimator is learned across the different modes. Upon mode switching, we propose a differentiable planner to ensure that the states after switching land in the next mode's RoA, hence stabilizing the hybrid system. We provide novel theoretical stability guarantees and conduct experiments in car tracking control, pogobot navigation, and bipedal walker locomotion. Our method requires only 0.25x the training time needed by other learning-based methods. With low running time (10-50x faster than model predictive control (MPC)), our controller achieves a higher stability/success rate than other baselines such as MPC, reinforcement learning (RL), common Lyapunov methods (CLF), the linear quadratic regulator (LQR), quadratic programming (QP), and Hamilton-Jacobi-based methods (HJB). The project page is at https://mit-realm.github.io/hybrid-clf.
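A hedged PyTorch sketch of the mode-switch step (the gradient-based adjustment, the sign convention that roa_next(x) > 0 means "inside the RoA", and all names are assumptions standing in for the paper's differentiable planner):

```python
import torch

def plan_switch_state(x_pre, jump_map, roa_next, steps=100, lr=0.05):
    """Adjust the pre-switch state so that, after the discrete jump map of
    the mode transition, the state lands inside the next mode's learned
    region of attraction, where that mode's controller can stabilise it."""
    x = x_pre.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        violation = torch.relu(-roa_next(jump_map(x))).sum()   # > 0 if outside the RoA
        opt.zero_grad()
        violation.backward()
        opt.step()
    return x.detach()
```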