ISAACS: Iterative Soft Adversarial Actor-Critic for Safety
Pub Date: 2022-12-06 | DOI: 10.48550/arXiv.2212.03228
Kai Hsu, D. Nguyen, J. Fisac
The deployment of robots in uncontrolled environments requires them to operate robustly under previously unseen scenarios, like irregular terrain and wind conditions. Unfortunately, while rigorous safety frameworks from robust optimal control theory scale poorly to high-dimensional nonlinear dynamics, control policies computed by more tractable "deep" methods lack guarantees and tend to exhibit little robustness to uncertain operating conditions. This work introduces a novel approach enabling scalable synthesis of robust safety-preserving controllers for robotic systems with general nonlinear dynamics subject to bounded modeling error by combining game-theoretic safety analysis with adversarial reinforcement learning in simulation. Following a soft actor-critic scheme, a safety-seeking fallback policy is co-trained with an adversarial "disturbance" agent that aims to invoke the worst-case realization of model error and training-to-deployment discrepancy allowed by the designer's uncertainty. While the learned control policy does not intrinsically guarantee safety, it is used to construct a real-time safety filter (or shield) with robust safety guarantees based on forward reachability rollouts. This shield can be used in conjunction with a safety-agnostic control policy, precluding any task-driven actions that could result in loss of safety. We evaluate our learning-based safety approach in a 5D race car simulator, compare the learned safety policy to the numerically obtained optimal solution, and empirically validate the robust safety guarantee of our proposed safety shield against worst-case model discrepancy.
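For concreteness, the shielding logic described above can be sketched as a forward rollout check: a task-driven action is allowed only if, from the state it leads to, the learned fallback policy stays failure-free against the learned adversarial disturbance over a finite horizon. The sketch below is illustrative only; `dynamics_step`, `fallback_policy`, `worst_case_disturbance`, `failure_check`, and the horizon are assumed placeholders, not the authors' implementation.

```python
import numpy as np

def is_safe(state, failure_check, fallback_policy, worst_case_disturbance,
            dynamics_step, horizon=50):
    """Forward-reachability rollout: play the safety-seeking fallback policy
    against the adversarial disturbance and check for failures."""
    x = np.array(state, dtype=float)
    for _ in range(horizon):
        if failure_check(x):                    # e.g. collision or constraint violation
            return False
        u = fallback_policy(x)                  # learned safety-seeking actor
        d = worst_case_disturbance(x, u)        # learned adversarial disturbance
        x = dynamics_step(x, u, d)              # nominal model with bounded error
    return True

def shield(state, task_action, failure_check, fallback_policy,
           worst_case_disturbance, dynamics_step):
    """Safety filter: pass the task-driven action through only if the fallback
    policy can still keep the resulting state safe; otherwise override."""
    d = worst_case_disturbance(state, task_action)
    x_next = dynamics_step(state, task_action, d)
    if is_safe(x_next, failure_check, fallback_policy,
               worst_case_disturbance, dynamics_step):
        return task_action
    return fallback_policy(state)
```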
{"title":"ISAACS: Iterative Soft Adversarial Actor-Critic for Safety","authors":"Kai Hsu, D. Nguyen, J. Fisac","doi":"10.48550/arXiv.2212.03228","DOIUrl":"https://doi.org/10.48550/arXiv.2212.03228","url":null,"abstract":"The deployment of robots in uncontrolled environments requires them to operate robustly under previously unseen scenarios, like irregular terrain and wind conditions. Unfortunately, while rigorous safety frameworks from robust optimal control theory scale poorly to high-dimensional nonlinear dynamics, control policies computed by more tractable\"deep\"methods lack guarantees and tend to exhibit little robustness to uncertain operating conditions. This work introduces a novel approach enabling scalable synthesis of robust safety-preserving controllers for robotic systems with general nonlinear dynamics subject to bounded modeling error by combining game-theoretic safety analysis with adversarial reinforcement learning in simulation. Following a soft actor-critic scheme, a safety-seeking fallback policy is co-trained with an adversarial\"disturbance\"agent that aims to invoke the worst-case realization of model error and training-to-deployment discrepancy allowed by the designer's uncertainty. While the learned control policy does not intrinsically guarantee safety, it is used to construct a real-time safety filter (or shield) with robust safety guarantees based on forward reachability rollouts. This shield can be used in conjunction with a safety-agnostic control policy, precluding any task-driven actions that could result in loss of safety. We evaluate our learning-based safety approach in a 5D race car simulator, compare the learned safety policy to the numerically obtained optimal solution, and empirically validate the robust safety guarantee of our proposed safety shield against worst-case model discrepancy.","PeriodicalId":268449,"journal":{"name":"Conference on Learning for Dynamics & Control","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116337380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Saddle Point Tracking with Decision-Dependent Data
Pub Date: 2022-12-06 | DOI: 10.48550/arXiv.2212.02693
Killian Wood, E. Dall’Anese
In this work, we consider a time-varying stochastic saddle point problem in which the objective is revealed sequentially and the data distribution depends on the decision variables. Problems of this type express the distributional dependence via a distributional map and are known to have two distinct types of solutions: saddle points and equilibrium points. We demonstrate that, under suitable conditions, online primal-dual type algorithms are capable of tracking equilibrium points. However, since computing the closed-form gradient of the objective requires knowledge of the distributional map, we offer an online stochastic primal-dual algorithm for tracking equilibrium trajectories. We provide bounds in expectation and in high probability, with the latter leveraging a sub-Weibull model for the gradient error. We illustrate our results on an electric vehicle charging problem where responsiveness to prices follows a distributional map based on a location-scale family.
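A generic online stochastic primal-dual iteration with decision-dependent sampling looks roughly as follows. This is a minimal sketch under assumed placeholder interfaces (`sample_data`, `grad_x`, `grad_lam`, a constant step size, and a nonnegativity constraint on the dual variable), not the algorithm analyzed in the paper.

```python
import numpy as np

def online_stochastic_primal_dual(x0, lam0, sample_data, grad_x, grad_lam,
                                  eta=0.05, steps=200):
    """At each round, data is drawn from a distribution that depends on the
    current decision x; the primal variable takes a stochastic gradient
    descent step and the dual variable a projected ascent step."""
    x, lam = np.asarray(x0, float), np.asarray(lam0, float)
    trajectory = []
    for _ in range(steps):
        data = sample_data(x)                              # decision-dependent sample
        x = x - eta * grad_x(x, lam, data)                 # primal descent
        lam = np.maximum(lam + eta * grad_lam(x, lam, data), 0.0)  # dual ascent + projection
        trajectory.append((x.copy(), lam.copy()))
    return trajectory
```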
{"title":"Online Saddle Point Tracking with Decision-Dependent Data","authors":"Killian Wood, E. Dall’Anese","doi":"10.48550/arXiv.2212.02693","DOIUrl":"https://doi.org/10.48550/arXiv.2212.02693","url":null,"abstract":"In this work, we consider a time-varying stochastic saddle point problem in which the objective is revealed sequentially, and the data distribution depends on the decision variables. Problems of this type express the distributional dependence via a distributional map, and are known to have two distinct types of solutions--saddle points and equilibrium points. We demonstrate that, under suitable conditions, online primal-dual type algorithms are capable of tracking equilibrium points. In contrast, since computing closed-form gradient of the objective requires knowledge of the distributional map, we offer an online stochastic primal-dual algorithm for tracking equilibrium trajectories. We provide bounds in expectation and in high probability, with the latter leveraging a sub-Weibull model for the gradient error. We illustrate our results on an electric vehicle charging problem where responsiveness to prices follows a location-scale family based distributional map.","PeriodicalId":268449,"journal":{"name":"Conference on Learning for Dynamics & Control","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132421756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DiffTune+: Hyperparameter-Free Auto-Tuning using Auto-Differentiation
Pub Date: 2022-12-06 | DOI: 10.48550/arXiv.2212.03194
Sheng Cheng, Lin Song, Minkyung Kim, Shenlong Wang, N. Hovakimyan
Controller tuning is a vital step to ensure the controller delivers its designed performance. DiffTune has been proposed as an automatic tuning method that unrolls the dynamical system and controller into a computational graph and uses auto-differentiation to obtain the gradient for the controller's parameter update. However, DiffTune uses vanilla gradient descent to iteratively update the parameters, whose performance largely depends on the choice of the learning rate (as a hyperparameter). In this paper, we propose to use hyperparameter-free methods to update the controller parameters. We find the optimal parameter update by maximizing the loss reduction, where a predicted loss based on the approximated state and control is used for the maximization. Two methods are proposed to optimally update the parameters and are compared with related variants in simulations on a Dubins car and a quadrotor. Simulation experiments show that the proposed first-order method outperforms the hyperparameter-based methods and is more robust than the second-order hyperparameter-free methods.
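The idea of a hyperparameter-free first-order update can be illustrated by choosing, along the descent direction, the step that maximizes a predicted loss reduction within a bounded step size. This sketch only approximates the idea under assumed names (`predicted_loss`, `trust_radius`); it is not the update rule derived in the paper.

```python
import numpy as np

def hyperparameter_free_step(theta, grad, predicted_loss, trust_radius=0.1):
    """Search along the negative-gradient direction for the step that yields
    the largest predicted loss reduction, with the step length bounded
    instead of set by a hand-tuned learning rate."""
    direction = -grad / (np.linalg.norm(grad) + 1e-12)
    base = predicted_loss(theta)
    best_delta, best_reduction = np.zeros_like(theta), 0.0
    for alpha in np.linspace(0.0, trust_radius, 21)[1:]:
        delta = alpha * direction
        reduction = base - predicted_loss(theta + delta)   # predicted improvement
        if reduction > best_reduction:
            best_delta, best_reduction = delta, reduction
    return theta + best_delta
```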
{"title":"DiffTune+: Hyperparameter-Free Auto-Tuning using Auto-Differentiation","authors":"Sheng Cheng, Lin Song, Minkyung Kim, Shenlong Wang, N. Hovakimyan","doi":"10.48550/arXiv.2212.03194","DOIUrl":"https://doi.org/10.48550/arXiv.2212.03194","url":null,"abstract":"Controller tuning is a vital step to ensure the controller delivers its designed performance. DiffTune has been proposed as an automatic tuning method that unrolls the dynamical system and controller into a computational graph and uses auto-differentiation to obtain the gradient for the controller's parameter update. However, DiffTune uses the vanilla gradient descent to iteratively update the parameter, in which the performance largely depends on the choice of the learning rate (as a hyperparameter). In this paper, we propose to use hyperparameter-free methods to update the controller parameters. We find the optimal parameter update by maximizing the loss reduction, where a predicted loss based on the approximated state and control is used for the maximization. Two methods are proposed to optimally update the parameters and are compared with related variants in simulations on a Dubin's car and a quadrotor. Simulation experiments show that the proposed first-order method outperforms the hyperparameter-based methods and is more robust than the second-order hyperparameter-free methods.","PeriodicalId":268449,"journal":{"name":"Conference on Learning for Dynamics & Control","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122997807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Physics-Informed Model-Based Reinforcement Learning
Pub Date: 2022-12-05 | DOI: 10.48550/arXiv.2212.02179
Adithya Ramesh, Balaraman Ravindran
We apply reinforcement learning (RL) to robotics tasks. One of the drawbacks of traditional RL algorithms has been their poor sample efficiency. One approach to improve the sample efficiency is model-based RL. In our model-based RL algorithm, we learn a model of the environment, essentially its transition dynamics and reward function, use it to generate imaginary trajectories, and backpropagate through them to update the policy, exploiting the differentiability of the model. Intuitively, learning more accurate models should lead to better model-based RL performance. Recently, there has been growing interest in developing better deep neural network-based dynamics models for physical systems by utilizing the structure of the underlying physics. We focus on robotic systems undergoing rigid body motion without contacts. We compare two versions of our model-based RL algorithm: one that uses a standard deep neural network-based dynamics model and one that uses a much more accurate, physics-informed neural network-based dynamics model. We show that, in model-based RL, model accuracy mainly matters in environments that are sensitive to initial conditions, where numerical errors accumulate fast. In these environments, the physics-informed version of our algorithm achieves significantly better average return and sample efficiency. In environments that are not sensitive to initial conditions, both versions of our algorithm achieve similar average return, while the physics-informed version achieves better sample efficiency. We also show that, in challenging environments, physics-informed model-based RL achieves better average return than state-of-the-art model-free RL algorithms such as Soft Actor-Critic, as it computes the policy gradient analytically, while the latter estimates it through sampling.
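The core update, backpropagating through imagined trajectories of a learned differentiable model, can be sketched as below. The interfaces (`policy`, `model`, `reward_fn` as torch callables, the horizon, and a batched initial state) are assumptions for illustration rather than the paper's implementation.

```python
import torch

def update_policy_through_model(policy, model, reward_fn, x0, optimizer, horizon=20):
    """Analytic policy gradient: roll the differentiable learned model forward
    under the policy, sum the predicted rewards, and backpropagate the
    negative imagined return into the policy parameters."""
    x = x0                                           # (batch, state_dim)
    total_return = torch.zeros(x0.shape[0], device=x0.device)
    for _ in range(horizon):
        u = policy(x)                                # differentiable policy
        x = model(x, u)                              # learned (e.g. physics-informed) dynamics
        total_return = total_return + reward_fn(x, u)
    loss = -total_return.mean()
    optimizer.zero_grad()
    loss.backward()                                  # gradient flows through the rollout
    optimizer.step()
    return -loss.item()
```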
{"title":"Physics-Informed Model-Based Reinforcement Learning","authors":"Adithya Ramesh, Balaraman Ravindran","doi":"10.48550/arXiv.2212.02179","DOIUrl":"https://doi.org/10.48550/arXiv.2212.02179","url":null,"abstract":"We apply reinforcement learning (RL) to robotics tasks. One of the drawbacks of traditional RL algorithms has been their poor sample efficiency. One approach to improve the sample efficiency is model-based RL. In our model-based RL algorithm, we learn a model of the environment, essentially its transition dynamics and reward function, use it to generate imaginary trajectories and backpropagate through them to update the policy, exploiting the differentiability of the model. Intuitively, learning more accurate models should lead to better model-based RL performance. Recently, there has been growing interest in developing better deep neural network based dynamics models for physical systems, by utilizing the structure of the underlying physics. We focus on robotic systems undergoing rigid body motion without contacts. We compare two versions of our model-based RL algorithm, one which uses a standard deep neural network based dynamics model and the other which uses a much more accurate, physics-informed neural network based dynamics model. We show that, in model-based RL, model accuracy mainly matters in environments that are sensitive to initial conditions, where numerical errors accumulate fast. In these environments, the physics-informed version of our algorithm achieves significantly better average-return and sample efficiency. In environments that are not sensitive to initial conditions, both versions of our algorithm achieve similar average-return, while the physics-informed version achieves better sample efficiency. We also show that, in challenging environments, physics-informed model-based RL achieves better average-return than state-of-the-art model-free RL algorithms such as Soft Actor-Critic, as it computes the policy-gradient analytically, while the latter estimates it through sampling.","PeriodicalId":268449,"journal":{"name":"Conference on Learning for Dynamics & Control","volume":"os-48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127788590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predictive safety filter using system level synthesis
Pub Date: 2022-12-05 | DOI: 10.3929/ethz-b-000615512
Antoine P. Leeman, Johannes Köhler, S. Bennani, M. Zeilinger
Safety filters provide modular techniques to augment potentially unsafe control inputs (e.g., from learning-based controllers or humans) with safety guarantees in the form of constraint satisfaction. In this paper, we present an improved model predictive safety filter (MPSF) formulation, which incorporates system level synthesis techniques in the design. The resulting SL-MPSF scheme ensures safety for linear systems subject to bounded disturbances in an enlarged safe set. It requires less severe and less frequent modifications of potentially unsafe control inputs than existing MPSF formulations to certify safety. In addition, we propose an explicit variant of the SL-MPSF formulation, which maintains scalability and reduces the required online computational effort, the main drawback of the MPSF. The benefits of the proposed system level safety filter formulations compared to state-of-the-art MPSF formulations are demonstrated using a numerical example.
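As a point of reference, a plain (non-SLS) model predictive safety filter for a linear system can be written as a small quadratic program: minimally modify the proposed input so that a nominal prediction satisfies tightened constraints. The cvxpy sketch below assumes given bounds `x_max` and `u_max`, and it omits the system level synthesis machinery and terminal set that the paper's SL-MPSF formulation adds.

```python
import cvxpy as cp

def model_predictive_safety_filter(x0, u_proposed, A, B, x_max, u_max, N=10):
    """Generic MPSF sketch: penalize deviation from the proposed first input
    subject to nominal linear dynamics and box constraints."""
    nx, nu = A.shape[0], B.shape[1]
    x = cp.Variable((nx, N + 1))
    u = cp.Variable((nu, N))
    cost = cp.sum_squares(u[:, 0] - u_proposed)
    constraints = [x[:, 0] == x0]
    for k in range(N):
        constraints += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],
                        cp.abs(x[:, k + 1]) <= x_max,
                        cp.abs(u[:, k]) <= u_max]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u[:, 0].value    # filtered input applied to the system
```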
{"title":"Predictive safety filter using system level synthesis","authors":"Antoine P. Leeman, Johannes Köhler, S. Bennani, M. Zeilinger","doi":"10.3929/ethz-b-000615512","DOIUrl":"https://doi.org/10.3929/ethz-b-000615512","url":null,"abstract":"Safety filters provide modular techniques to augment potentially unsafe control inputs (e.g. from learning-based controllers or humans) with safety guarantees in the form of constraint satisfaction. In this paper, we present an improved model predictive safety filter (MPSF) formulation, which incorporates system level synthesis techniques in the design. The resulting SL-MPSF scheme ensures safety for linear systems subject to bounded disturbances in an enlarged safe set. It requires less severe and frequent modifications of potentially unsafe control inputs compared to existing MPSF formulations to certify safety. In addition, we propose an explicit variant of the SL-MPSF formulation, which maintains scalability, and reduces the required online computational effort - the main drawback of the MPSF. The benefits of the proposed system level safety filter formulations compared to state-of-the-art MPSF formulations are demonstrated using a numerical example.","PeriodicalId":268449,"journal":{"name":"Conference on Learning for Dynamics & Control","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134418953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Reinforcement Learning Look at Risk-Sensitive Linear Quadratic Gaussian Control
Pub Date: 2022-12-05 | DOI: 10.48550/arXiv.2212.02072
Leilei Cui, Zhong-Ping Jiang
This paper proposes a novel robust reinforcement learning framework for discrete-time systems with model mismatch that may arise from the sim2real gap. A key strategy is to invoke advanced techniques from control theory. Using the formulation of classical risk-sensitive linear quadratic Gaussian control, a dual-loop policy iteration algorithm is proposed to generate a robust optimal controller. The dual-loop policy iteration algorithm is shown to be globally exponentially and uniformly convergent, and robust against disturbances during the learning process. This robustness property is called small-disturbance input-to-state stability and guarantees that the proposed policy iteration algorithm converges to a small neighborhood of the optimal controller as long as the disturbance at each learning step is small. In addition, when the system dynamics are unknown, a novel model-free off-policy policy iteration algorithm is proposed for the same class of dynamical systems with additive Gaussian noise. Finally, numerical examples are provided to demonstrate the proposed algorithm.
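For orientation, the standard (risk-neutral) policy iteration for discrete-time LQR, which the dual-loop algorithm generalizes, is sketched below with NumPy/SciPy; it covers only an outer-loop-style iteration, includes none of the risk-sensitive or model-free machinery of the paper, and assumes an initial stabilizing gain.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_policy_iteration(A, B, Q, R, K0, iters=50):
    """Policy iteration for discrete-time LQR with control law u = -K x.
    K0 must be stabilizing for the policy-evaluation step to be well posed."""
    K = K0.copy()
    for _ in range(iters):
        Acl = A - B @ K
        # Policy evaluation: P = Q + K'RK + Acl' P Acl  (discrete Lyapunov equation)
        P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
        # Policy improvement
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P
```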
{"title":"A Reinforcement Learning Look at Risk-Sensitive Linear Quadratic Gaussian Control","authors":"Leilei Cui, Zhong-Ping Jiang","doi":"10.48550/arXiv.2212.02072","DOIUrl":"https://doi.org/10.48550/arXiv.2212.02072","url":null,"abstract":"This paper proposes a novel robust reinforcement learning framework for discrete-time systems with model mismatch that may arise from the sim2real gap. A key strategy is to invoke advanced techniques from control theory. Using the formulation of the classical risk-sensitive linear quadratic Gaussian control, a dual-loop policy iteration algorithm is proposed to generate a robust optimal controller. The dual-loop policy iteration algorithm is shown to be globally exponentially and uniformly convergent, and robust against disturbance during the learning process. This robustness property is called small-disturbance input-to-state stability and guarantees that the proposed policy iteration algorithm converges to a small neighborhood of the optimal controller as long as the disturbance at each learning step is small. In addition, when the system dynamics is unknown, a novel model-free off-policy policy iteration algorithm is proposed for the same class of dynamical system with additive Gaussian noise. Finally, numerical examples are provided for the demonstration of the proposed algorithm.","PeriodicalId":268449,"journal":{"name":"Conference on Learning for Dynamics & Control","volume":"412 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127598683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributionally Robust Lyapunov Function Search Under Uncertainty
Pub Date: 2022-12-03 | DOI: 10.48550/arXiv.2212.01554
Kehan Long, Yinzhuang Yi, J. Cortés, Nikolay A. Atanasov
This paper develops methods for proving Lyapunov stability of dynamical systems subject to disturbances with an unknown distribution. We assume only a finite set of disturbance samples is available and that the true online disturbance realization may be drawn from a different distribution than the given samples. We formulate an optimization problem to search for a sum-of-squares (SOS) Lyapunov function and introduce a distributionally robust version of the Lyapunov function derivative constraint. We show that this constraint may be reformulated as several SOS constraints, ensuring that the search for a Lyapunov function remains in the class of SOS polynomial optimization problems. For general systems, we provide a distributionally robust chance-constrained formulation for neural network Lyapunov function search. Simulations demonstrate the validity and efficiency of both formulations on nonlinear uncertain dynamical systems.
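To fix notation, a sample-based SOS Lyapunov search (without the distributional robustness) can be written as the following feasibility problem, where $\Sigma[x]$ is the cone of SOS polynomials, $f(x, w)$ the disturbed dynamics, $\{\hat{w}_i\}_{i=1}^N$ the disturbance samples, and $\epsilon > 0$ a small constant; the paper's contribution, described here only in words, is to replace the per-sample decrease condition with a constraint that is robust over an ambiguity set of distributions centered at the empirical one.

```latex
\begin{align}
  &\text{find } V \in \Sigma[x], \qquad V(0) = 0, \\
  &V(x) - \epsilon \|x\|^2 \in \Sigma[x], \\
  &-\nabla V(x)^\top f(x, \hat{w}_i) - \epsilon \|x\|^2 \in \Sigma[x],
   \qquad i = 1, \dots, N .
\end{align}
```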
{"title":"Distributionally Robust Lyapunov Function Search Under Uncertainty","authors":"Kehan Long, Yinzhuang Yi, J. Cortés, Nikolay A. Atanasov","doi":"10.48550/arXiv.2212.01554","DOIUrl":"https://doi.org/10.48550/arXiv.2212.01554","url":null,"abstract":"This paper develops methods for proving Lyapunov stability of dynamical systems subject to disturbances with an unknown distribution. We assume only a finite set of disturbance samples is available and that the true online disturbance realization may be drawn from a different distribution than the given samples. We formulate an optimization problem to search for a sum-of-squares (SOS) Lyapunov function and introduce a distributionally robust version of the Lyapunov function derivative constraint. We show that this constraint may be reformulated as several SOS constraints, ensuring that the search for a Lyapunov function remains in the class of SOS polynomial optimization problems. For general systems, we provide a distributionally robust chance-constrained formulation for neural network Lyapunov function search. Simulations demonstrate the validity and efficiency of either formulation on non-linear uncertain dynamical systems.","PeriodicalId":268449,"journal":{"name":"Conference on Learning for Dynamics & Control","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117123010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Policy Learning for Active Target Tracking over Continuous SE(3) Trajectories
Pub Date: 2022-12-03 | DOI: 10.48550/arXiv.2212.01498
Pengzhi Yang, Shumon Koga, Arash Asgharivaskasi, Nikolay A. Atanasov
This paper proposes a novel model-based policy gradient algorithm for tracking dynamic targets using a mobile robot equipped with an onboard sensor with a limited field of view. The task is to obtain a continuous control policy for the mobile robot to collect sensor measurements that reduce uncertainty in the target states, as measured by the target distribution entropy. We design a neural network control policy that takes the robot $SE(3)$ pose and the mean vector and information matrix of the joint target distribution as inputs, with attention layers to handle a variable number of targets. We also derive the gradient of the target entropy with respect to the network parameters explicitly, allowing efficient model-based policy gradient optimization.
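A policy network of this shape can be sketched as a pose-conditioned attention readout over per-target features. The PyTorch sketch below is illustrative: the feature dimensions, the flattening of the $SE(3)$ pose and information matrix, and the layer sizes are assumptions, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class AttentionTrackingPolicy(nn.Module):
    """Control policy that attends over a variable number of targets."""

    def __init__(self, pose_dim=12, target_dim=12, embed_dim=64, ctrl_dim=6):
        super().__init__()
        self.pose_enc = nn.Linear(pose_dim, embed_dim)      # flattened SE(3) pose
        self.target_enc = nn.Linear(target_dim, embed_dim)  # per-target mean + info features
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(),
                                  nn.Linear(64, ctrl_dim))

    def forward(self, robot_pose, target_feats):
        # robot_pose: (B, pose_dim); target_feats: (B, n_targets, target_dim)
        query = self.pose_enc(robot_pose).unsqueeze(1)   # (B, 1, E)
        keys = self.target_enc(target_feats)             # (B, n_targets, E)
        context, _ = self.attn(query, keys, keys)        # attention over targets
        return self.head(context.squeeze(1))             # (B, ctrl_dim) continuous control
```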
{"title":"Policy Learning for Active Target Tracking over Continuous SE(3) Trajectories","authors":"Pengzhi Yang, Shumon Koga, Arash Asgharivaskasi, Nikolay A. Atanasov","doi":"10.48550/arXiv.2212.01498","DOIUrl":"https://doi.org/10.48550/arXiv.2212.01498","url":null,"abstract":"This paper proposes a novel model-based policy gradient algorithm for tracking dynamic targets using a mobile robot, equipped with an onboard sensor with limited field of view. The task is to obtain a continuous control policy for the mobile robot to collect sensor measurements that reduce uncertainty in the target states, measured by the target distribution entropy. We design a neural network control policy with the robot $SE(3)$ pose and the mean vector and information matrix of the joint target distribution as inputs and attention layers to handle variable numbers of targets. We also derive the gradient of the target entropy with respect to the network parameters explicitly, allowing efficient model-based policy gradient optimization.","PeriodicalId":268449,"journal":{"name":"Conference on Learning for Dynamics & Control","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115616165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Estimation of the Koopman Operator Using Fourier Features
Pub Date: 2022-12-03 | DOI: 10.48550/arXiv.2212.01503
Tahiya Salam, Alice K. Li, M. Hsieh
Transfer operators offer linear representations and global, physically meaningful features of nonlinear dynamical systems. Discovering transfer operators, such as the Koopman operator, requires carefully crafted dictionaries of observables that act on the states of the dynamical system. This process is ad hoc and requires the full dataset for evaluation. In this paper, we offer an optimization scheme that allows joint learning of the observables and the Koopman operator with online data. Our results show that we are able to reconstruct the evolution and represent the global features of complex dynamical systems.
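One simple way to realize such an online scheme, with the caveat that the paper also adapts the features themselves whereas this sketch keeps them fixed, is to combine random Fourier features with a recursive least-squares update of the Koopman matrix; all names and defaults below are illustrative assumptions.

```python
import numpy as np

class OnlineKoopmanRFF:
    """Online Koopman estimate K with phi(x_next) ~ K phi(x), where phi is a
    fixed random Fourier feature map and K is updated by recursive least squares."""

    def __init__(self, state_dim, n_features=100, bandwidth=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=1.0 / bandwidth, size=(n_features, state_dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, n_features)
        self.K = np.zeros((n_features, n_features))   # Koopman matrix estimate
        self.P = 1e3 * np.eye(n_features)             # RLS covariance

    def features(self, x):
        return np.sqrt(2.0 / len(self.b)) * np.cos(self.W @ x + self.b)

    def update(self, x, x_next):
        """One streaming update from the observed transition (x, x_next)."""
        phi, phi_next = self.features(x), self.features(x_next)
        Pphi = self.P @ phi
        gain = Pphi / (1.0 + phi @ Pphi)
        self.K += np.outer(phi_next - self.K @ phi, gain)
        self.P -= np.outer(gain, Pphi)

    def predict_lifted(self, x):
        return self.K @ self.features(x)   # predicted next lifted state
```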
{"title":"Online Estimation of the Koopman Operator Using Fourier Features","authors":"Tahiya Salam, Alice K. Li, M. Hsieh","doi":"10.48550/arXiv.2212.01503","DOIUrl":"https://doi.org/10.48550/arXiv.2212.01503","url":null,"abstract":"Transfer operators offer linear representations and global, physically meaningful features of nonlinear dynamical systems. Discovering transfer operators, such as the Koopman operator, require careful crafted dictionaries of observables, acting on states of the dynamical system. This is ad hoc and requires the full dataset for evaluation. In this paper, we offer an optimization scheme to allow joint learning of the observables and Koopman operator with online data. Our results show we are able to reconstruct the evolution and represent the global features of complex dynamical systems.","PeriodicalId":268449,"journal":{"name":"Conference on Learning for Dynamics & Control","volume":"99 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113993793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CT-DQN: Control-Tutored Deep Reinforcement Learning
Pub Date: 2022-12-02 | DOI: 10.48550/arXiv.2212.01343
F. D. Lellis, M. Coraggio, G. Russo, Mirco Musolesi, M. Bernardo
One of the major challenges in Deep Reinforcement Learning for control is the need for extensive training to learn the policy. Motivated by this, we present the design of the Control-Tutored Deep Q-Networks (CT-DQN) algorithm, a Deep Reinforcement Learning algorithm that leverages a control tutor, i.e., an exogenous control law, to reduce learning time. The tutor can be designed using an approximate model of the system, without any assumption about knowledge of the system's dynamics, and it is not expected to achieve the control objective if used stand-alone. During learning, the tutor occasionally suggests an action, thus partially guiding exploration. We validate our approach on three scenarios from OpenAI Gym: the inverted pendulum, lunar lander, and car racing. We demonstrate that CT-DQN is able to achieve better or equivalent data efficiency with respect to classic function approximation solutions.
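The tutored exploration can be pictured as an action-selection rule in which, with some probability, the exploratory action comes from the exogenous control law instead of being drawn uniformly. The switching probabilities and interface below are assumptions for illustration, not the exact rule used in CT-DQN.

```python
import random

def tutored_action(q_values, state, tutor, epsilon=0.1, tutor_prob=0.2):
    """Epsilon-greedy selection where part of the exploration budget is
    delegated to the control tutor.

    q_values: sequence of Q(s, a) for each discrete action.
    tutor:    callable mapping a state to a suggested discrete action."""
    if random.random() < epsilon:
        if random.random() < tutor_prob:
            return tutor(state)                      # follow the control tutor
        return random.randrange(len(q_values))       # uniform random exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # greedy action
```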
{"title":"CT-DQN: Control-Tutored Deep Reinforcement Learning","authors":"F. D. Lellis, M. Coraggio, G. Russo, Mirco Musolesi, M. Bernardo","doi":"10.48550/arXiv.2212.01343","DOIUrl":"https://doi.org/10.48550/arXiv.2212.01343","url":null,"abstract":"One of the major challenges in Deep Reinforcement Learning for control is the need for extensive training to learn the policy. Motivated by this, we present the design of the Control-Tutored Deep Q-Networks (CT-DQN) algorithm, a Deep Reinforcement Learning algorithm that leverages a control tutor, i.e., an exogenous control law, to reduce learning time. The tutor can be designed using an approximate model of the system, without any assumption about the knowledge of the system's dynamics. There is no expectation that it will be able to achieve the control objective if used stand-alone. During learning, the tutor occasionally suggests an action, thus partially guiding exploration. We validate our approach on three scenarios from OpenAI Gym: the inverted pendulum, lunar lander, and car racing. We demonstrate that CT-DQN is able to achieve better or equivalent data efficiency with respect to the classic function approximation solutions.","PeriodicalId":268449,"journal":{"name":"Conference on Learning for Dynamics & Control","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130017767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}