Authors: Hiroki Ikeuchi, Jiawen Ge, Yoichi Matsuo, Keishiro Watanabe
DOI: 10.1109/ICDCS47774.2020.00170
Published in: 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), November 2020
A Framework for Automatic Failure Recovery in ICT Systems by Deep Reinforcement Learning
Because automatic recovery from failures is of great importance for future operations of ICT systems, we propose a framework for learning a recovery policy using deep reinforcement learning. In our framework, an agent iteratively tries various recovery actions and observes system metrics in a target system, autonomously learning the optimal recovery policy, which indicates what recovery action should be executed on the basis of observations. By using failure injection tools designed for Chaos Engineering, we can reproduce many types of failures in the target system, thereby making the agent learn a recovery policy applicable to various failures. Once the recovery policy is obtained, we can automate failure recovery by executing the recovery actions that the policy returns. Unlike most previous methods, our framework requires neither historical records of failure recovery nor modeling of system behavior. To verify the feasibility of the framework, we conducted an experiment using a container-based environment built on a Kubernetes cluster, demonstrating that training converges in a few days and that the obtained recovery policy can successfully recover from failures with a minimum number of recovery actions.
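The learning loop the abstract describes (inject a failure, try recovery actions, observe whether the system returns to normal, update the policy) can be illustrated with a minimal sketch. This is not the paper's implementation: it substitutes tabular Q-learning for a deep Q-network, and the states, actions, and transition rules below are invented toy stand-ins for real Kubernetes metrics and operations. The per-step penalty mirrors the paper's goal of recovering with a minimum number of recovery actions.

```python
import random

# Toy failure-recovery environment. States stand in for observed
# symptoms and actions for recovery operations; all names here are
# illustrative, not from the paper.
STATES = ["pod_crash", "high_latency", "normal"]
ACTIONS = ["restart_pod", "reschedule_pod", "no_op"]

# Hand-made ground truth: the "correct" action repairs each fault.
# In the paper this dynamics is the real system under failure injection.
FIX = {"pod_crash": "restart_pod", "high_latency": "reschedule_pod"}

def step(state, action):
    """Return (next_state, reward). Each wasted action is penalized,
    so the learned policy favors recovering in as few steps as possible."""
    if state == "normal":
        return "normal", 0.0
    if action == FIX[state]:
        return "normal", 1.0   # fault repaired
    return state, -0.1         # fault persists, action wasted

def train(episodes=2000, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        state = rng.choice(["pod_crash", "high_latency"])  # inject a failure
        for _ in range(10):                                # cap episode length
            if state == "normal":
                break
            # epsilon-greedy action selection
            if rng.random() < eps:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward = step(state, action)
            best_next = max(q[(nxt, a)] for a in ACTIONS)
            # standard Q-learning update
            q[(state, action)] += alpha * (reward + gamma * best_next
                                           - q[(state, action)])
            state = nxt
    return q

q = train()
# The recovery policy maps each observed symptom to its best action.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES if s != "normal"}
print(policy)
```

After training, querying `policy` for an observed symptom returns the recovery action to execute, which is the automation step the abstract refers to once the policy is obtained.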