Chaos Monkey: Increasing SDN Reliability through Systematic Network Destruction
M. Chang, Brendan Tschaen, Theophilus A. Benson, L. Vanbever
Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication
Published: 2015-08-17 · DOI: 10.1145/2785956.2790038
Citations: 30
Abstract
As modern networking applications become increasingly dynamic and high-bandwidth, software-defined networking (SDN) has emerged as an agile, cost-effective architecture with widespread adoption across industry. In SDN, the control-plane program runs on a logically centralized controller, which directly configures the packet-handling mechanisms in the underlying switches using an open API (e.g., OpenFlow). While the controller makes it exceptionally convenient for a network operator to control and manage a network, it requires complex logic and becomes a single point of failure within the network. As a result, configuration errors by the controller can be extremely costly for the network provider. Several SDN controllers have been developed since the conception of SDN, and network operators have relied on traditional means of identifying bugs in the controller, such as unit testing and model checking [1]. However, it has become apparent that these methods cannot practically handle the inherent complexity of a controller platform that manages large networks. Ultimately, one major source of this complexity is network failures, as they trigger execution of unexplored portions of code; these network failures are inevitable and costly, and considering all possible interleavings of bugs is simply infeasible. To address this problem, we propose "Chaos Monkey," a real-time, post-deployment failure-injection tool. Inspired by industry practices in the cloud [2], Chaos Monkey is intended to systematically introduce failures (e.g., link failure, network failure) into a network. Chaos Monkey is guided by the following design principles:
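To make the idea of systematic failure injection concrete, the following is a minimal sketch of exploring single-link failures against a reachability invariant. The `Network` class, `inject_failures` function, and the probe are illustrative stand-ins of my own devising, not the authors' actual tool or any real controller API.

```python
import random

class Network:
    """Toy topology model: a set of bidirectional links between nodes."""

    def __init__(self, links):
        # links: iterable of frozensets {a, b}, one per bidirectional link
        self.links = set(links)

    def fail_link(self, link):
        self.links.discard(link)

    def restore_link(self, link):
        self.links.add(link)

    def connected(self, src, dst):
        # BFS over the surviving links
        seen, frontier = {src}, [src]
        while frontier:
            node = frontier.pop()
            for link in self.links:
                if node in link:
                    (other,) = link - {node}
                    if other not in seen:
                        seen.add(other)
                        frontier.append(other)
        return dst in seen


def inject_failures(net, probe, rng=random):
    """Fail each link in a random order; after each injected failure,
    run a probe (e.g., a reachability check) and record any violation,
    then restore the link before moving on."""
    violations = []
    for link in rng.sample(sorted(net.links, key=sorted), len(net.links)):
        net.fail_link(link)
        if not probe(net):
            violations.append(link)
        net.restore_link(link)
    return violations
```

On a redundant (triangle) topology, every single-link failure is absorbed and the probe reports no violations; on a linear topology, each link is a single point of failure and every injection surfaces a violation, which is exactly the kind of unexplored failure path such a tool is meant to exercise.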