Chaos Monkey: Increasing SDN Reliability through Systematic Network Destruction
M. Chang, Brendan Tschaen, Theophilus A. Benson, L. Vanbever
Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication
Published: 2015-08-17 · DOI: 10.1145/2785956.2790038
Citations: 30
Abstract
As modern networking applications become increasingly dynamic and high-bandwidth, software-defined networking (SDN) has emerged as an agile, cost-effective architecture with widespread adoption across industry. In SDN, the control-plane program runs on a logically centralized controller, which directly configures the packet-handling mechanisms in the underlying switches using an open API (e.g., OpenFlow). While the controller makes it exceptionally convenient for a network operator to control and manage a network, it requires complex logic and becomes a single point of failure within the network. As a result, configuration errors by the controller can be extremely costly for the network provider. Several SDN controllers have been developed since the conception of SDN, and network operators have relied on traditional means of identifying bugs in the controller, such as unit testing and model checking [1]. However, it has become apparent that these methods cannot practically handle the inherent complexity of a controller platform that manages large networks. Ultimately, one major source of this complexity is network failures, as they trigger execution of unexplored portions of code; these network failures are inevitable and costly, and considering all possible interleavings of bugs is simply infeasible. To address this problem, we propose "Chaos Monkey," a real-time, post-deployment failure-injection tool. Inspired by industry practices in the cloud [2], Chaos Monkey is intended to systematically introduce failures (e.g., link failure, network failure) into a network. Chaos Monkey is guided by the following design principles:
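To make the idea of systematic failure injection concrete, the following is a minimal sketch of exploring single-link failures against a reachability invariant. The `Network` class, `inject_failures` function, and the probe are illustrative stand-ins of my own devising, not the authors' actual tool or any real controller API.

```python
import random

class Network:
    """Toy topology model: a set of bidirectional links between nodes."""

    def __init__(self, links):
        # links: iterable of frozensets {a, b}, one per bidirectional link
        self.links = set(links)

    def fail_link(self, link):
        self.links.discard(link)

    def restore_link(self, link):
        self.links.add(link)

    def connected(self, src, dst):
        # BFS over the surviving links
        seen, frontier = {src}, [src]
        while frontier:
            node = frontier.pop()
            for link in self.links:
                if node in link:
                    (other,) = link - {node}
                    if other not in seen:
                        seen.add(other)
                        frontier.append(other)
        return dst in seen


def inject_failures(net, probe, rng=random):
    """Fail each link in a random order; after each injected failure,
    run a probe (e.g., a reachability check) and record any violation,
    then restore the link before moving on."""
    violations = []
    for link in rng.sample(sorted(net.links, key=sorted), len(net.links)):
        net.fail_link(link)
        if not probe(net):
            violations.append(link)
        net.restore_link(link)
    return violations
```

On a redundant (triangle) topology, every single-link failure is absorbed and the probe reports no violations; on a linear topology, each link is a single point of failure and every injection surfaces a violation, which is exactly the kind of unexplored failure path such a tool is meant to exercise.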