PreFix: Switch Failure Prediction in Datacenter Networks

Shenglin Zhang, Y. Liu, Weibin Meng, Zhiling Luo, Jiahao Bu, Sen Yang, Peixian Liang, Dan Pei, Jun Xu, Yuzhi Zhang, Yu Chen, Hui Dong, Xianping Qu, Lei Song
{"title":"PreFix: Switch Failure Prediction in Datacenter Networks","authors":"Shenglin Zhang, Y. Liu, Weibin Meng, Zhiling Luo, Jiahao Bu, Sen Yang, Peixian Liang, Dan Pei, Jun Xu, Yuzhi Zhang, Yu Chen, Hui Dong, Xianping Qu, Lei Song","doi":"10.1145/3219617.3219643","DOIUrl":null,"url":null,"abstract":"In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and \"fix\" the potential failures before they happen. Specifically, in our proposed system, named PreFix, we aim to determine during runtime whether a switch failure will happen in the near future. The prediction is based on the measurements of the current switch system status and historical switch hardware failure cases that have been carefully labelled by network operators. Our key observation is that failures of the same switch model share some common syslog patterns before failures occur, and we can apply machine learning methods to extract the common patterns for predicting switch failures. Our novel set of features (message template sequence, frequency, seasonality and surge) for machine learning can efficiently deal with the challenges of noises, sample imbalance, and computation overhead. We evaluated PreFix on a data set collected from 9397 switches (3 different switch models) deployed in more than 20 datacenters owned by a top global search engine in a 2-year period. PreFix achieved an average of 61.81% recall and 1.84x10 -5 false positive ratio, outperforming the other failure prediction methods for computers and ISP devices.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"72","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3219617.3219643","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 72

Abstract

In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and "fix" the potential failures before they happen. Specifically, in our proposed system, named PreFix, we aim to determine during runtime whether a switch failure will happen in the near future. The prediction is based on the measurements of the current switch system status and historical switch hardware failure cases that have been carefully labelled by network operators. Our key observation is that failures of the same switch model share some common syslog patterns before failures occur, and we can apply machine learning methods to extract the common patterns for predicting switch failures. Our novel set of features (message template sequence, frequency, seasonality and surge) for machine learning can efficiently deal with the challenges of noises, sample imbalance, and computation overhead. We evaluated PreFix on a data set collected from 9397 switches (3 different switch models) deployed in more than 20 datacenters owned by a top global search engine in a 2-year period. PreFix achieved an average of 61.81% recall and 1.84x10 -5 false positive ratio, outperforming the other failure prediction methods for computers and ISP devices.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
前缀:数据中心网络中的交换机故障预测
在现代数据中心网络(dcn)中,网络设备的故障是常态,而不是例外,许多研究工作都集中在故障发生后的处理上。在本文中,我们采取了一种不同的方法,通过预测故障,从而使操作员可以在潜在故障发生之前进行干预和“修复”。具体来说,在我们提出的名为PreFix的系统中,我们的目标是在运行时确定在不久的将来是否会发生交换机故障。该预测是基于对当前交换机系统状态和历史交换机硬件故障案例的测量,这些故障案例已被网络运营商仔细标记。我们的主要观察结果是,同一交换机模型的故障在发生故障之前共享一些常见的syslog模式,我们可以应用机器学习方法来提取用于预测交换机故障的常见模式。我们新颖的机器学习特征集(消息模板序列、频率、季节性和激增)可以有效地处理噪声、样本不平衡和计算开销的挑战。我们在一个数据集上对PreFix进行了评估,该数据集收集了9397台交换机(3种不同的交换机型号),这些交换机部署在一家全球顶级搜索引擎拥有的20多个数据中心中,历时2年。PreFix的平均召回率为61.81%,假阳性率为1.84 × 10 -5,优于其他针对计算机和ISP设备的故障预测方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Session details: Networking Asymptotically Optimal Load Balancing Topologies On Resource Pooling and Separation for LRU Caching Working Set Size Estimation Techniques in Virtualized Environments: One Size Does not Fit All PreFix: Switch Failure Prediction in Datacenter Networks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1