An Intelligent Framework for Timely, Accurate, and Comprehensive Cloud Incident Detection

Operating Systems Review (ACM) | JCR Q3, Computer Science | Published: 2022-06-14 | DOI: 10.1145/3544497.3544499
Yichen Li, Xu Zhang, Shilin He, Zhuangbin Chen, Yu Kang, Jinyang Liu, Liqun Li, Yingnong Dang, Feng Gao, Zhangwei Xu, S. Rajmohan, Qingwei Lin, Dongmei Zhang, Michael R. Lyu
Volume 56, Issue 1, pp. 1-7. Publication type: Journal Article.
Citations: 5

Abstract

Cloud incidents (service interruptions or performance degradation) dramatically degrade the reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss. After years of effort, cloud providers are able to resolve most incidents automatically and rapidly. The key to this ability is intelligent incident detection: only when incidents are detected in a timely, accurate, and comprehensive manner can they be diagnosed and mitigated at a satisfactory speed. To overcome the limitations of traditional rule-based detection, we carried out years of incident detection research and developed a comprehensive AIOps (Artificial Intelligence for IT Operations) framework for incident detection comprising a set of data-driven methods. This paper shares our recent experience of developing and deploying such an intelligent incident detection system at Microsoft. We first discuss the real-world challenges of incident detection that constitute the pain points of engineers. We then summarize the intelligent solutions we have proposed in recent years to tackle these challenges. Finally, we describe the deployment of the incident detection AIOps framework and demonstrate, with real cases, the practical benefits it has brought to Microsoft cloud services.
Source journal: Operating Systems Review (ACM)
Subject area: Computer Science - Computer Networks and Communications
CiteScore: 2.80
Self-citation rate: 0.00%
Articles published: 10
About the journal: Operating Systems Review (OSR) is a publication of the ACM Special Interest Group on Operating Systems (SIGOPS), whose scope of interest includes: computer operating systems and architecture for multiprogramming, multiprocessing, and time sharing; resource management; evaluation and simulation; reliability, integrity, and security of data; communications among computing processors; and computer system modeling and analysis.
Recent articles in this journal:
Disaggregated GPU Acceleration for Serverless Applications
Navigating Performance-Efficiency Tradeoffs in Serverless Computing: Deduplication to the Rescue!
Using Local Cache Coherence for Disaggregated Memory Systems
Make It Real: An End-to-End Implementation of A Physically Disaggregated Data Center
Memory disaggregation: why now and what are the challenges