Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps

Amrita Saha, S. Hoi
{"title":"Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps","authors":"Amrita Saha, S. Hoi","doi":"10.1145/3510457.3513030","DOIUrl":null,"url":null,"abstract":"Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes, especially for cloud industry leaders like Salesforce. Typically RCA investigation leverages data-sources like application error logs or service call traces. However a rich goldmine of root cause information is also hidden in the natural language documentation of the past incidents investigations by domain experts. This is generally termed as Problem Review Board (PRB) Data which constitute a core component of IT Incident Management. However, owing to the raw unstructured nature of PRBs, such root cause knowledge is not directly reusable by manual or automated pipelines for RCA of new incidents. This motivates us to leverage this widely-available data-source to build an Incident Causation Analysis (ICA) engine, using SoTA neural NLP techniques to extract targeted information and construct a structured Causal Knowledge Graph from PRB documents. ICA forms the backbone of a simple-yet-effective Retrieval based RCA for new incidents, through an Information Retrieval system to search and rank past incidents and detect likely root causes from them, given the incident symptom. In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce, over 2K documented cloud service incident investigations collected over a few years. We also establish the effectiveness of ICA and the downstream tasks through various quantitative benchmarks, qualitative analysis as well as domain expert's validation and real incident case studies after deployment.","PeriodicalId":119790,"journal":{"name":"2022 IEEE/ACM 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3510457.3513030","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes, especially for cloud industry leaders like Salesforce. Typically RCA investigation leverages data-sources like application error logs or service call traces. However a rich goldmine of root cause information is also hidden in the natural language documentation of the past incidents investigations by domain experts. This is generally termed as Problem Review Board (PRB) Data which constitute a core component of IT Incident Management. However, owing to the raw unstructured nature of PRBs, such root cause knowledge is not directly reusable by manual or automated pipelines for RCA of new incidents. This motivates us to leverage this widely-available data-source to build an Incident Causation Analysis (ICA) engine, using SoTA neural NLP techniques to extract targeted information and construct a structured Causal Knowledge Graph from PRB documents. ICA forms the backbone of a simple-yet-effective Retrieval based RCA for new incidents, through an Information Retrieval system to search and rank past incidents and detect likely root causes from them, given the incident symptom. In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce, over 2K documented cloud service incident investigations collected over a few years. We also establish the effectiveness of ICA and the downstream tasks through various quantitative benchmarks, qualitative analysis as well as domain expert's validation and real incident case studies after deployment.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
从针对AIOps的云服务事件调查中挖掘根本原因知识
任何服务中断事件的根本原因分析(RCA)是IT流程中最关键也是最复杂的任务之一,特别是对于像Salesforce这样的云行业领导者来说。通常,RCA调查利用诸如应用程序错误日志或服务调用跟踪之类的数据源。然而,领域专家对过去事件调查的自然语言文档中也隐藏着丰富的根本原因信息金矿。这通常被称为问题审查委员会(PRB)数据,它构成了IT事件管理的核心组件。然而,由于prb的原始非结构化性质,这些根本原因知识不能通过手工或自动化管道直接用于新事件的RCA。这促使我们利用这个广泛可用的数据源来构建事件因果分析(ICA)引擎,使用SoTA神经NLP技术从PRB文档中提取目标信息并构建结构化的因果知识图。ICA通过信息检索系统对过去的事件进行搜索和排序,并根据事件症状从事件中检测可能的根本原因,从而形成了简单而有效的基于检索的新事件RCA的主干。在本文中,我们介绍了在Salesforce构建的基于ICA和下游事件搜索和检索的RCA管道,以及几年来收集的超过2K个记录的云服务事件调查。我们还通过各种定量基准、定性分析以及领域专家的验证和部署后的真实事件案例研究,建立了ICA和下游任务的有效性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Industry's Cry for Tools that Support Large-Scale Refactoring Code Reviewer Recommendation in Tencent: Practice, Challenge, and Direction* What's bothering developers in code review? The Impact of Flaky Tests on Historical Test Prioritization on Chrome Surveying the Developer Experience of Flaky Tests
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1