Detecting large-scale system problems by mining console logs

Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. Pub Date: 2009-10-11. DOI: 10.1145/1629575.1629587

W. Xu, Ling Huang, A. Fox, D. Patterson, Michael I. Jordan


Citations: 992

Abstract

Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software's internals.
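The two-stage pipeline described in the abstract (parse log lines into features, then analyze the features to find problems) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the message templates, the log line wording, the identifier format, and the function names are hypothetical, and the hand-written regular expressions stand in for templates that the paper recovers through source code analysis. The abstract does not name the learning technique, so the detection step below swaps in a simple rarity check on per-identifier message-count vectors; it is not the authors' method.

```python
import re
from collections import Counter, defaultdict

# Hypothetical message templates, of the kind that might be recovered from
# logging statements in source code. The group captures an identifier.
TEMPLATES = {
    "receive": re.compile(r"Receiving block (\S+)"),
    "stored": re.compile(r"Stored block (\S+)"),
    "delete": re.compile(r"Deleting block (\S+)"),
    "error": re.compile(r"Exception writing block (\S+)"),
}


def parse_line(line):
    """Return (message_type, identifier), or None if no template matches."""
    for name, pattern in TEMPLATES.items():
        match = pattern.search(line)
        if match:
            return name, match.group(1)
    return None


def count_vectors(lines):
    """Group messages by identifier and count each message type."""
    counts = defaultdict(Counter)
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            name, ident = parsed
            counts[ident][name] += 1
    order = list(TEMPLATES)  # fixed feature order: receive, stored, delete, error
    return {ident: tuple(c[name] for name in order) for ident, c in counts.items()}


def flag_rare(vectors, max_share=0.2):
    """Flag identifiers whose count vector is shared by few identifiers.

    This rarity check is a stand-in for the machine learning step.
    """
    freq = Counter(vectors.values())
    total = len(vectors)
    return sorted(i for i, v in vectors.items() if freq[v] / total <= max_share)
```

Grouping by identifier is what turns free-text lines into a composite feature: each identifier gets one vector summarizing every message that mentions it, so an unusual sequence of events shows up as an unusual vector even when no single line looks alarming.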



