{"title":"Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs","authors":"Abida Haque, Alexandra DeLucia, Elisabeth Baseman","doi":"10.1145/3152493.3152559","DOIUrl":null,"url":null,"abstract":"As high performance computing approaches the exascale era, analyzing the massive amount of monitoring data generated by supercomputers is quickly becoming intractable for human analysts. In particular, system logs, which are a crucial source of information regarding machine health and root cause analysis of problems and failures, are becoming far too large for a human to review by hand. We take a step toward mitigating this problem through mathematical modeling of textual system log data in order to automatically capture normal behavior and identify anomalous and potentially interesting log messages. We learn a Markov chain model from average case system logs and use it to generate synthetic system log data. We present a variety of evaluation metrics for scoring similarity between the synthetic logs and the real logs, thus defining and quantifying normal behavior. Then, we explore the abilities of this learned model to identify anomalous behavior by evaluating its ability to catch inserted and missing log messages. We evaluate our model and its performance on the anomaly detection task using a large set of system log files from two institutional computing clusters at Los Alamos National Laboratory. We find that while our model seems to pick up on key features of normal behavior, its ability to detect anomalies varies greatly by anomaly type and the training and test data used. 
Overall, we find mathematical modeling of system logs to be a promising area for further work, particularly with the goal of aiding human operators in troubleshooting tasks.","PeriodicalId":258031,"journal":{"name":"Proceedings of the Fourth International Workshop on HPC User Support Tools","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Fourth International Workshop on HPC User Support Tools","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3152493.3152559","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12
Abstract
As high performance computing approaches the exascale era, analyzing the massive amount of monitoring data generated by supercomputers is quickly becoming intractable for human analysts. In particular, system logs, which are a crucial source of information on machine health and for root cause analysis of problems and failures, are becoming far too large for a human to review by hand. We take a step toward mitigating this problem through mathematical modeling of textual system log data, in order to automatically capture normal behavior and identify anomalous and potentially interesting log messages. We learn a Markov chain model from average-case system logs and use it to generate synthetic system log data. We present a variety of evaluation metrics for scoring similarity between the synthetic logs and the real logs, thus defining and quantifying normal behavior. We then explore the learned model's ability to identify anomalous behavior by evaluating how well it catches inserted and missing log messages. We evaluate the model's performance on the anomaly detection task using a large set of system log files from two institutional computing clusters at Los Alamos National Laboratory. We find that while the model seems to pick up on key features of normal behavior, its ability to detect anomalies varies greatly by anomaly type and by the training and test data used. Overall, we find mathematical modeling of system logs to be a promising area for further work, particularly with the goal of aiding human operators in troubleshooting tasks.
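The general technique the abstract describes, learning a Markov chain over log message types and flagging low-probability transitions, can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: the message types, the unseen-transition floor, and the anomaly threshold are all assumptions for the example.

```python
from collections import defaultdict

def train_markov_chain(sequence):
    """Learn first-order transition probabilities over log message types
    from a 'normal' training sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, curr in zip(sequence, sequence[1:]):
        counts[prev][curr] += 1
    probs = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        probs[prev] = {curr: n / total for curr, n in nexts.items()}
    return probs

def transition_scores(probs, sequence, floor=1e-6):
    """Score each consecutive transition in a test sequence; transitions
    never seen in training get a small floor probability (an assumption
    for this sketch, standing in for proper smoothing)."""
    return [probs.get(prev, {}).get(curr, floor)
            for prev, curr in zip(sequence, sequence[1:])]

def flag_anomalies(probs, sequence, threshold=0.01):
    """Return indices of transitions whose learned probability falls
    below the (illustrative) threshold."""
    return [i for i, p in enumerate(transition_scores(probs, sequence))
            if p < threshold]

# Hypothetical log message types standing in for parsed log templates.
normal_logs = ["boot", "mount", "auth"] * 50
model = train_markov_chain(normal_logs)

# A test stream with an inserted unexpected message.
test_logs = ["boot", "mount", "ERR_XYZ", "auth", "boot"]
print(flag_anomalies(model, test_logs))  # transitions into and out of ERR_XYZ
```

The same learned `probs` table can also drive synthetic log generation, as in the paper's evaluation setup, by sampling the next message type from each state's transition distribution.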