Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs

Abida Haque, Alexandra DeLucia, Elisabeth Baseman
Proceedings of the Fourth International Workshop on HPC User Support Tools, published 2017-11-12
DOI: 10.1145/3152493.3152559 (https://doi.org/10.1145/3152493.3152559)
Citations: 12

Abstract

As high performance computing approaches the exascale era, analyzing the massive amount of monitoring data generated by supercomputers is quickly becoming intractable for human analysts. In particular, system logs, which are a crucial source of information regarding machine health and root cause analysis of problems and failures, are becoming far too large for a human to review by hand. We take a step toward mitigating this problem through mathematical modeling of textual system log data in order to automatically capture normal behavior and identify anomalous and potentially interesting log messages. We learn a Markov chain model from average case system logs and use it to generate synthetic system log data. We present a variety of evaluation metrics for scoring similarity between the synthetic logs and the real logs, thus defining and quantifying normal behavior. Then, we explore the abilities of this learned model to identify anomalous behavior by evaluating its ability to catch inserted and missing log messages. We evaluate our model and its performance on the anomaly detection task using a large set of system log files from two institutional computing clusters at Los Alamos National Laboratory. We find that while our model seems to pick up on key features of normal behavior, its ability to detect anomalies varies greatly by anomaly type and the training and test data used. Overall, we find mathematical modeling of system logs to be a promising area for further work, particularly with the goal of aiding human operators in troubleshooting tasks.
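
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: fit a first-order Markov chain to sequences of log message types, sample synthetic sequences from it, and score each transition in a test sequence by how unlikely it is under the model. The message-type names, the function names, and the `floor` probability assigned to unseen transitions are all illustrative assumptions, not the authors' method or data.

```python
import math
import random
from collections import Counter, defaultdict


def fit_markov_chain(sequences):
    """Estimate first-order transition probabilities from message-type sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }


def generate(model, start, length, rng):
    """Sample a synthetic sequence by walking the chain from `start`.

    Stops early if it reaches a state with no observed outgoing transition.
    """
    seq = [start]
    while len(seq) < length and seq[-1] in model:
        nxts = model[seq[-1]]
        seq.append(rng.choices(list(nxts), weights=list(nxts.values()))[0])
    return seq


def transition_surprisals(model, seq, floor=1e-6):
    """Negative log-probability of each transition in `seq`.

    Transitions never seen in training get the assumed probability `floor`.
    """
    return [
        -math.log(model.get(prev, {}).get(nxt, floor))
        for prev, nxt in zip(seq, seq[1:])
    ]


# Hypothetical message-type sequences standing in for templated log lines.
normal_logs = [
    ["boot", "mount", "login", "job_start", "job_end", "logout"],
    ["boot", "mount", "login", "job_start", "job_end", "logout"],
    ["boot", "mount", "login", "logout"],
]
model = fit_markov_chain(normal_logs)
synthetic = generate(model, "boot", 6, random.Random(0))

# An inserted message ("oom_kill") creates transitions never seen in training,
# so the transitions into and out of it receive high surprisal.
test_seq = ["boot", "mount", "login", "oom_kill", "job_start"]
scores = transition_surprisals(model, test_seq)
```

In this toy setup an inserted message shows up as two high-surprisal transitions, while a missing message would show up as a single unusual jump between otherwise normal states. How well either signal separates real anomalies from normal variation depends on the training data, which is consistent with the abstract's finding that detection ability varies by anomaly type.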