Enhancing HPC System Log Analysis by Identifying Message Origin in Source Code
Megan Hickman, Dakota Fulp, Elisabeth Baseman, S. Blanchard, Hugh Greenberg, William M. Jones, Nathan Debardeleben
2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), published 2018-10-01
DOI: 10.1109/ISSREW.2018.00-23 (https://doi.org/10.1109/ISSREW.2018.00-23)
Citations: 3
Abstract
Supercomputers, high-performance computers, and clusters are composed of very large numbers of independent operating systems, each generating its own system logs. Messages are generated locally on each host and are usually transferred to a central logging infrastructure, which keeps a master record of the system as a whole. At Los Alamos National Laboratory (LANL), a collection of open-source cloud tools is used to log over a hundred million system log messages per day from over a dozen such systems. Understanding what source code created those messages can be extremely useful to system administrators troubleshooting these complex systems, as it can give insight into the subsystem involved (disk, network, etc.) or even the line numbers of the source code. Supercomputers are often debugged in environments where open access cannot be provided to all individuals due to security concerns. Providing a means of conveying information between system log messages and source code lines therefore allows for communication between system administrators and source developers or supercomputer vendors. In this work, we demonstrate a prototype tool that aims to provide such an expert system. We leverage capabilities from ElasticSearch, one of the open-source cloud tools deployed at LANL, and, with our own metrics, develop a means of correctly matching source code lines as well as files with high confidence. We discuss confidence metrics and show that in our experiments 92% of syslog lines were correctly matched. For any future samples, we predict with 95% confidence that the correct file will be detected between 88.2% and 95.8% of the time. Finally, we discuss enhancements that are underway to improve the tool and study it on a larger dataset.
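
The reported interval (88.2% to 95.8%) is symmetric about the 92% match rate, which is consistent with a normal-approximation (Wald) confidence interval for a proportion. The following is a minimal sketch of that calculation, not the authors' stated method. The sample size of 200 lines (184 matched) and the function name `wald_interval` are assumptions chosen for illustration; the abstract reports neither the sample size nor how the interval was computed.

```python
import math
from typing import Tuple


def wald_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    """
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# ASSUMPTION: a hypothetical sample of 200 syslog lines, 184 matched (92%).
# The abstract does not state the actual sample size.
low, high = wald_interval(184, 200)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumed counts the interval agrees with the reported bounds to one decimal place. A different sample size would give a different width, so this agreement does not confirm the actual experimental setup.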