Title: Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs
Authors: Abida Haque, Alexandra DeLucia, Elisabeth Baseman
DOI: https://doi.org/10.1145/3152493.3152559
Abstract: As high performance computing approaches the exascale era, analyzing the massive amount of monitoring data generated by supercomputers is quickly becoming intractable for human analysts. In particular, system logs, which are a crucial source of information for assessing machine health and for root-cause analysis of problems and failures, are becoming far too large for a human to review by hand. We take a step toward mitigating this problem by mathematically modeling textual system log data to automatically capture normal behavior and identify anomalous, potentially interesting log messages. We learn a Markov chain model from average-case system logs and use it to generate synthetic system log data. We present a variety of evaluation metrics for scoring similarity between the synthetic logs and the real logs, thus defining and quantifying normal behavior. We then evaluate the learned model's ability to identify anomalous behavior by testing whether it catches inserted and missing log messages. We evaluate the model's anomaly detection performance on a large set of system log files from two institutional computing clusters at Los Alamos National Laboratory. We find that while the model captures key features of normal behavior, its ability to detect anomalies varies greatly by anomaly type and by the training and test data used. Overall, we find mathematical modeling of system logs to be a promising area for further work, particularly for aiding human operators in troubleshooting tasks.
Title: ITALC: Interactive Tool for Application-Level Checkpointing
Authors: R. Arora, Trung Nguyen Ba
DOI: https://doi.org/10.1145/3152493.3152558
Abstract: The computational resources at open-science supercomputing centers are shared among multiple users at a given time and are therefore governed by policies that ensure their fair and optimal usage. Such policies can impose upper limits on (1) the number of compute nodes and (2) the wall-clock time that can be requested per computational job. Given these limits, several applications may not run to completion in a single session. As a workaround, users are advised to take advantage of the checkpoint-and-restart technique and spread their computations across multiple interdependent computational jobs. The checkpoint-and-restart technique periodically saves the execution state of an application; a saved state is known as a checkpoint. When a computational job times out after running for the maximum wall-clock time, leaving the computation incomplete, the user can submit a new job that resumes the computation from the checkpoints saved during the previous run. The checkpoint-and-restart technique can also make applications tolerant to certain types of faults, such as network and compute-node failures. When this technique is built into an application itself, it is called Application-Level Checkpointing (ALC). We are developing an interactive tool that assists users in semi-automatically inserting the ALC mechanism into their existing applications without any manual reengineering. Compared to other checkpointing approaches, the checkpoints written with our tool have a smaller memory footprint and thus incur a smaller I/O overhead.
Title: Testpilot: A Flexible Framework for User-centric Testing of HPC Clusters
Authors: K. Colby, A. Maji, Jason Rahman, J. Bottum
DOI: https://doi.org/10.1145/3152493.3152555
Abstract: HPC systems are made of many complex hardware and software components, and the interactions between these components can break, leading to job failures and customer dissatisfaction. Testing focused on individual components is often inadequate for identifying broken inter-component interactions; detecting and avoiding them requires a holistic testing framework that can test the full functionality and performance of a cluster from a user's perspective. Existing tools for HPC cluster testing are either rigid (i.e., tied to a single cluster) or focused on system components (i.e., the OS and middleware). In this paper, we present Testpilot, a flexible, holistic, and user-centric testing framework that can be used by system administrators, support staff, or even by users themselves. Testpilot supports various testing scenarios such as application testing, application updates, OS updates, and continuous monitoring of cluster health. The authors have found Testpilot invaluable for regression testing at their HPC site, where it has caught many issues that would otherwise have gone into production unnoticed.
Title: Nix as HPC package management system
Authors: B. Bzeznik, O. Henriot, Valentin Reis, Olivier Richard, Laure Tavard
DOI: https://doi.org/10.1145/3152493.3152556
Abstract: Modern High Performance Computing systems are becoming larger and more heterogeneous. Properly managing software for the users of such systems poses a significant challenge. These users run very diverse applications that may be compiled with proprietary tools for specialized hardware. Moreover, the life cycle of these applications may exceed the lifetime of the HPC systems themselves. These difficulties motivate the use of specialized package management systems. In this paper, we outline an approach to HPC package development, deployment, management, sharing, and reuse based on the Nix functional package manager. We report our experience with this approach at the GRICAD HPC center [GRICAD 2017a] in Grenoble over a 12-month period and compare it to other existing approaches.
Title: An Edge Service for Managing HPC Workflows
Authors: J. T. Childers, T. Uram, D. Benjamin, T. LeCompte, M. Papka
DOI: https://doi.org/10.1145/3152493.3152557
Abstract: Large experimental collaborations, such as those at the Large Hadron Collider at CERN, have developed job management systems that run hundreds of thousands of jobs across worldwide computing grids. HPC facilities are becoming increasingly important to these data-intensive workflows, but integrating them into an experiment's job management system is non-trivial due to tighter security constraints and heterogeneous computing environments. This article describes a common edge service developed and deployed on DOE supercomputers for both small users and large collaborations. The edge service provides a uniform way to interact with many different supercomputers. Example users and the associated performance results are described.
{"title":"Proceedings of the Fourth International Workshop on HPC User Support Tools","authors":"","doi":"10.1145/3152493","DOIUrl":"https://doi.org/10.1145/3152493","url":null,"abstract":"","PeriodicalId":258031,"journal":{"name":"Proceedings of the Fourth International Workshop on HPC User Support Tools","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114272716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}