Title: Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs
Authors: Abida Haque, Alexandra DeLucia, Elisabeth Baseman
DOI: https://doi.org/10.1145/3152493.3152559
Abstract: As high performance computing approaches the exascale era, analyzing the massive amount of monitoring data generated by supercomputers is quickly becoming intractable for human analysts. In particular, system logs, which are a crucial source of information for assessing machine health and for root-cause analysis of problems and failures, are becoming far too large for a human to review by hand. We take a step toward mitigating this problem by mathematically modeling textual system log data to automatically capture normal behavior and identify anomalous, potentially interesting log messages. We learn a Markov chain model from average-case system logs and use it to generate synthetic system log data. We present a variety of evaluation metrics for scoring similarity between the synthetic logs and the real logs, thus defining and quantifying normal behavior. We then evaluate the learned model's ability to identify anomalous behavior by testing whether it catches inserted and missing log messages. We evaluate the model's anomaly detection performance on a large set of system log files from two institutional computing clusters at Los Alamos National Laboratory. We find that while the model captures key features of normal behavior, its ability to detect anomalies varies greatly by anomaly type and by the training and test data used. Overall, we find mathematical modeling of system logs to be a promising area for further work, particularly for aiding human operators in troubleshooting tasks.
Title: ITALC: Interactive Tool for Application-Level Checkpointing
Authors: R. Arora, Trung Nguyen Ba
DOI: https://doi.org/10.1145/3152493.3152558
Abstract: The computational resources at open-science supercomputing centers are shared among multiple users at a given time and are therefore governed by policies that ensure their fair and optimal usage. Such policies can impose upper limits on (1) the number of compute nodes and (2) the wall-clock time that can be requested per computational job. Given these limits, several applications may not run to completion in a single session. As a workaround, users are advised to take advantage of the checkpoint-and-restart technique and spread their computations across multiple interdependent computational jobs. The checkpoint-and-restart technique periodically saves the execution state of an application; a saved state is known as a checkpoint. When a computational job times out after running for the maximum wall-clock time, leaving the computation incomplete, the user can submit a new job that resumes the computation from the checkpoints saved during the previous run. The checkpoint-and-restart technique can also make applications tolerant to certain types of faults, such as network and compute-node failures. When this technique is built into an application itself, it is called Application-Level Checkpointing (ALC). We are developing an interactive tool that assists users in semi-automatically inserting the ALC mechanism into their existing applications without any manual reengineering. Compared to other checkpointing approaches, the checkpoints written with our tool have a smaller memory footprint and thus incur a smaller I/O overhead.
Title: Testpilot: A Flexible Framework for User-centric Testing of HPC Clusters
Authors: K. Colby, A. Maji, Jason Rahman, J. Bottum
DOI: https://doi.org/10.1145/3152493.3152555
Abstract: HPC systems are made of many complex hardware and software components, and the interactions between these components can break, leading to job failures and customer dissatisfaction. Testing focused on individual components is often inadequate for identifying broken inter-component interactions; detecting and avoiding them requires a holistic testing framework that can test the full functionality and performance of a cluster from a user's perspective. Existing tools for HPC cluster testing are either rigid (i.e., tied to a single cluster) or focused on system components (i.e., the OS and middleware). In this paper, we present Testpilot, a flexible, holistic, and user-centric testing framework that can be used by system administrators, support staff, or even by users themselves. Testpilot supports various testing scenarios such as application testing, application updates, OS updates, and continuous monitoring of cluster health. The authors have found Testpilot invaluable for regression testing at their HPC site, where it has caught many issues that would otherwise have gone into production unnoticed.
Title: Nix as HPC package management system
Authors: B. Bzeznik, O. Henriot, Valentin Reis, Olivier Richard, Laure Tavard
DOI: https://doi.org/10.1145/3152493.3152556
Abstract: Modern High Performance Computing systems are becoming larger and more heterogeneous. Properly managing software for the users of such systems poses a significant challenge. These users run very diverse applications that may be compiled with proprietary tools for specialized hardware. Moreover, the life cycle of these applications may exceed the lifetime of the HPC systems themselves. These difficulties motivate the use of specialized package management systems. In this paper, we outline an approach to HPC package development, deployment, management, sharing, and reuse based on the Nix functional package manager. We report our experience with this approach at the GRICAD HPC center [GRICAD 2017a] in Grenoble over a 12-month period and compare it to other existing approaches.
Title: An Edge Service for Managing HPC Workflows
Authors: J. T. Childers, T. Uram, D. Benjamin, T. LeCompte, M. Papka
DOI: https://doi.org/10.1145/3152493.3152557
Abstract: Large experimental collaborations, such as those at the Large Hadron Collider at CERN, have developed job management systems that run hundreds of thousands of jobs across worldwide computing grids. HPC facilities are becoming increasingly important to these data-intensive workflows, but integrating them into an experiment's job management system is non-trivial due to tighter security constraints and heterogeneous computing environments. This article describes a common edge service developed and deployed on DOE supercomputers for both small users and large collaborations. The edge service provides a uniform way to interact with many different supercomputers. Example users and the associated performance results are described.
{"title":"Proceedings of the Fourth International Workshop on HPC User Support Tools","authors":"","doi":"10.1145/3152493","DOIUrl":"https://doi.org/10.1145/3152493","url":null,"abstract":"","PeriodicalId":258031,"journal":{"name":"Proceedings of the Fourth International Workshop on HPC User Support Tools","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114272716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}