AMPRO-HPCC: A Machine-Learning Tool for Predicting Resources on Slurm HPC Clusters
Mohammed Tanash, Daniel Andresen, William Hsu
ADVCOMP ... the ... International Conference on Advanced Engineering Computing and Applications in Sciences, vol. 2021, pp. 20-27. Published 2021-10-01.
Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9906793/pdf/nihms-1831252.pdf
Abstract
Determining resource allocations (memory and time) for jobs submitted to High Performance Computing (HPC) systems is challenging, even for computer scientists. HPC users are strongly encouraged to overestimate the resources they request so that their jobs are not killed for lack of resources. Overestimation stems from the wide variety of HPC applications and environment configuration options, and from limited knowledge of the complex structure of HPC systems. It wastes HPC resources, lowers system utilization, and increases waiting and turnaround times for submitted jobs. In this paper, we introduce our first implementation of a fully offline, fully automated, stand-alone, open-source Machine Learning (ML) tool that helps users predict the memory and time requirements of the jobs they submit to the cluster. The tool implements six discriminative ML models from scikit-learn and Microsoft LightGBM, applied to historical data (sacct data) from the Simple Linux Utility for Resource Management (Slurm). We tested the tool on historical sacct data from Beocat, the HPC cluster of Kansas State University, covering January 2019 to March 2021 and containing around 17.6 million jobs. Our results show that the tool achieves high predictive accuracy, with R² of 0.72 using LightGBM for memory prediction and 0.74 using Random Forest for time prediction. It also helps dramatically reduce the average waiting time and turnaround time of submitted jobs and increases utilization of HPC resources. Hence, our tool decreases the power consumption of the HPC resources.
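As an illustration of the quantities mentioned in the abstract, the following minimal sketch shows how sacct-style duration and memory strings might be turned into numeric regression targets, and how the R² score is computed. It is not the authors' pipeline. The function names are hypothetical, and the input formats (a duration written as [DD-]HH:MM:SS, a memory value with a K/M/G/T suffix and an optional trailing n or c) are assumptions about typical sacct output.

```python
import re

# Memory suffixes converted to megabytes, treating K/M/G/T as powers of 1024
# (an assumption of this sketch).
_UNIT_TO_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}


def elapsed_to_seconds(elapsed: str) -> int:
    """Convert a duration written as '[DD-]HH:MM:SS' to seconds."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def memory_to_mb(value: str) -> float:
    """Convert a memory string such as '512M', '4Gn' or '2048K' to megabytes.

    A trailing 'n' or 'c' (assumed per-node / per-core marker) is dropped
    without any scaling.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT])[nc]?", value.strip())
    if match is None:
        raise ValueError(f"unrecognised memory value: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_TO_MB[unit]


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

An R² of 1.0 means the predictions match the observed values exactly, while 0.0 means they do no better than always predicting the mean of the observed values. The reported scores of 0.72 and 0.74 should be read against that scale.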