PEARC19 : Practice and Experience in Advanced Research Computing 2019 : Rise of the Machines (learning) : July 28-August 1, 2019, Chicago, Illinois. Practice and Experience in Advanced Research Computing (Conference) (2019 : Chicago, Il...

Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning

Pub Date : 2019-07-01 Epub Date: 2019-07-28 DOI: 10.1145/3332186.3333041

Mohammed Tanash, Brandon Dunn, Daniel Andresen, William Hsu, Huichen Yang, Adedolapo Okanlawon

High-Performance Computing (HPC) systems are resources utilized for data capture, sharing, and analysis. The majority of our HPC users come from disciplines other than Computer Science. HPC users, including computer scientists, have difficulty deciding how many resources their submitted jobs require on the cluster and do not feel proficient enough to do so. Consequently, users are encouraged to over-estimate resources for their submitted jobs so that the jobs will not be killed due to insufficient resources. This practice wastes HPC resources and leads to inefficient cluster utilization. We created a supervised machine learning model and integrated it into the Slurm resource manager simulator to predict the amount of memory a job requires and the time required to run the computation. Our model uses different machine learning algorithms. Our goal is to integrate and test the proposed supervised machine learning model on Slurm. We used over 10,000 tasks selected from our HPC log files to evaluate the performance and accuracy of our integrated model. The purpose of our work is to increase the performance of Slurm, and thereby improve the utilization of the HPC system, by predicting the memory and the time required for each particular job with our integrated supervised machine learning model. Our results indicate that, for larger jobs, our model dramatically reduces computational turnaround time (from five days to ten hours for large jobs), substantially increases utilization of the HPC system, and decreases the average waiting time for submitted jobs.
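The abstract does not name the features or the specific algorithms used, so the following is only a minimal sketch of the general idea: learn from past jobs how actual usage relates to what was requested, then predict memory and runtime for a new job. It assumes a single-feature least-squares regression as a stand-in for the paper's models, and the job records, field order, and function names (`fit_line`, `train`, `predict`) are hypothetical, not taken from the paper or from Slurm.

```python
from statistics import mean

# Hypothetical historical job records, invented for illustration only:
# (requested_mem_gb, requested_hours, used_mem_gb, used_hours)
HISTORY = [
    (16.0, 24.0, 4.0, 6.0),
    (32.0, 48.0, 8.0, 12.0),
    (64.0, 96.0, 16.0, 24.0),
    (8.0, 12.0, 2.0, 3.0),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def train(history):
    """Fit one regressor for memory and one for runtime (an assumed design)."""
    mem_model = fit_line([r[0] for r in history], [r[2] for r in history])
    time_model = fit_line([r[1] for r in history], [r[3] for r in history])
    return mem_model, time_model


def predict(models, requested_mem_gb, requested_hours):
    """Predict actual memory (GB) and runtime (hours) from the user's request."""
    (a_mem, b_mem), (a_time, b_time) = models
    return a_mem * requested_mem_gb + b_mem, a_time * requested_hours + b_time


models = train(HISTORY)
pred_mem, pred_hours = predict(models, 40.0, 60.0)
print(f"predicted memory: {pred_mem:.1f} GB, predicted runtime: {pred_hours:.1f} h")
```

In a setup like the one the abstract describes, predictions of this kind would replace the user's over-estimated request before the scheduler places the job; how the authors wire this into the Slurm simulator is not stated in the abstract.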

Citations: 23
