
PEARC19 : Practice and Experience in Advanced Research Computing 2019 : Rise of the Machines (learning) : July 28-August 1, 2019, Chicago, Illinois. Practice and Experience in Advanced Research Computing (Conference) (2019 : Chicago, Il...

Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning.
Mohammed Tanash, Brandon Dunn, Daniel Andresen, William Hsu, Huichen Yang, Adedolapo Okanlawon

High-Performance Computing (HPC) systems are resources used for data capture, sharing, and analysis. The majority of our HPC users come from disciplines other than Computer Science. HPC users, including computer scientists, have difficulty deciding how many resources their submitted jobs require on the cluster and do not feel proficient enough to do so. Consequently, users are encouraged to over-estimate the resources for their submitted jobs so that the jobs will not be killed due to insufficient resources. This practice wastes HPC resources and leads to inefficient cluster utilization. We created a supervised machine learning model and integrated it into the Slurm resource manager simulator to predict the amount of memory a job requires and the amount of time needed to run the computation. Our model uses several different machine learning algorithms. Our goal is to integrate and test the proposed supervised machine learning model on Slurm. We used over 10000 tasks selected from our HPC log files to evaluate the performance and accuracy of our integrated model. The purpose of our work is to increase the performance of Slurm, and thereby improve the utilization of the HPC system, by using the integrated model to predict the memory and the time each particular job requires. Our results indicate that the model dramatically reduces computational turnaround time for large jobs (from five days to ten hours), substantially increases utilization of the HPC system, and decreases the average waiting time for submitted jobs.
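
The abstract does not name the algorithms or the input features, so the following is only a minimal sketch of the general idea: learn from past job records what jobs actually used, then predict memory and run time for a new submission. It uses a k-nearest-neighbors average as a stand-in technique. The `JobRecord` fields, the `predict` helper, and every number in `HISTORY` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class JobRecord:
    # Hypothetical fields; the abstract does not list the features used.
    req_mem_gb: float   # memory the user requested
    req_hours: float    # time limit the user requested
    used_mem_gb: float  # memory the job actually used (training label)
    used_hours: float   # time the job actually ran (training label)


def predict(history, req_mem_gb, req_hours, k=3):
    """Average the actual usage of the k past jobs with the most similar requests."""
    def distance(job):
        return sqrt((job.req_mem_gb - req_mem_gb) ** 2
                    + (job.req_hours - req_hours) ** 2)

    nearest = sorted(history, key=distance)[:k]
    mem = sum(job.used_mem_gb for job in nearest) / len(nearest)
    hours = sum(job.used_hours for job in nearest) / len(nearest)
    return mem, hours


# Made-up records for illustration only: three over-estimated jobs, two small ones.
HISTORY = [
    JobRecord(64, 48, 8, 5),
    JobRecord(64, 48, 10, 6),
    JobRecord(60, 50, 9, 4),
    JobRecord(4, 1, 3, 0.5),
    JobRecord(8, 2, 6, 1.5),
]

mem, hours = predict(HISTORY, req_mem_gb=64, req_hours=48)
print(f"predicted memory: {mem:.1f} GB, predicted run time: {hours:.1f} h")
```

In this toy data the user asks for 64 GB and 48 hours while similar past jobs used far less, which is the over-estimation pattern the abstract describes; a scheduler given the smaller predicted values could pack jobs more tightly.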

DOI: 10.1145/3332186.3333041 (https://doi.org/10.1145/3332186.3333041)
Publication date: 2019-07-01
Citations: 23