LLload: Simplifying Real-Time Job Monitoring for HPC Users

arXiv - CS - Performance, Pub Date: 2024-07-01, DOI: arxiv-2407.01481

Chansup Byun, Julia Mullen, Albert Reuther, William Arcand, William Bergeron, David Bestor, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner, Lauren Milechin



Abstract

One of the more complex tasks for researchers using HPC systems is performance monitoring and tuning of their applications. Developing a practice of continuous performance improvement, both for speed-up and for efficient use of resources, is essential to the long-term success of both the HPC practitioner and the research project. Profiling tools provide a good view of the performance of an application but often have a steep learning curve and rarely provide an easy-to-interpret view of resource utilization. Lower-level tools such as top and htop provide a view of resource utilization for those familiar and comfortable with Linux but present a barrier for newer HPC practitioners. To expand the existing profiling and job monitoring options, the MIT Lincoln Laboratory Supercomputing Center created LLload, a tool that captures a snapshot of the resources being used by a job on a per-user basis. LLload is built from standard HPC tools and provides an easy way for a researcher to track resource usage of active jobs. We explain how the tool was designed and implemented and provide insight into how it is used to aid new researchers in developing their performance monitoring skills as well as to guide researchers in their resource requests.


