LLload: Simplifying Real-Time Job Monitoring for HPC Users
Chansup Byun, Julia Mullen, Albert Reuther, William Arcand, William Bergeron, David Bestor, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner, Lauren Milechin
arXiv - CS - Performance, published 2024-07-01, arXiv:2407.01481
Abstract
One of the more complex tasks for researchers using HPC systems is
performance monitoring and tuning of their applications. Developing a practice
of continuous performance improvement, both for speed-up and for efficient use of
resources, is essential to the long-term success of both the HPC practitioner
and the research project. Profiling tools provide a nice view of the
performance of an application but often have a steep learning curve and rarely
provide an easy-to-interpret view of resource utilization. Lower-level tools
such as top and htop provide a view of resource utilization for those familiar
and comfortable with Linux but present a barrier to newer HPC practitioners. To expand
the existing profiling and job monitoring options, the MIT Lincoln Laboratory
Supercomputing Center created LLload, a tool that captures a snapshot of the
resources being used by a job on a per-user basis. LLload is built from
standard HPC tools and provides an easy way for a researcher to track the resource
usage of active jobs. We explain how the tool was designed and implemented and
provide insight into how it is used to help new researchers develop their
performance-monitoring skills and to guide researchers in their resource
requests.