JuMonC: A RESTful tool for enabling monitoring and control of simulations at scale

Impact Factor: 6.2; Zone 2 (Computer Science); JCR Q1 (Computer Science, Theory & Methods); Journal: Future Generation Computer Systems-The International Journal of Escience; Pub Date: 2024-09-23; DOI: 10.1016/j.future.2024.107541
Christian Witzler, Filipe Souza Mendes Guimarães, Daniel Mira, Hartwig Anzt, Jens Henrik Göbbert, Wolfgang Frings, Mathis Bode
Published in volume 164 as Article 107541. Publisher page: https://www.sciencedirect.com/science/article/pii/S0167739X24005053
Citation count: 0

Abstract

As systems and simulations grow in size and complexity, it is challenging to maintain efficient use of resources and avoid failures. In this scenario, monitoring becomes even more important and mandatory. This paper describes and discusses the benefits of the advanced monitoring and control tool JuMonC, which runs under user control alongside HPC simulations and provides valuable metrics via REST-API. In addition, plugin extensibility allows JuMonC to go a step further and provide computational steering of the simulation itself. To demonstrate the benefits and usability of JuMonC for large-scale simulations, two use cases are described employing nekRS and ICON on JURECA-DC, a supercomputer located at the Jülich Supercomputing Centre (JSC). Furthermore, a large-scale use case with nekRS on JSC’s flagship system JUWELS Booster is described. Finally, the interplay between JuMonC and LLview (a standard monitoring tool for HPC systems) is presented using a simple and secure JuMonC-LLview plugin, which collects performance metrics and enables their analysis in LLview. Overall, the portability and usefulness of JuMonC, together with its low performance impact, make it an important application for both current and future generations of exascale HPC systems.
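The abstract describes a pattern in which a monitoring service runs next to the simulation, exposes metrics over REST, and can also accept steering requests. The following is a minimal sketch of that general pattern using only the Python standard library. It is not JuMonC's actual API: the paths `/metrics` and `/steer`, the `state` dictionary, and the helpers `start_monitor` and `request_json` are all hypothetical names chosen for illustration, and the sketch omits the authentication that a real deployment on a shared HPC system would need.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical shared state: a simulation loop would update "step"
# and read "timestep_size" at the start of each iteration.
state = {"step": 0, "timestep_size": 0.01}
state_lock = threading.Lock()
STEERABLE = {"timestep_size"}  # parameters a client may change


class MonitorHandler(BaseHTTPRequestHandler):
    """Serves metrics on GET /metrics and accepts steering on POST /steer."""

    def _send_json(self, code, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/metrics":
            with state_lock:
                snapshot = dict(state)
            self._send_json(200, snapshot)
        else:
            self._send_json(404, {"error": "unknown path"})

    def do_POST(self):
        if self.path != "/steer":
            self._send_json(404, {"error": "unknown path"})
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        with state_lock:
            # Apply only whitelisted parameters; ignore everything else.
            for key, value in request.items():
                if key in STEERABLE:
                    state[key] = value
            snapshot = dict(state)
        self._send_json(200, snapshot)

    def log_message(self, fmt, *args):
        pass  # keep request logs out of the simulation's output


def start_monitor(port=0):
    """Start the server in a daemon thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), MonitorHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request_json(port, method, path, payload=None):
    """Small client helper: send a request and decode the JSON reply."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    body = json.dumps(payload) if payload is not None else None
    headers = {"Content-Type": "application/json"} if body else {}
    conn.request(method, path, body=body, headers=headers)
    data = json.loads(conn.getresponse().read())
    conn.close()
    return data
```

A client would call `start_monitor()`, read the chosen port from `server.server_address[1]`, then poll `request_json(port, "GET", "/metrics")` or send `request_json(port, "POST", "/steer", {"timestep_size": 0.005})`. Running the server in a daemon thread keeps it from blocking the simulation loop, which is one plausible way to keep the monitoring overhead small.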
Source journal
CiteScore: 19.90
Self-citation rate: 2.70%
Articles published: 376
Review time: 10.6 months
Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.