{"title":"Scalable, fault-tolerant job step management for high-performance systems","authors":"D. Solt;J. Hursey;A. Lauria;D. Guo;X. Guo","doi":"10.1147/JRD.2019.2958909","DOIUrl":null,"url":null,"abstract":"Scientific applications on the CORAL systems demanded a fault-tolerant, scalable job launch infrastructure for complex workflows with multiple job steps within an allocation. The distinct design of IBM's Job Step Manager (JSM) infrastructure, working in concert with Load Sharing Facility (LSF) and Cluster System Management (CSM), achieves these goals. JSM demonstrated launching over three-quarters of a million processes in under a minute while providing efficient process management interface for exascale-based services to communication libraries, such as parallel active messaging interface and message passing interface, and tools over the management network. JSM relies on the parallel task support library to provide a fault-tolerant, scalable communication medium between the JSM daemons. Application workflows using job steps harness the unique resource set abstraction concept in JSM to manage CPUs, GPUs, and memory between groups of processes, possibly in discrete job steps, sharing a node. The resource set concept gives JSM the opportunity to better organize process placement to optimize, for example, CPU-to-GPU communication. Applications that need complete control over the shaping of the resource sets and the placement, binding, and ordering of processes within them can leverage JSM's co-designed Explicit Resource File mechanism. This article explores the design decisions, implementation considerations, and performance optimizations of IBM's JSM infrastructure to support scientific discovery on the CORAL systems.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 3/4","pages":"8:1-8:9"},"PeriodicalIF":1.3000,"publicationDate":"2019-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2958909","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IBM Journal of Research and Development","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/8930300/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 1
Abstract
Scientific applications on the CORAL systems demanded a fault-tolerant, scalable job launch infrastructure for complex workflows with multiple job steps within an allocation. The distinct design of IBM's Job Step Manager (JSM) infrastructure, working in concert with Load Sharing Facility (LSF) and Cluster System Management (CSM), achieves these goals. JSM demonstrated launching over three-quarters of a million processes in under a minute while providing efficient process management interface for exascale-based services to communication libraries, such as parallel active messaging interface and message passing interface, and tools over the management network. JSM relies on the parallel task support library to provide a fault-tolerant, scalable communication medium between the JSM daemons. Application workflows using job steps harness the unique resource set abstraction concept in JSM to manage CPUs, GPUs, and memory between groups of processes, possibly in discrete job steps, sharing a node. The resource set concept gives JSM the opportunity to better organize process placement to optimize, for example, CPU-to-GPU communication. Applications that need complete control over the shaping of the resource sets and the placement, binding, and ordering of processes within them can leverage JSM's co-designed Explicit Resource File mechanism. This article explores the design decisions, implementation considerations, and performance optimizations of IBM's JSM infrastructure to support scientific discovery on the CORAL systems.
期刊介绍:
The IBM Journal of Research and Development is a peer-reviewed technical journal, published bimonthly, which features the work of authors in the science, technology and engineering of information systems. Papers are written for the worldwide scientific research and development community and knowledgeable professionals.
Submitted papers are welcome from the IBM technical community and from non-IBM authors on topics relevant to the scientific and technical content of the Journal.