Scalable resource management in high performance computers

Proceedings. IEEE International Conference on Cluster Computing Pub Date : 2002-09-23 DOI:10.1109/CLUSTR.2002.1137759

E. Frachtenberg, F. Petrini, Juan Fernández Peinador, S. Coll


Citations: 5

Abstract

Clusters of workstations have emerged as an important platform for building cost-effective, scalable, and highly available computers. Although many hardware solutions are available today, the largest challenge in making large-scale clusters usable lies in the system software. In this paper we present STORM, a resource management tool designed to provide scalability, low overhead, and the flexibility necessary to efficiently support and analyze a wide range of job-scheduling algorithms. STORM achieves these feats by using a small set of primitive mechanisms that are common in modern high-performance interconnects. The architecture of STORM is based on three main technical innovations. First, a part of the scheduler runs in the thread processor located on the network interface. Second, we use highly scalable hardware collectives both for implementing control heartbeats and for distributing the binary of a parallel job in near-constant time. Third, we use an I/O bypass protocol that allows fast data movement from the file system to the communication buffers in the network interface and vice versa. The experimental results show that STORM can launch a job with a binary of 12 MB on a 64-processor, 32-node cluster in less than 250 ms. This paper provides experimental and analytical evidence that these results scale to a much larger number of nodes. To the best of our knowledge, STORM significantly outperforms existing production schedulers in launching jobs, performing resource management tasks, and gang-scheduling tasks.
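To make the launch figures concrete, the short Python sketch below works through the arithmetic implied by the abstract and contrasts three ways of delivering a binary to every node. It is an illustrative toy model, not the analytical model from the paper: the function names, the decimal-megabyte convention, and the idea of counting "transfer rounds" (with a separate management node as the source) are all assumptions introduced here.

```python
def implied_rate_mb_per_s(binary_mb: float, launch_s: float) -> float:
    """Lower bound on the effective delivery rate seen by each node.

    If a binary of `binary_mb` megabytes is launched in at most `launch_s`
    seconds, every node received it at no less than this rate, with all
    overheads (file read, transfer, process start) included.
    """
    if launch_s <= 0:
        raise ValueError("launch_s must be positive")
    return binary_mb / launch_s


def transfer_rounds(nodes: int, scheme: str) -> int:
    """Full-binary transfer rounds needed to reach `nodes` destinations.

    Hypothetical schemes, assuming one separate source node:
      "sequential": the source sends to one destination at a time.
      "tree":       binomial software tree, holders double every round.
      "multicast":  one hardware-collective send reaches all nodes.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if scheme == "sequential":
        return nodes
    if scheme == "tree":
        # Smallest r with 2**r >= nodes + 1 (source plus destinations).
        return nodes.bit_length()
    if scheme == "multicast":
        return 1
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    # Figures reported in the abstract: 12 MB binary, under 250 ms, 32 nodes.
    print("rate lower bound (MB/s):", implied_rate_mb_per_s(12, 0.25))
    for scheme in ("sequential", "tree", "multicast"):
        print(scheme, "rounds for 32 nodes:", transfer_rounds(32, scheme))
```

Under these assumptions, only the multicast round count is independent of the node count, which is one informal way to read the "near-constant time" claim; real launch time also depends on file-system reads, flow control, and process creation, which this sketch ignores.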


