Elastic job bundling: an adaptive resource request strategy for large-scale parallel applications

SC15: International Conference for High Performance Computing, Networking, Storage and Analysis
Pub Date: 2015-11-15
DOI: 10.1145/2807591.2807610

Feng Liu, J. Weissman

Citations: 21

Abstract

In today's batch-queue HPC cluster systems, the user submits a job requesting a fixed number of processors. The system will not start the job until all of the requested resources become available simultaneously. When cluster workload is high, this policy causes large jobs to experience long waiting times. In this paper, we propose a new approach that dynamically decomposes a large job into smaller ones to reduce waiting time, and lets the application expand across multiple subjobs while continuously making progress. This approach has three benefits: (i) application turnaround time is reduced, (ii) system fragmentation is diminished, and (iii) fairness is promoted. Our approach does not depend on job queue time prediction; instead, it exploits available backfill opportunities. Simulation results show that our approach can reduce application mean turnaround time by up to 48%.
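The contrast the abstract draws between a fixed-size request and a job that grows across subjobs can be illustrated with a toy model. This is only a sketch under strong assumptions, not the paper's algorithm or simulator: time is split into discrete steps, `free_procs[t]` is a hypothetical count of idle processors at step `t`, the application scales perfectly linearly, and a rigid job keeps its processors once it has started. The function names and the sample numbers are invented for illustration.

```python
from math import ceil


def rigid_turnaround(free_procs, requested, work):
    """Steps until a fixed-size job finishes, or None if it never starts.

    Assumption: the job waits for the first step at which `requested`
    processors are free at once, then holds them until it completes.
    """
    runtime = ceil(work / requested)
    for start, free in enumerate(free_procs):
        if free >= requested:
            return start + runtime
    return None


def elastic_turnaround(free_procs, requested, work):
    """Steps until an elastic job finishes, or None if it does not finish.

    Assumption: at every step the job runs on whatever is idle, capped at
    `requested`, so it makes progress before the full allocation exists.
    """
    remaining = work
    for step, free in enumerate(free_procs):
        remaining -= min(free, requested)
        if remaining <= 0:
            return step + 1
    return None


# Hypothetical trace: idle processors per time step on a busy cluster.
free_procs = [2, 4, 4, 8, 8, 8]
rigid = rigid_turnaround(free_procs, requested=8, work=16)
elastic = elastic_turnaround(free_procs, requested=8, work=16)
print(rigid, elastic)
```

In this trace the rigid job idles for three steps waiting for eight processors, while the elastic job has already completed ten of its sixteen units of work on smaller allocations by then. A real scheduler adds queueing, walltime limits, and per-subjob overheads that this sketch ignores.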


