L. Schares, A. Tantawi, P. Maniotis, Ming-Hung Chen, Claudia Misale, Seetharami R. Seelam, Hao Yu
{"title":"Chic-sched: a HPC Placement-Group Scheduler on Hierarchical Topologies with Constraints","authors":"L. Schares, A. Tantawi, P. Maniotis, Ming-Hung Chen, Claudia Misale, Seetharami R. Seelam, Hao Yu","doi":"10.1109/IPDPS54959.2023.00050","DOIUrl":null,"url":null,"abstract":"Efficient placement of advanced HPC and AI workloads with application constraints is raising challenges for resource schedulers on shared infrastructures, such as the Cloud. In this work, we propose a novel Constraints- and Heuristics-based scheduler on HIerarchical Topologies for High-Performance Computing workloads in the Cloud (chic-sched, for short). Our heuristics-based algorithm enables placement across multiple levels in a network hierarchy with loosely specified constraints, and it works without retries by providing suboptimal placements to minimize placement failures. This allows for fast scheduling at scale, and the O(N log N) complexity enables placement decisions within tens of milliseconds for groups of hundreds of virtual machines (VM). We introduce a new and simple metric to quantify the goodness of group placements. With this metric, in terms of deviation from ideal placements, we show that chic-sched is 20-50% better than the common bestFit or worstFit algorithms in all scenarios of two-level placements with spreading and packing constraints. We evaluate chic-sched with publicly available VM-request traces from a production Cloud, and, comparing against bestFit, we show that it achieves 8% lower placement failure rates and more than 40% better placement locality. Finally, to quantify the goodness of constraints-based placements, we conduct experiments with a realistic MPI workload on synthetically allocated VM clusters in a public cloud. We measure a 9% performance improvement over an adverse placement in a scenario where our heuristics-based scheduler would return a good, but not perfect, placement.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS54959.2023.00050","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Efficient placement of advanced HPC and AI workloads with application constraints is raising challenges for resource schedulers on shared infrastructures, such as the Cloud. In this work, we propose a novel Constraints- and Heuristics-based scheduler on HIerarchical Topologies for High-Performance Computing workloads in the Cloud (chic-sched, for short). Our heuristics-based algorithm enables placement across multiple levels in a network hierarchy with loosely specified constraints, and it works without retries by providing suboptimal placements to minimize placement failures. This allows for fast scheduling at scale, and the O(N log N) complexity enables placement decisions within tens of milliseconds for groups of hundreds of virtual machines (VM). We introduce a new and simple metric to quantify the goodness of group placements. With this metric, in terms of deviation from ideal placements, we show that chic-sched is 20-50% better than the common bestFit or worstFit algorithms in all scenarios of two-level placements with spreading and packing constraints. We evaluate chic-sched with publicly available VM-request traces from a production Cloud, and, comparing against bestFit, we show that it achieves 8% lower placement failure rates and more than 40% better placement locality. Finally, to quantify the goodness of constraints-based placements, we conduct experiments with a realistic MPI workload on synthetically allocated VM clusters in a public cloud. We measure a 9% performance improvement over an adverse placement in a scenario where our heuristics-based scheduler would return a good, but not perfect, placement.