ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG

IF 6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-07-19 DOI:10.1109/TPDS.2024.3431189

Jiaxing Qi;Wencong Xiao;Mingzhen Li;Chaojie Yang;Yong Li;Wei Lin;Hailong Yang;Zhongzhi Luan;Depei Qian

{"title":"ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG","authors":"Jiaxing Qi;Wencong Xiao;Mingzhen Li;Chaojie Yang;Yong Li;Wei Lin;Hailong Yang;Zhongzhi Luan;Depei Qian","doi":"10.1109/TPDS.2024.3431189","DOIUrl":null,"url":null,"abstract":"As deep learning (DL) technologies become ubiquitous, GPU clusters are deployed for inference tasks with consistent service level objectives (SLOs). Efficiently utilizing multiple GPUs is crucial for throughput and cost-effectiveness. This article addresses the challenges posed by dynamic input and NVIDIA MIG in scheduling DL workloads. We present ElasticBatch, a scheduling system that simplifies configuration through bucketization and employs a machine learning-based pipeline to optimize settings. Our experiments demonstrate that ElasticBatch achieves a 50% reduction in GPU instances compared to MIG disablement, increases GPU utilization by 1.4% to 6.5% over an ideal scheduler and significantly reduces profiling time. This research contributes to the discourse on efficient utilization of GPU clusters. ElasticBatch's effectiveness in mitigating challenges posed by dynamic inputs and NVIDIA MIG underscores its potential to optimize GPU cluster performance, providing tangible benefits in terms of reduced instances, increased utilization, and significant time savings in real-world deployment scenarios.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1708-1720"},"PeriodicalIF":6.0000,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Parallel and Distributed Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10605084/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

As deep learning (DL) technologies become ubiquitous, GPU clusters are deployed for inference tasks with consistent service level objectives (SLOs). Efficiently utilizing multiple GPUs is crucial for throughput and cost-effectiveness. This article addresses the challenges posed by dynamic input and NVIDIA MIG in scheduling DL workloads. We present ElasticBatch, a scheduling system that simplifies configuration through bucketization and employs a machine learning-based pipeline to optimize settings. Our experiments demonstrate that ElasticBatch achieves a 50% reduction in GPU instances compared to MIG disablement, increases GPU utilization by 1.4% to 6.5% over an ideal scheduler and significantly reduces profiling time. This research contributes to the discourse on efficient utilization of GPU clusters. ElasticBatch's effectiveness in mitigating challenges posed by dynamic inputs and NVIDIA MIG underscores its potential to optimize GPU cluster performance, providing tangible benefits in terms of reduced instances, increased utilization, and significant time savings in real-world deployment scenarios.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

ElasticBatch：用于 MIG 批量推理的学习增强型弹性调度系统

随着深度学习（DL）技术的普及，GPU 集群被部署用于具有一致服务水平目标（SLO）的推理任务。有效利用多个 GPU 对于提高吞吐量和成本效益至关重要。本文探讨了动态输入和英伟达™ MIG 在调度 DL 工作负载时带来的挑战。我们介绍了 ElasticBatch 调度系统，该系统通过桶化来简化配置，并采用基于机器学习的管道来优化设置。我们的实验证明，与禁用 MIG 相比，ElasticBatch 可减少 50% 的 GPU 实例，与理想的调度程序相比，GPU 利用率提高了 1.4% 至 6.5%，并显著减少了剖析时间。这项研究为高效利用 GPU 集群的讨论做出了贡献。ElasticBatch 在缓解动态输入和英伟达 MIG 带来的挑战方面的有效性凸显了其优化 GPU 群集性能的潜力，在实际部署场景中，它在减少实例、提高利用率和显著节省时间方面带来了实实在在的好处。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE Transactions on Parallel and Distributed Systems 工程技术-工程：电子与电气

CiteScore

11.00

自引率

9.40%

发文量

281

审稿时长

5.6 months

期刊介绍： IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to: a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing. b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems. c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation. d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.