Deep neural networks (DNNs) have gained popularity in many fields, such as computer vision and natural language processing. However, growing data sizes and model complexity have made training DNNs time-consuming. While distributed DNN training using multiple GPUs in parallel is a common solution, it introduces challenges in GPU resource management and scheduling. One key challenge is minimizing communication costs among the GPUs assigned to a DNN training job. High communication costs, arising from factors such as inter-rack or inter-machine data transfers, can lead to hardware bottlenecks and network delays, ultimately slowing down training. Reducing these costs enables more efficient data transfer and synchronization, directly accelerating the training process. Although deep reinforcement learning (DRL) has shown promise in GPU resource scheduling, existing methods often ignore hardware topology. Moreover, most proposed GPU schedulers overlook the possibility of combining heuristic and DRL policies. In response to these challenges, we introduce a hybrid scheduler that integrates DRL and heuristic methods to improve GPU job scheduling. Our scheduler uses a multi-branch convolutional neural network (CNN) model for job selection and a heuristic method for GPU allocation. At each time step, the CNN model selects a job, and the heuristic method then allocates to it the available GPUs in the cluster that are closest to each other. Reinforcement learning (RL) is used to train the CNN model to select the job that maximizes throughput-based rewards. Extensive evaluation on datasets of real jobs shows that our scheduler significantly outperforms six baseline schedulers that use heuristics or other DRL models for job selection and resource allocation.
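The topology-aware allocation step described above can be illustrated with a minimal sketch. This is not the paper's implementation; the `pair_cost` values (same machine = 0, same rack = 1, across racks = 2) and the brute-force search over free GPUs are illustrative assumptions showing one way to pick the set of available GPUs closest to each other:

```python
from itertools import combinations

# Hypothetical cost model: each GPU location is a (rack, machine) pair.
# Cost is 0 on the same machine, 1 within a rack, 2 across racks.
def pair_cost(loc_a, loc_b):
    if loc_a == loc_b:
        return 0
    return 1 if loc_a[0] == loc_b[0] else 2

def closest_gpus(free_gpus, k):
    """Choose k free GPUs minimizing total pairwise communication cost.

    free_gpus: dict mapping gpu_id -> (rack, machine).
    Brute-force search over combinations; adequate for illustration
    on small clusters, not for production-scale scheduling.
    """
    best, best_cost = None, float("inf")
    for combo in combinations(free_gpus, k):
        cost = sum(pair_cost(free_gpus[a], free_gpus[b])
                   for a, b in combinations(combo, 2))
        if cost < best_cost:
            best, best_cost = set(combo), cost
    return best

# Example: two GPUs share a machine, one is elsewhere in the rack,
# one is in another rack; a 2-GPU job gets the co-located pair.
free = {"g0": (0, 0), "g1": (0, 0), "g2": (0, 1), "g3": (1, 2)}
print(closest_gpus(free, 2))  # -> {'g0', 'g1'}
```

In a real scheduler this step would run after the DRL model has picked the job, with `k` set to the job's requested GPU count.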
