IRLS: An Improved Reinforcement Learning Scheduler for High Performance Computing Systems

Thanh Hoang Le Hai, Luan Le Dinh, Dat Ngo Tien, Dat Bui Huu Tien, N. Thoai

2023 International Conference on System Science and Engineering (ICSSE)
Published: 2023-07-27 · DOI: 10.1109/ICSSE58758.2023.10227229
Citations: 0
Abstract
Exploiting current High Performance Computing (HPC) systems is a critical task for resolving urgent worldwide problems. However, existing scheduling heuristics such as First Come First Served (FCFS) have limitations in dealing with the increasing complexity of computing systems and the dynamic nature of application workloads. Reinforcement learning (RL) has emerged as a promising approach to designing HPC schedulers that can learn to adapt to dynamic system configurations and workload conditions. However, existing RL-based schedulers often lack the ability to incorporate important identity features of jobs and do not consider user behavior. To address these limitations, we propose an improvement to the latest Deep Reinforcement Learning Agent for Scheduling (DRAS) model, called Improved Reinforcement Learning Scheduler (IRLS). The IRLS model incorporates additional identity features in the state definition to recognize similarities between tasks from the same source and utilizes an empirical approach to perform job runtime prediction. Our experiments demonstrate that by using the IRLS model, we can significantly improve the performance of real-life HPC workloads, with improvements of up to 15.4% compared to the original DRAS model and 35.7% compared to FCFS.
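To make the two ideas in the abstract concrete, here is a minimal illustrative sketch (not the paper's actual implementation): an FCFS baseline that schedules jobs strictly in arrival order, and a toy per-job state vector that, besides resource demand, carries identity features (user/group IDs) of the kind IRLS adds to the DRAS state so the agent can recognize jobs from the same source. All field names are hypothetical.

```python
def fcfs_schedule(jobs, total_nodes):
    """Schedule jobs strictly in arrival order (First Come First Served).

    jobs: list of dicts with 'id', 'arrival', 'nodes', 'runtime'.
    Assumes every job requests at most total_nodes nodes.
    Returns a list of (job_id, start_time) tuples in scheduling order.
    """
    jobs = sorted(jobs, key=lambda j: j["arrival"])
    running = []          # (finish_time, nodes) of jobs currently executing
    free = total_nodes
    clock = 0
    schedule = []
    for job in jobs:
        clock = max(clock, job["arrival"])
        # Head-of-line blocking: the next job waits until enough nodes free up,
        # even if a smaller job behind it could run now.
        running.sort()
        while job["nodes"] > free:
            finish, n = running.pop(0)
            clock = max(clock, finish)
            free += n
        # Also release any jobs that already finished by the current time.
        while running and running[0][0] <= clock:
            _, n = running.pop(0)
            free += n
        schedule.append((job["id"], clock))
        free -= job["nodes"]
        running.append((clock + job["runtime"], job["nodes"]))
    return schedule


def job_state(job):
    """Toy RL state vector for one job (hypothetical feature layout).

    Alongside the usual demand features (nodes, estimated runtime), it
    includes identity features (user and group IDs) so a learned policy
    can exploit similarities between jobs submitted by the same source.
    """
    return [job["nodes"], job["runtime_estimate"],
            job["user_id"], job["group_id"]]
```

For example, with 4 nodes and three 2-node jobs arriving at t=0, 1, 2, FCFS runs the first two immediately and makes the third wait for the earliest finish, which is exactly the rigidity an adaptive RL policy could learn to avoid.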