Title: AISAW: An adaptive interference-aware scheduling algorithm for acceleration of deep learning workloads training on distributed heterogeneous systems
Authors: Yushen Bi, Yupeng Xi, Chao Jing
DOI: 10.1016/j.future.2024.107642
Journal: Future Generation Computer Systems - The International Journal of eScience, vol. 166, Article 107642 (JCR Q1, Computer Science, Theory & Methods; impact factor 6.2)
Publication date: 2024-12-06
URL: https://www.sciencedirect.com/science/article/pii/S0167739X2400606X
Citations: 0
Abstract
Owing to the widespread application of artificial intelligence, deep learning (DL) has attracted considerable attention from both academia and industry. The DL workload-training process is a key step in determining the quality of DL-based applications. However, because of the limited computational power of conventional centralized clusters, it is more beneficial to accelerate workload training by placing workloads in distributed heterogeneous systems. Unfortunately, current scheduling algorithms account for neither the varying capabilities of nodes nor the limited network bandwidth, which leads to poor performance in distributed heterogeneous systems. To address this problem, we propose an adaptive interference-aware scheduling algorithm for accelerating DL workloads (AISAW). We first establish a predictive model, consisting of a job performance model and an interference-aware model, to reduce the impact of job co-location. Subsequently, to improve system efficiency, we develop an adaptive priority-aware allocation scheme (APS) that adaptively allocates DL jobs to the computing nodes on which they perform best. In addition, under the constraint of network bandwidth, we devise a deadline-aware overhead-minimization dynamic migration scheme (DOMS) to avoid the high overhead caused by frequent job migration. Finally, we conduct experiments on real distributed heterogeneous systems deployed with several GPU-based servers. The results demonstrate that AISAW improves system efficiency, decreasing the makespan and average job completion time (JCT) by at least 23.86% and 13.02%, respectively, compared to state-of-the-art algorithms such as Gandiva, Tiresias, and MLF-H.
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.