Title: AISAW: An adaptive interference-aware scheduling algorithm for acceleration of deep learning workloads training on distributed heterogeneous systems
Authors: Yushen Bi, Yupeng Xi, Chao Jing
DOI: 10.1016/j.future.2024.107642
Journal: Future Generation Computer Systems - The International Journal of eScience, vol. 166, Article 107642 (JCR Q1, Computer Science, Theory & Methods; impact factor 6.2)
Publication date: 2024-12-06
URL: https://www.sciencedirect.com/science/article/pii/S0167739X2400606X
Citations: 0
Abstract
Owing to the widespread application of artificial intelligence, deep learning (DL) has attracted considerable attention from both academia and industry. The DL workload-training process is a key step in determining the quality of DL-based applications. However, because of the limited computational power of conventional centralized clusters, it is more beneficial to accelerate workload training by placing workloads in distributed heterogeneous systems. Unfortunately, current scheduling algorithms account for neither the varying capabilities of nodes nor the limited network bandwidth, which leads to poor performance in distributed heterogeneous systems. To address this problem, we propose an adaptive interference-aware scheduling algorithm for accelerating DL workloads (AISAW). We first establish a predictive model, consisting of a job performance model and an interference-aware model, to reduce the impact of job co-location. Subsequently, to improve system efficiency, we develop an adaptive priority-aware allocation scheme (APS) that adaptively allocates DL jobs to the computing nodes on which they perform best. In addition, under the constraint of network bandwidth, we devise a deadline-aware overhead-minimization dynamic migration scheme (DOMS) to avoid the high overhead caused by frequent job migration. Finally, we conduct experiments on real distributed heterogeneous systems deployed with several GPU-based servers. The results demonstrate that AISAW improves system efficiency, decreasing the makespan and average job completion time (JCT) by at least 23.86% and 13.02%, respectively, compared to state-of-the-art algorithms such as Gandiva, Tiresias, and MLF-H.
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.