AISAW: An adaptive interference-aware scheduling algorithm for acceleration of deep learning workloads training on distributed heterogeneous systems

Impact factor: 6.2 · Region 2 (Computer Science) · JCR Q1 (Computer Science, Theory & Methods) · Future Generation Computer Systems-The International Journal of Escience · Pub Date: 2024-12-06 · DOI: 10.1016/j.future.2024.107642
Yushen Bi, Yupeng Xi, Chao Jing
Volume 166, Article 107642. Source: https://www.sciencedirect.com/science/article/pii/S0167739X2400606X
Citations: 0

Abstract

Owing to the widespread application of artificial intelligence, deep learning (DL) has attracted considerable attention from both academia and industry. Training DL workloads is a key step in determining the quality of DL-based applications. However, because conventional centralized clusters have limited computational power, it is more beneficial to accelerate training by placing workloads on distributed heterogeneous systems. Unfortunately, current scheduling algorithms do not account for the varying capabilities of nodes or the limited network bandwidth, which leads to poor performance in distributed heterogeneous systems. To address this problem, we propose an adaptive interference-aware scheduling algorithm for accelerating DL workloads (called AISAW). We first established a predictive model, consisting of a job performance model and an interference-aware model, to reduce the impact of job co-location. Subsequently, to improve system efficiency, we developed an adaptive priority-aware allocation scheme (APS) that adaptively allocates DL jobs to computing nodes to find the optimal performance match. In addition, under the constraint of network bandwidth, we devised a deadline-aware overhead minimization dynamic migration scheme (DOMS) to avoid the high overhead caused by frequent job migration. Finally, we conducted experiments on real distributed heterogeneous systems deployed with several GPU-based servers. The results demonstrate that AISAW improves system efficiency, decreasing the makespan and average job completion time (JCT) by at least 23.86% and 13.02%, respectively, compared to state-of-the-art algorithms such as Gandiva, Tiresias, and MLF-H.
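The two metrics reported in the abstract, makespan and average JCT (job completion time), can be illustrated with a minimal Python sketch. This is not the authors' code or evaluation harness: the `JobRecord` type, the helper names, and the sample timestamps are all hypothetical, and the definitions used are the common ones (makespan as the span from the earliest submission to the latest completion, JCT as completion time minus submission time).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical per-job timing record (seconds since experiment start)."""
    submit: float
    finish: float


def makespan(jobs: Sequence[JobRecord]) -> float:
    # Time from the earliest submission to the latest completion.
    return max(j.finish for j in jobs) - min(j.submit for j in jobs)


def average_jct(jobs: Sequence[JobRecord]) -> float:
    # Mean of (finish - submit) over all jobs.
    return sum(j.finish - j.submit for j in jobs) / len(jobs)


def reduction_pct(baseline: float, improved: float) -> float:
    # Relative reduction of a metric with respect to a baseline, in percent.
    return 100.0 * (baseline - improved) / baseline


# Made-up timings for two schedulers running the same three jobs.
baseline_run = [JobRecord(0, 100), JobRecord(10, 250), JobRecord(20, 400)]
candidate_run = [JobRecord(0, 90), JobRecord(10, 200), JobRecord(20, 300)]

makespan_gain = reduction_pct(makespan(baseline_run), makespan(candidate_run))
jct_gain = reduction_pct(average_jct(baseline_run), average_jct(candidate_run))
print(f"makespan reduced by {makespan_gain:.2f}%, average JCT by {jct_gain:.2f}%")
```

The sample numbers are illustrative only and bear no relation to the 23.86% and 13.02% figures reported for AISAW.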
Source journal
CiteScore: 19.90
Self-citation rate: 2.70%
Articles published: 376
Review time: 10.6 months
About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.