Locality-Aware and Fault-Tolerant Batching for Machine Learning on Distributed Datasets

IF 5.3 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · IEEE Transactions on Cloud Computing · Pub Date: 2024-01-09 · DOI: 10.1109/TCC.2024.3351716
Liu Liu;Zhijun Ding;Dazhao Cheng;Xiaobo Zhou
{"title":"Locality-Aware and Fault-Tolerant Batching for Machine Learning on Distributed Datasets","authors":"Liu Liu;Zhijun Ding;Dazhao Cheng;Xiaobo Zhou","doi":"10.1109/TCC.2024.3351716","DOIUrl":null,"url":null,"abstract":"The performance of distributed ML training is largely determined by workers that generate gradients in the slowest pace, i.e., stragglers. The state-of-the-art load balancing approaches consider that each worker stores a complete dataset locally and the data fetching time can be ignored. They only consider the computation capacity of workers in equalizing the gradient computation time. However, we find that in scenarios of ML on distributed datasets, whether in edge computing or distributed data cache systems, the data fetching time is non-negligible and often becomes the primary cause of stragglers. In this paper, we present LOFT, an adaptive load balancing approach for ML upon distributed datasets at the edge. It aims to balance the time to generate gradients at each worker while ensuring the model accuracy. Specifically, LOFT features a locality-aware batching. It builds performance and optimization models upon data fetching and gradient computation time. Leveraging the models, it develops an adaptive scheme based on grid search. Furthermore, it offers Byzantine gradient aggregation upon Ring All-Reduce, which makes itself fault-tolerant under Byzantine gradients brought by a small batch size. Experiments with twelve public DNN models and four open datasets show that LOFT reduces the training time by up to 46%, while reducing the training loss by up to 67% compared to LB-BSP.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":null,"pages":null},"PeriodicalIF":5.3000,"publicationDate":"2024-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cloud Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10384826/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

The performance of distributed ML training is largely determined by the workers that generate gradients at the slowest pace, i.e., stragglers. State-of-the-art load balancing approaches assume that each worker stores a complete dataset locally and that the data fetching time can be ignored; they consider only the computation capacity of workers when equalizing the gradient computation time. However, we find that in scenarios of ML on distributed datasets, whether in edge computing or distributed data cache systems, the data fetching time is non-negligible and often becomes the primary cause of stragglers. In this paper, we present LOFT, an adaptive load balancing approach for ML on distributed datasets at the edge. It aims to balance the time to generate gradients at each worker while ensuring model accuracy. Specifically, LOFT features locality-aware batching. It builds performance and optimization models upon data fetching and gradient computation time. Leveraging these models, it develops an adaptive scheme based on grid search. Furthermore, it offers Byzantine gradient aggregation upon Ring All-Reduce, which makes it fault-tolerant under the Byzantine gradients brought by a small batch size. Experiments with twelve public DNN models and four open datasets show that LOFT reduces the training time by up to 46% and the training loss by up to 67% compared to LB-BSP.
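To make the locality-aware batching idea concrete, below is a minimal sketch of equal-time batch allocation, assuming a simplified linear cost model in which a worker's per-iteration time is (per-sample fetch cost + per-sample compute cost) times its batch size. The `WorkerProfile` fields, the closed-form allocation, and the example numbers are illustrative assumptions, not the paper's implementation; LOFT itself builds measured performance and optimization models and searches batch sizes via grid search.

```python
from dataclasses import dataclass

@dataclass
class WorkerProfile:
    fetch_per_sample: float    # seconds to fetch one training sample (0 if data is local)
    compute_per_sample: float  # seconds to compute gradients for one sample

def balance_batches(workers, global_batch):
    """Split global_batch so each worker's (fetch + compute) time is roughly equal."""
    # Per-sample cost of producing gradients on each worker.
    costs = [w.fetch_per_sample + w.compute_per_sample for w in workers]
    # Equal-time allocation under a linear cost model: b_i proportional to 1 / cost_i.
    inv = [1.0 / c for c in costs]
    total_inv = sum(inv)
    batches = [round(global_batch * x / total_inv) for x in inv]
    # Absorb rounding drift so the per-worker sizes still sum to the global batch.
    batches[0] += global_batch - sum(batches)
    return batches

if __name__ == "__main__":
    profiles = [
        WorkerProfile(fetch_per_sample=0.000, compute_per_sample=0.004),  # local data, fast GPU
        WorkerProfile(fetch_per_sample=0.006, compute_per_sample=0.004),  # remote data
        WorkerProfile(fetch_per_sample=0.001, compute_per_sample=0.007),  # mostly local, slower GPU
    ]
    print(balance_batches(profiles, global_batch=512))  # -> [269, 108, 135]
```

The sketch ignores the accuracy constraint highlighted in the abstract: shrinking a straggler's batch too far can yield noisy (Byzantine) gradients, which is why LOFT pairs its batching scheme with Byzantine-tolerant aggregation over Ring All-Reduce.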
Source Journal: IEEE Transactions on Cloud Computing (Computer Science, Software)
CiteScore: 9.40
Self-citation rate: 6.20%
Articles published: 167
Journal introduction: The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.