Doing more by doing less: how structured partial backpropagation improves deep learning clusters

Adarsh Kumar, Kausik Subramanian, S. Venkataraman, Aditya Akella
{"title":"Doing more by doing less: how structured partial backpropagation improves deep learning clusters","authors":"Adarsh Kumar, Kausik Subramanian, S. Venkataraman, Aditya Akella","doi":"10.1145/3488659.3493778","DOIUrl":null,"url":null,"abstract":"Many organizations employ compute clusters equipped with accelerators such as GPUs and TPUs for training deep learning models in a distributed fashion. Training is resource-intensive, consuming significant compute, memory, and network resources. Many prior works explore how to reduce training resource footprint without impacting quality, but their focus on a subset of the bottlenecks (typically only the network) limits their ability to improve overall cluster utilization. In this work, we exploit the unique characteristics of deep learning workloads to propose Structured Partial Backpropagation(SPB), a technique that systematically controls the amount of backpropagation at individual workers in distributed training. This simultaneously reduces network bandwidth, compute utilization, and memory footprint while preserving model quality. To efficiently leverage the benefits of SPB at cluster level, we introduce Jigsaw, a SPB aware scheduler, which does scheduling at the iteration level for Deep Learning Training(DLT) jobs. We find that Jigsaw can improve large scale cluster efficiency by as high as 28%.","PeriodicalId":343000,"journal":{"name":"Proceedings of the 2nd ACM International Workshop on Distributed Machine Learning","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2nd ACM International Workshop on Distributed Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3488659.3493778","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Many organizations employ compute clusters equipped with accelerators such as GPUs and TPUs for training deep learning models in a distributed fashion. Training is resource-intensive, consuming significant compute, memory, and network resources. Many prior works explore how to reduce the training resource footprint without impacting quality, but their focus on a subset of the bottlenecks (typically only the network) limits their ability to improve overall cluster utilization. In this work, we exploit the unique characteristics of deep learning workloads to propose Structured Partial Backpropagation (SPB), a technique that systematically controls the amount of backpropagation performed at individual workers in distributed training. This simultaneously reduces network bandwidth, compute utilization, and memory footprint while preserving model quality. To efficiently leverage the benefits of SPB at the cluster level, we introduce Jigsaw, an SPB-aware scheduler that schedules Deep Learning Training (DLT) jobs at the iteration level. We find that Jigsaw can improve large-scale cluster efficiency by up to 28%.
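To make the core idea concrete, below is a minimal sketch of what "controlling the amount of backpropagation at a worker" can look like in PyTorch: on a given iteration, only the last `backprop_depth` layers participate in the backward pass, so earlier layers store no activations, compute no gradients, and contribute no gradient traffic. This is an illustration of the general technique only; the class and parameter names (`PartialBackpropModel`, `backprop_depth`) are hypothetical and not the paper's API, and the paper's actual SPB mechanism and Jigsaw's iteration-level scheduling policy may differ.

```python
# Illustrative sketch (not the authors' implementation) of truncating
# backpropagation at a chosen depth for one training iteration.
import contextlib

import torch
import torch.nn as nn

class PartialBackpropModel(nn.Module):
    """Toy MLP whose forward pass can limit how many of its final layers
    participate in backpropagation on a given iteration."""

    def __init__(self, layer_sizes):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(i, o) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])
        )

    def forward(self, x, backprop_depth):
        # Layers before `cutoff` run under no_grad: their activations are not
        # saved and their parameters receive no gradients this iteration.
        cutoff = len(self.layers) - backprop_depth
        for i, layer in enumerate(self.layers):
            ctx = torch.no_grad() if i < cutoff else contextlib.nullcontext()
            with ctx:
                x = layer(x)
                if i < len(self.layers) - 1:  # no activation on the output layer
                    x = torch.relu(x)
        return x

model = PartialBackpropModel([128, 256, 256, 10])
inputs = torch.randn(32, 128)
targets = torch.randint(0, 10, (32,))

# This iteration, backpropagate through only the last 2 of the 3 layers.
logits = model(inputs, backprop_depth=2)
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()

# Only the unfrozen layers accumulate gradients; frozen layers have no
# gradients to synchronize across workers.
for i, layer in enumerate(model.layers):
    print(f"layer {i}: has gradient -> {layer.weight.grad is not None}")
```

Because frozen layers produce no gradients, a worker running with a smaller `backprop_depth` uses less compute, less activation memory, and less network bandwidth for gradient exchange; the abstract describes Jigsaw as the cluster-level component that decides, per iteration and per DLT job, how to exploit this knob.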