Doing more by doing less: how structured partial backpropagation improves deep learning clusters

Adarsh Kumar, Kausik Subramanian, S. Venkataraman, Aditya Akella
{"title":"Doing more by doing less: how structured partial backpropagation improves deep learning clusters","authors":"Adarsh Kumar, Kausik Subramanian, S. Venkataraman, Aditya Akella","doi":"10.1145/3488659.3493778","DOIUrl":null,"url":null,"abstract":"Many organizations employ compute clusters equipped with accelerators such as GPUs and TPUs for training deep learning models in a distributed fashion. Training is resource-intensive, consuming significant compute, memory, and network resources. Many prior works explore how to reduce training resource footprint without impacting quality, but their focus on a subset of the bottlenecks (typically only the network) limits their ability to improve overall cluster utilization. In this work, we exploit the unique characteristics of deep learning workloads to propose Structured Partial Backpropagation(SPB), a technique that systematically controls the amount of backpropagation at individual workers in distributed training. This simultaneously reduces network bandwidth, compute utilization, and memory footprint while preserving model quality. To efficiently leverage the benefits of SPB at cluster level, we introduce Jigsaw, a SPB aware scheduler, which does scheduling at the iteration level for Deep Learning Training(DLT) jobs. We find that Jigsaw can improve large scale cluster efficiency by as high as 28%.","PeriodicalId":343000,"journal":{"name":"Proceedings of the 2nd ACM International Workshop on Distributed Machine Learning","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2nd ACM International Workshop on Distributed Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3488659.3493778","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Many organizations employ compute clusters equipped with accelerators such as GPUs and TPUs for training deep learning models in a distributed fashion. Training is resource-intensive, consuming significant compute, memory, and network resources. Many prior works explore how to reduce the training resource footprint without impacting quality, but their focus on a subset of the bottlenecks (typically only the network) limits their ability to improve overall cluster utilization. In this work, we exploit the unique characteristics of deep learning workloads to propose Structured Partial Backpropagation (SPB), a technique that systematically controls the amount of backpropagation performed at individual workers in distributed training. This simultaneously reduces network bandwidth, compute utilization, and memory footprint while preserving model quality. To efficiently leverage the benefits of SPB at the cluster level, we introduce Jigsaw, an SPB-aware scheduler that schedules Deep Learning Training (DLT) jobs at the iteration level. We find that Jigsaw can improve large-scale cluster efficiency by up to 28%.
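To make the core idea concrete, below is a minimal sketch of what "controlling the amount of backpropagation at a worker" can look like in PyTorch: on a given iteration, only the last `backprop_depth` layers participate in the backward pass, so earlier layers store no activations, compute no gradients, and contribute no gradient traffic. This is an illustration of the general technique only; the class and parameter names (`PartialBackpropModel`, `backprop_depth`) are hypothetical and not the paper's API, and the paper's actual SPB mechanism and Jigsaw's iteration-level scheduling policy may differ.

```python
# Illustrative sketch (not the authors' implementation) of truncating
# backpropagation at a chosen depth for one training iteration.
import contextlib

import torch
import torch.nn as nn

class PartialBackpropModel(nn.Module):
    """Toy MLP whose forward pass can limit how many of its final layers
    participate in backpropagation on a given iteration."""

    def __init__(self, layer_sizes):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(i, o) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])
        )

    def forward(self, x, backprop_depth):
        # Layers before `cutoff` run under no_grad: their activations are not
        # saved and their parameters receive no gradients this iteration.
        cutoff = len(self.layers) - backprop_depth
        for i, layer in enumerate(self.layers):
            ctx = torch.no_grad() if i < cutoff else contextlib.nullcontext()
            with ctx:
                x = layer(x)
                if i < len(self.layers) - 1:  # no activation on the output layer
                    x = torch.relu(x)
        return x

model = PartialBackpropModel([128, 256, 256, 10])
inputs = torch.randn(32, 128)
targets = torch.randint(0, 10, (32,))

# This iteration, backpropagate through only the last 2 of the 3 layers.
logits = model(inputs, backprop_depth=2)
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()

# Only the unfrozen layers accumulate gradients; frozen layers have no
# gradients to synchronize across workers.
for i, layer in enumerate(model.layers):
    print(f"layer {i}: has gradient -> {layer.weight.grad is not None}")
```

Because frozen layers produce no gradients, a worker running with a smaller `backprop_depth` uses less compute, less activation memory, and less network bandwidth for gradient exchange; the abstract describes Jigsaw as the cluster-level component that decides, per iteration and per DLT job, how to exploit this knob.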