PipeDream: generalized pipeline parallelism for DNN training

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, Matei Zaharia
{"title":"PipeDream: generalized pipeline parallelism for DNN training","authors":"D. Narayanan, A. Harlap, Amar Phanishayee, V. Seshadri, Nikhil R. Devanur, G. Ganger, Phillip B. Gibbons, M. Zaharia","doi":"10.1145/3341301.3359646","DOIUrl":null,"url":null,"abstract":"DNN training is extremely time-consuming, necessitating efficient multi-accelerator parallelization. Current approaches to parallelizing training primarily use intra-batch parallelization, where a single iteration of training is split over the available workers, but suffer from diminishing returns at higher worker counts. We present PipeDream, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible. Unlike traditional pipelining, DNN training is bi-directional, where a forward pass through the computation graph is followed by a backward pass that uses state and intermediate data computed during the forward pass. Naïve pipelining can thus result in mismatches in state versions used in the forward and backward passes, or excessive pipeline flushes and lower hardware efficiency. To address these challenges, PipeDream versions model parameters for numerically correct gradient computations, and schedules forward and backward passes of different minibatches concurrently on different workers with minimal pipeline stalls. PipeDream also automatically partitions DNN layers among workers to balance work and minimize communication. Extensive experimentation with a range of DNN tasks, models, and hardware configurations shows that PipeDream trains models to high accuracy up to 5.3X faster than commonly used intra-batch parallelism techniques.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"520","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3341301.3359646","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 520

Abstract

DNN training is extremely time-consuming, necessitating efficient multi-accelerator parallelization. Current approaches to parallelizing training primarily use intra-batch parallelization, where a single iteration of training is split over the available workers, but suffer from diminishing returns at higher worker counts. We present PipeDream, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible. Unlike traditional pipelining, DNN training is bi-directional, where a forward pass through the computation graph is followed by a backward pass that uses state and intermediate data computed during the forward pass. Naïve pipelining can thus result in mismatches in state versions used in the forward and backward passes, or excessive pipeline flushes and lower hardware efficiency. To address these challenges, PipeDream versions model parameters for numerically correct gradient computations, and schedules forward and backward passes of different minibatches concurrently on different workers with minimal pipeline stalls. PipeDream also automatically partitions DNN layers among workers to balance work and minimize communication. Extensive experimentation with a range of DNN tasks, models, and hardware configurations shows that PipeDream trains models to high accuracy up to 5.3X faster than commonly used intra-batch parallelism techniques.
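The weight versioning ("weight stashing") and concurrent forward/backward scheduling described in the abstract are the core mechanisms of the paper. As a rough illustration only, the Python toy below sketches how a single pipeline stage might stash the parameter version used by each minibatch's forward pass so that the matching backward pass computes gradients against that same version, even after later minibatches have updated the live weights. The class, the scalar one-weight "model," and the driver at the bottom are hypothetical simplifications for exposition, not PipeDream's actual implementation.

```python
# Toy sketch (assumption: one scalar weight per stage, y = w * x) of the
# weight-stashing idea: a backward pass reuses the parameter version its
# forward pass saw, while updates are applied to the live weights.

from collections import deque


class ToyStage:
    """One pipeline stage holding a scalar weight w for y = w * x."""

    def __init__(self, w=1.0, lr=0.1):
        self.w = w
        self.lr = lr
        # (minibatch id, stashed weight version, stashed input) for in-flight work
        self.stash = deque()

    def forward(self, mb_id, x):
        # Record the weight version (and input) this minibatch's forward pass used.
        self.stash.append((mb_id, self.w, x))
        return self.w * x

    def backward(self, mb_id, grad_out):
        # Retire the oldest in-flight minibatch; gradients are computed against
        # its stashed weight version, not the current one.
        stashed_id, w_version, x = self.stash.popleft()
        assert stashed_id == mb_id
        grad_w = grad_out * x          # d(w*x)/dw using the stashed input
        grad_x = grad_out * w_version  # gradient sent to the previous stage
        self.w -= self.lr * grad_w     # update the live weights
        return grad_x


if __name__ == "__main__":
    stage = ToyStage()
    # Two minibatches are in flight before the first backward pass arrives,
    # mimicking a pipelined schedule with overlapping forward and backward work.
    stage.forward(0, x=2.0)
    stage.forward(1, x=3.0)
    stage.backward(0, grad_out=1.0)  # uses the weight version seen by forward(0)
    stage.backward(1, grad_out=1.0)  # uses the version seen by forward(1)
    print("final weight:", stage.w)
```

The FIFO stash mirrors the in-order retirement of in-flight minibatches under a one-forward-one-backward style schedule; a real stage would hold layer tensors on an accelerator and exchange activations and gradients with neighboring workers rather than doing scalar arithmetic.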