{"title":"An Efficient 2D Method for Training Super-Large Deep Learning Models","authors":"Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You","doi":"10.1109/IPDPS54959.2023.00031","DOIUrl":null,"url":null,"abstract":"Since the rise of Transformer [22] and BERT [6], large language models [7], [12] have been proposed and shown unprecedented performance in tasks like translation, classification, and text generation. However, due to the memory constraint, model parallelism must be used to split the model across multiple processors. Inter-layer partition, intra-layer partition, and sparse activation are the major approaches to achieve model parallelism. Among them, inter-layer partition [10], [11] often requires the model to be explicitly expressed as a stack of sub-modules, the number of which equals to the number of processors, and would introduce either gradient staleness or bubble overhead; while the sparse activation [12] is primarily designed for Google TPU cluster and hard to deploy on GPU servers, intra-layer partition [17], especially Megatron-LM [18], can be easily deployed on GPU servers and has been adopted in subsequent works like Turing-NLG and M6. Though as pioneers of intra-layer parallelism, they still show memory redundancy and sub-optimal communication efficiency, which reveals the space for further improvements. In this work, we leverage SUMMA [21] and propose Optimus, a highly efficient and scalable paradigm for training super-large language models. In Optimus, activations and gradients are partitioned and distributed along processors all the way through forward and backward propagations, with hardly any memory redundancy. The isoefficiency of communication in pure model parallelism improves from W ~ p3 for Megatron-LM, to $W\\sim {(\\sqrt p \\log p)^3}$ for our Optimus. 
This framework is implemented with open-source deep learning framework, PyTorch, and consolidates existing techniques such as mixed precision training [13], activation checkpointing [5], and data parallelism. In experiments on TACC Frontera supercomputers, Optimus shows 1.48× the speed for training, 1.78× speed for inference, and 8× the maximum batch size over Megatron-LM on 64 GPUs in pure model parallelism; and 1.73× speed for training, 2.32× speed for inference with data parallelism size equaling 2 on 128 GPUs. In pure model parallelism, Optimus surpasses Megatron-LM in weak scaling efficiency by a great margin, and shows an extraordinary increasing strong scaling efficiency. Optimus would facilitate the scaling of language models and serve as a strong thrust in the space exploration of artificial intelligence.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"196 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS54959.2023.00031","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 23
Abstract
Since the rise of Transformer [22] and BERT [6], large language models [7], [12] have been proposed and have shown unprecedented performance in tasks like translation, classification, and text generation. However, due to memory constraints, model parallelism must be used to split the model across multiple processors. Inter-layer partitioning, intra-layer partitioning, and sparse activation are the major approaches to achieving model parallelism. Among them, inter-layer partitioning [10], [11] often requires the model to be explicitly expressed as a stack of sub-modules whose count equals the number of processors, and introduces either gradient staleness or bubble overhead. Sparse activation [12] is primarily designed for Google TPU clusters and is hard to deploy on GPU servers. Intra-layer partitioning [17], especially Megatron-LM [18], can be easily deployed on GPU servers and has been adopted in subsequent works like Turing-NLG and M6. Though pioneers of intra-layer parallelism, these approaches still exhibit memory redundancy and sub-optimal communication efficiency, which leaves room for further improvement. In this work, we leverage SUMMA [21] and propose Optimus, a highly efficient and scalable paradigm for training super-large language models. In Optimus, activations and gradients are partitioned and distributed across processors all the way through the forward and backward propagations, with hardly any memory redundancy. The isoefficiency of communication in pure model parallelism improves from $W \sim p^3$ for Megatron-LM to $W \sim (\sqrt{p} \log p)^3$ for our Optimus. This framework is implemented with the open-source deep learning framework PyTorch, and consolidates existing techniques such as mixed precision training [13], activation checkpointing [5], and data parallelism.
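The SUMMA algorithm that Optimus builds on can be illustrated in a few lines. Below is a minimal single-process sketch (names and structure are our own, not from the paper): a $\sqrt{p} \times \sqrt{p}$ process grid is simulated with nested NumPy blocks, and at each step $k$ the $k$-th block-column of $A$ is "broadcast" along grid rows and the $k$-th block-row of $B$ along grid columns, after which every grid position accumulates a local update.

```python
import numpy as np

def summa(A_blocks, B_blocks, q):
    """Multiply block-partitioned A and B on a simulated q x q process grid.

    A_blocks[i][j] holds A's (i, j) block; likewise for B_blocks. In real
    SUMMA each grid step performs two broadcasts (A's k-th block-column
    along rows, B's k-th block-row along columns); here the broadcasts are
    implicit since all blocks live in one address space.
    """
    # each grid position (i, j) owns one output block, initialized to zero
    C_blocks = [[np.zeros_like(A_blocks[0][0] @ B_blocks[0][0])
                 for _ in range(q)] for _ in range(q)]
    for k in range(q):
        for i in range(q):
            for j in range(q):
                # process (i, j) receives A[i][k] via the row broadcast and
                # B[k][j] via the column broadcast, then accumulates locally
                C_blocks[i][j] += A_blocks[i][k] @ B_blocks[k][j]
    return C_blocks
```

Because each process only ever holds one block of each operand plus the currently broadcast panels, memory per process shrinks as the grid grows, which is the property Optimus exploits for activations and gradients.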
In experiments on the TACC Frontera supercomputer, Optimus shows 1.48× the training speed, 1.78× the inference speed, and 8× the maximum batch size of Megatron-LM on 64 GPUs in pure model parallelism, and 1.73× the training speed and 2.32× the inference speed with a data-parallel degree of 2 on 128 GPUs. In pure model parallelism, Optimus surpasses Megatron-LM in weak scaling efficiency by a great margin, and shows a strong scaling efficiency that, remarkably, increases with processor count. Optimus would facilitate the scaling of language models and serve as a strong thrust in the exploration of artificial intelligence.
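The two isoefficiency functions quoted in the abstract can be compared directly. The quick arithmetic below is our own illustration (not from the paper), using base-2 logarithms; the base only affects constants, not the asymptotic comparison:

```python
import math

def iso_megatron(p):
    # W ~ p^3: problem size must grow cubically in the
    # processor count p to keep Megatron-LM's efficiency constant
    return p ** 3

def iso_optimus(p):
    # W ~ (sqrt(p) * log p)^3: the corresponding requirement for
    # Optimus (log base chosen as 2 here for concreteness)
    return (math.sqrt(p) * math.log2(p)) ** 3

# the required problem size grows much more slowly for Optimus,
# and the gap widens with p
for p in (16, 64, 256):
    print(p, iso_megatron(p) / iso_optimus(p))
```

At p = 64, for instance, $p^3 = 262144$ while $(\sqrt{p} \log_2 p)^3 = (8 \cdot 6)^3 = 110592$, and the ratio keeps growing with p, which is what "better isoefficiency" means in practice.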