ZeRO:面向训练万亿参数模型的内存优化

SC20: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2019-10-04 DOI:10.1109/SC41405.2020.00024

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He

{"title":"ZeRO:面向训练万亿参数模型的内存优化","authors":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He","doi":"10.1109/SC41405.2020.00024","DOIUrl":null,"url":null,"abstract":"Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today’s hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameter with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8. 3B and T5 11B) without requiring model parallelism which is harder for scientists to apply. Last but not the least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world’s largest language model at the time (17B parameters) with record breaking accuracy.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"125 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"563","resultStr":"{\"title\":\"ZeRO: Memory optimizations Toward Training Trillion Parameter Models\",\"authors\":\"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He\",\"doi\":\"10.1109/SC41405.2020.00024\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today’s hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameter with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8. 3B and T5 11B) without requiring model parallelism which is harder for scientists to apply. Last but not the least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world’s largest language model at the time (17B parameters) with record breaking accuracy.\",\"PeriodicalId\":424429,\"journal\":{\"name\":\"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis\",\"volume\":\"125 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"563\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SC41405.2020.00024\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC41405.2020.00024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 563

摘要

大型深度学习模型提供了显著的准确性提高，但训练数十亿到数万亿的参数是具有挑战性的。现有的解决方案，如数据和模型并行，在获得计算、通信和开发效率的同时，在将这些模型适应有限的设备内存方面表现出基本的局限性。我们开发了一种新颖的解决方案，零冗余优化器(Zero)，以优化内存，大大提高了训练速度，同时增加了可以有效训练的模型大小。ZeRO消除了数据和模型并行训练中的内存冗余，同时保持了低通信量和高计算粒度，使我们能够以持续的高效率按比例缩放模型大小。我们对内存需求和通信量的分析表明:使用当今的硬件，ZeRO有可能扩展到超过1万亿个参数。我们实现和评估ZeRO:它在400个gpu上以超线性加速训练超过100B参数的大型模型，实现15 Petaflops的吞吐量。这意味着模型尺寸增加了8倍，可实现性能增加了10倍。在可用性方面，ZeRO可以训练多达13B个参数的大型模型(例如，比威震天GPT 8更大)。3B和T5 (11B)，而不需要模型并行，这对科学家来说更难应用。最后但并非最不重要的是，研究人员利用ZeRO的系统突破创建了图灵- nlg，这是当时世界上最大的语言模型(17B个参数)，其准确性打破了纪录。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

ZeRO: Memory optimizations Toward Training Trillion Parameter Models

Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today’s hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameter with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8. 3B and T5 11B) without requiring model parallelism which is harder for scientists to apply. Last but not the least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world’s largest language model at the time (17B parameters) with record breaking accuracy.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

自引率

0.00%

发文量

期刊最新文献

CAB-MPI: Exploring Interprocess Work-Stealing towards Balanced MPI Communication Toward Realization of Numerical Towing-Tank Tests by Wall-Resolved Large Eddy Simulation based on 32 Billion Grid Finite-Element Computation Scalable yet Rigorous Floating-Point Error Analysis Scalable Knowledge Graph Analytics at 136 Petaflop/s BORA: A Bag Optimizer for Robotic Analysis