PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Proc. VLDB Endow. Pub Date : 2023-04-21 DOI:10.48550/arXiv.2304.11277

Yanli Zhao, A. Gu, R. Varma, Liangchen Luo, Chien-chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Y. Hao, Shen Li

{"title":"PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel","authors":"Yanli Zhao, A. Gu, R. Varma, Liangchen Luo, Chien-chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Y. Hao, Shen Li","doi":"10.48550/arXiv.2304.11277","DOIUrl":null,"url":null,"abstract":"It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"40 1","pages":"3848-3860"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"33","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. VLDB Endow.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2304.11277","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 33

Abstract

It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

PyTorch FSDP:扩展完全分片数据并行的经验

人们普遍认为，大型模型有潜力在广泛的领域提供卓越的性能。尽管在机器学习系统研究领域取得了显著进展，这使得大型模型的开发和探索成为可能，但这种能力仍然局限于一小部分高级用户和行业领导者，导致更广泛的社区访问和利用这些技术存在隐性的技术障碍。在本文中，我们介绍了PyTorch完全分片数据并行(FSDP)作为大型模型训练的工业级解决方案。FSDP与几个关键PyTorch核心组件紧密合作设计，包括张量实现，调度系统和CUDA内存缓存分配器，以提供非侵入式用户体验和高培训效率。此外，FSDP集成了一系列技术和设置，可以优化各种硬件配置的资源利用率。实验结果表明，FSDP能够实现与分布式数据并行相当的性能，同时在TFLOPS方面为具有近线性可扩展性的更大模型提供支持。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proc. VLDB Endow.

自引率

0.00%

发文量

期刊最新文献

Cryptographically Secure Private Record Linkage Using Locality-Sensitive Hashing Utility-aware Payment Channel Network Rebalance Relational Query Synthesis ⋈ Decision Tree Learning Billion-Scale Bipartite Graph Embedding: A Global-Local Induced Approach Query Refinement for Diversity Constraint Satisfaction