Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent

Xiaonan Nie, Yi Liu, Fangcheng Fu, J. Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, Bin Cui
{"title":"Angel-PTM:一种可扩展且经济的腾讯大规模预训练系统","authors":"Xiaonan Nie, Yi Liu, Fangcheng Fu, J. Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, Bin Cui","doi":"10.48550/arXiv.2303.02868","DOIUrl":null,"url":null,"abstract":"\n Recent years have witnessed the unprecedented achievements of large-scale pre-trained models, especially Transformer models. Many products and services in Tencent Inc., such as WeChat, QQ, and Tencent Advertisement, have been opted in to gain the power of pre-trained models. In this work, we present Angel-PTM, a productive deep learning system designed for pre-training and fine-tuning Transformer models. Angel-PTM can train extremely large-scale models with hierarchical memory efficiently. The key designs of Angel-PTM are a fine-grained memory management via the\n Page\n abstraction and a unified scheduling method that coordinates computations, data movements, and communications. Furthermore, Angel-PTM supports extreme model scaling with SSD storage and implements a lock-free updating mechanism to address the SSD I/O bottlenecks. Experimental results demonstrate that Angel-PTM outperforms existing systems by up to 114.8% in terms of maximum model scale as well as up to 88.9% in terms of training throughput. Additionally, experiments on GPT3-175B and T5-MoE-1.2T models utilizing hundreds of GPUs verify our strong scalability.\n","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent\",\"authors\":\"Xiaonan Nie, Yi Liu, Fangcheng Fu, J. Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, Bin Cui\",\"doi\":\"10.48550/arXiv.2303.02868\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"\\n Recent years have witnessed the unprecedented achievements of large-scale pre-trained models, especially Transformer models. Many products and services in Tencent Inc., such as WeChat, QQ, and Tencent Advertisement, have been opted in to gain the power of pre-trained models. In this work, we present Angel-PTM, a productive deep learning system designed for pre-training and fine-tuning Transformer models. Angel-PTM can train extremely large-scale models with hierarchical memory efficiently. The key designs of Angel-PTM are a fine-grained memory management via the\\n Page\\n abstraction and a unified scheduling method that coordinates computations, data movements, and communications. Furthermore, Angel-PTM supports extreme model scaling with SSD storage and implements a lock-free updating mechanism to address the SSD I/O bottlenecks. Experimental results demonstrate that Angel-PTM outperforms existing systems by up to 114.8% in terms of maximum model scale as well as up to 88.9% in terms of training throughput. Additionally, experiments on GPT3-175B and T5-MoE-1.2T models utilizing hundreds of GPUs verify our strong scalability.\\n\",\"PeriodicalId\":20467,\"journal\":{\"name\":\"Proc. VLDB Endow.\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-03-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proc. 
VLDB Endow.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2303.02868\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. VLDB Endow.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2303.02868","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Recent years have witnessed the unprecedented achievements of large-scale pre-trained models, especially Transformer models. Many products and services in Tencent Inc., such as WeChat, QQ, and Tencent Advertisement, have opted in to harness the power of pre-trained models. In this work, we present Angel-PTM, a production deep learning system designed for pre-training and fine-tuning Transformer models. Angel-PTM can efficiently train extremely large-scale models over hierarchical memory. The key designs of Angel-PTM are fine-grained memory management via the Page abstraction and a unified scheduling method that coordinates computations, data movements, and communications. Furthermore, Angel-PTM supports extreme model scaling with SSD storage and implements a lock-free updating mechanism to address SSD I/O bottlenecks. Experimental results demonstrate that Angel-PTM outperforms existing systems by up to 114.8% in terms of maximum model scale and by up to 88.9% in terms of training throughput. Additionally, experiments on GPT3-175B and T5-MoE-1.2T models utilizing hundreds of GPUs verify its strong scalability.
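The abstract names a Page abstraction over hierarchical memory (GPU, host, SSD) but gives no implementation detail. The following is a minimal, hypothetical Python sketch of what a page-based hierarchical memory manager might look like; the names (Tier, Page, PageManager), the per-tier byte budgets, and the eviction policy are all illustrative assumptions, not Angel-PTM's actual design.

```python
# A minimal, hypothetical sketch of page-based hierarchical memory management.
import threading
from enum import Enum

class Tier(Enum):
    GPU = 0   # fastest, smallest
    CPU = 1   # host memory
    SSD = 2   # largest, slowest

class Page:
    """A fixed-size chunk of tensor storage that can migrate across tiers."""
    def __init__(self, page_id: int, nbytes: int, tier: Tier):
        self.page_id = page_id
        self.nbytes = nbytes
        self.tier = tier

class PageManager:
    """Tracks pages per tier and demotes pages when a tier's budget is exceeded."""
    def __init__(self, budgets: dict):
        self.budgets = dict(budgets)   # bytes available on each tier
        self.used = {t: 0 for t in Tier}
        self.pages = {}                # page_id -> Page
        self.lock = threading.Lock()

    def allocate(self, page_id: int, nbytes: int, tier: Tier = Tier.GPU) -> Page:
        with self.lock:
            while self.used[tier] + nbytes > self.budgets[tier]:
                self._demote_one(tier)
            page = Page(page_id, nbytes, tier)
            self.pages[page_id] = page
            self.used[tier] += nbytes
            return page

    def _demote_one(self, tier: Tier) -> None:
        # Naive eviction: move the first page found on this tier one level down.
        # A real system would pick victims by access pattern, copy the data
        # asynchronously, and cascade evictions; SSD, the bottom tier, cannot
        # demote further.
        victim = next((p for p in self.pages.values() if p.tier == tier), None)
        if victim is None or tier is Tier.SSD:
            raise MemoryError(f"out of space on {tier.name}")
        lower = Tier(tier.value + 1)
        self.used[tier] -= victim.nbytes
        self.used[lower] += victim.nbytes
        victim.tier = lower

# Example: a 1 GiB GPU budget forces the first page down to host memory
# when the second allocation arrives.
mgr = PageManager({Tier.GPU: 1 << 30, Tier.CPU: 8 << 30, Tier.SSD: 64 << 30})
mgr.allocate(0, 768 << 20)   # 768 MiB -> fits on GPU
mgr.allocate(1, 512 << 20)   # would overflow GPU, so page 0 is demoted to CPU
```

The single lock here is the simplest correct choice for a sketch; the lock-free updating mechanism the abstract describes for SSD-resident parameters would replace exactly this kind of serialization on the update path.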