TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading

Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, Wen-mei Hwu
arXiv - CS - Neural and Evolutionary Computing, published 2024-08-19. DOI: arxiv-2408.10013 (https://doi.org/arxiv-2408.10013)

Abstract

GPU memory capacity has not grown as fast as the size of large language models (LLMs), which hinders model training. In particular, activations (the intermediate tensors produced during forward propagation and reused in backward propagation) dominate GPU memory use. To address this challenge, we propose TBA, which efficiently offloads activations to high-capacity NVMe SSDs. This approach reduces GPU memory usage without impacting performance by adaptively overlapping data transfers with computation. TBA is compatible with popular deep learning frameworks such as PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication, forwarding, and adaptive offloading to further improve efficiency. We conduct extensive experiments on GPT, BERT, and T5. Results demonstrate that TBA reduces activation peak memory usage by 47%. At the same time, TBA perfectly overlaps the I/O with the computation and incurs negligible performance overhead. We introduce the recompute-offload-keep (ROK) curve to compare TBA offloading with two other tensor placement strategies: keeping activations in memory and layerwise full recomputation. We find that TBA achieves better memory savings than layerwise full recomputation while retaining the performance of keeping the activations in memory.
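As a rough illustration of overlapping offload I/O with computation, the pure-Python sketch below hands each layer's activation to a background writer during a toy forward pass and reads the activations back in reverse order for a toy backward pass. It is not TBA's implementation: `DiskOffloader`, `offload`, `reload`, and `run_demo` are hypothetical names, byte strings stand in for GPU tensors, a temporary directory stands in for an NVMe SSD, and the key-based duplicate check is only an assumed stand-in for tensor deduplication.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


class DiskOffloader:
    """Toy activation store that writes buffers to files in the background."""

    def __init__(self, directory):
        self.directory = directory
        self.pool = ThreadPoolExecutor(max_workers=2)  # background I/O workers
        self.pending = {}  # key -> Future of the in-flight write
        self.paths = {}    # key -> file path on disk

    @staticmethod
    def _write(path, data):
        with open(path, "wb") as f:
            f.write(data)

    def offload(self, key, data):
        """Start an asynchronous write; return False if the key was already offloaded."""
        if key in self.paths:  # assumed deduplication rule: one write per key
            return False
        path = os.path.join(self.directory, f"{key}.bin")
        self.paths[key] = path
        self.pending[key] = self.pool.submit(self._write, path, bytes(data))
        return True

    def reload(self, key):
        """Wait for any in-flight write of this key, then read the buffer back."""
        future = self.pending.pop(key, None)
        if future is not None:
            future.result()
        with open(self.paths[key], "rb") as f:
            return f.read()

    def close(self):
        self.pool.shutdown(wait=True)


def run_demo(num_layers=4):
    """Return True if every offloaded activation is restored unchanged."""
    with tempfile.TemporaryDirectory() as tmp:
        store = DiskOffloader(tmp)
        reference = {}  # kept in memory only to check correctness
        x = bytes(range(8))
        # Toy forward pass: compute continues while earlier writes are in flight.
        for layer in range(num_layers):
            x = bytes((b + 1) % 256 for b in x)  # stand-in for layer compute
            store.offload(f"layer{layer}", x)
            reference[layer] = x
        # Toy backward pass: activations are needed in reverse layer order.
        order = list(reversed(range(num_layers)))
        restored = [store.reload(f"layer{i}") for i in order]
        store.close()
        return restored == [reference[i] for i in order]
```

Calling `run_demo()` exercises the whole round trip. In a real training setup the writes would move tensors from GPU memory to storage, and the scheduling would have to keep the transfers hidden behind computation, which this sketch does not attempt to model.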