TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading
Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, Wen-mei Hwu
arXiv:2408.10013, published 2024-08-19
Abstract
The growth rate of GPU memory capacity has not kept up with that of large language model (LLM) sizes, hindering the model training process. In particular, activations, the intermediate tensors produced during forward propagation and reused in backward propagation, dominate GPU memory use. To address this challenge, we propose TBA to efficiently offload activations to high-capacity NVMe SSDs. This approach reduces GPU memory usage without impacting performance by adaptively overlapping data transfers with computation. TBA is compatible with popular deep learning frameworks such as PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication, forwarding, and adaptive offloading to further enhance efficiency. We conduct extensive experiments on GPT, BERT, and T5. The results demonstrate that TBA reduces peak activation memory usage by 47%. At the same time, TBA fully overlaps the I/O with the computation and incurs negligible performance overhead. We introduce the recompute-offload-keep (ROK) curve to compare TBA offloading with two other tensor placement strategies: keeping activations in memory and layerwise full recomputation. We find that TBA achieves greater memory savings than layerwise full recomputation while retaining the performance of keeping activations in memory.
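
To make the offloading mechanism concrete, the sketch below shows how activations can be redirected to SSD-backed storage using PyTorch's public saved-tensor hooks. This is only an illustrative sketch under simplifying assumptions, not the TBA implementation: TBA transfers activations to NVMe SSDs asynchronously and adaptively so that I/O overlaps with computation, whereas the hooks here offload synchronously to a directory that is merely assumed to reside on an NVMe SSD.

# Illustrative sketch only: offload activations saved for backward to disk
# via torch.autograd.graph.saved_tensors_hooks. Not the TBA implementation;
# TBA overlaps asynchronous NVMe transfers with computation, while this
# version writes and reads synchronously for clarity.
import os
import tempfile
import uuid

import torch
import torch.nn as nn

# Assumed to be a path on an NVMe SSD.
OFFLOAD_DIR = tempfile.mkdtemp(prefix="activation_offload_")

def pack_to_ssd(tensor: torch.Tensor):
    """Save an activation to disk; return a handle for later retrieval."""
    path = os.path.join(OFFLOAD_DIR, f"{uuid.uuid4().hex}.pt")
    torch.save(tensor.detach().cpu(), path)
    return (path, tensor.device)

def unpack_from_ssd(handle):
    """Reload the activation from disk when backward propagation needs it."""
    path, device = handle
    tensor = torch.load(path)
    os.remove(path)
    return tensor.to(device)

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Every tensor autograd would keep for backward is routed through the
# pack/unpack hooks, so activations live on disk between the two passes.
with torch.autograd.graph.saved_tensors_hooks(pack_to_ssd, unpack_from_ssd):
    loss = model(x).sum()
loss.backward()

In a full system such as TBA, the pack step would instead enqueue an asynchronous write on a dedicated I/O path and activations would be prefetched ahead of the backward pass, so the disk traffic hides behind GPU computation rather than stalling it.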