Exploiting Input Tensor Dynamics in Activation Checkpointing for Efficient Training on GPU

Jian Liao, Mingzhen Li, Hailong Yang, Qingxiao Sun, Biao Sun, Jiwei Hao, Tianyu Feng, F. Yu, Shengdong Chen, Ye Tao, Zicheng Zhang, Zhongzhi Luan, D. Qian
{"title":"利用激活检查点中的输入张量动力学实现GPU的高效训练","authors":"Jian Liao, Mingzhen Li, Hailong Yang, Qingxiao Sun, Biao Sun, Jiwei Hao, Tianyu Feng, F. Yu, Shengdong Chen, Ye Tao, Zicheng Zhang, Zhongzhi Luan, D. Qian","doi":"10.1109/IPDPS54959.2023.00025","DOIUrl":null,"url":null,"abstract":"Larger deep learning models usually lead to higher model quality, however with an ever-increasing GPU memory footprint. Although several tensor checkpointing techniques have been proposed to enable training under a restricted GPU memory budget, they fail to exploit the input tensor dynamics due to diverse datasets and subsequent data augmentation, and thus leave the training optimization on table. In this paper, we propose Mimose, an input-aware tensor checkpointing planner respecting the memory budget while enabling efficient model training on GPU. Mimose builds a lightweight but accurate prediction model of GPU memory usage online, without pre-analyzing the model. It generates a tensor checkpointing plan based on per-layer memory prediction and applies it to the training process on the fly. Our experiments show that Mimose achieves superior training throughput compared to state-of-the-art checkpointing frameworks under the same GPU memory budgets.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploiting Input Tensor Dynamics in Activation Checkpointing for Efficient Training on GPU\",\"authors\":\"Jian Liao, Mingzhen Li, Hailong Yang, Qingxiao Sun, Biao Sun, Jiwei Hao, Tianyu Feng, F. Yu, Shengdong Chen, Ye Tao, Zicheng Zhang, Zhongzhi Luan, D. Qian\",\"doi\":\"10.1109/IPDPS54959.2023.00025\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Larger deep learning models usually lead to higher model quality, however with an ever-increasing GPU memory footprint. Although several tensor checkpointing techniques have been proposed to enable training under a restricted GPU memory budget, they fail to exploit the input tensor dynamics due to diverse datasets and subsequent data augmentation, and thus leave the training optimization on table. In this paper, we propose Mimose, an input-aware tensor checkpointing planner respecting the memory budget while enabling efficient model training on GPU. Mimose builds a lightweight but accurate prediction model of GPU memory usage online, without pre-analyzing the model. It generates a tensor checkpointing plan based on per-layer memory prediction and applies it to the training process on the fly. 
Our experiments show that Mimose achieves superior training throughput compared to state-of-the-art checkpointing frameworks under the same GPU memory budgets.\",\"PeriodicalId\":343684,\"journal\":{\"name\":\"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"volume\":\"37 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS54959.2023.00025\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS54959.2023.00025","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Larger deep learning models usually lead to higher model quality, albeit with an ever-increasing GPU memory footprint. Although several tensor checkpointing techniques have been proposed to enable training under a restricted GPU memory budget, they fail to exploit the input tensor dynamics that arise from diverse datasets and subsequent data augmentation, and thus leave training optimizations on the table. In this paper, we propose Mimose, an input-aware tensor checkpointing planner that respects the memory budget while enabling efficient model training on GPU. Mimose builds a lightweight but accurate prediction model of GPU memory usage online, without pre-analyzing the model. It generates a tensor checkpointing plan based on per-layer memory prediction and applies it to the training process on the fly. Our experiments show that Mimose achieves superior training throughput compared to state-of-the-art checkpointing frameworks under the same GPU memory budgets.
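The abstract outlines the core mechanism: predict per-layer memory usage from the current input and derive a checkpointing plan on the fly. The sketch below is a minimal, hypothetical PyTorch illustration of that idea only; it is not the authors' Mimose implementation, and the cost estimate `block_activation_bytes`, the greedy planner, and the class `InputAwareCheckpointedMLP` are simplified names and assumptions introduced for illustration.

```python
# Hypothetical sketch of input-aware selective activation checkpointing
# (illustration only; NOT the authors' Mimose implementation).
# Idea: estimate per-block activation memory from the *current* input shape,
# then greedily checkpoint the most expensive blocks until the estimate
# fits a given GPU memory budget. The plan is rebuilt every step, so it
# adapts to input tensor dynamics such as varying batch sizes.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


def block_activation_bytes(block: nn.Sequential, batch: int, dtype_bytes: int = 4) -> int:
    # Rough assumption: checkpointing a (Linear -> ReLU) block lets us drop
    # roughly one intermediate activation of shape (batch, out_features).
    return batch * block[0].out_features * dtype_bytes


class InputAwareCheckpointedMLP(nn.Module):
    def __init__(self, sizes, memory_budget_bytes: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU())
             for i in range(len(sizes) - 1)]
        )
        self.memory_budget_bytes = memory_budget_bytes

    def plan(self, batch: int) -> set:
        # Build a checkpointing plan for the current batch size.
        costs = [block_activation_bytes(b, batch) for b in self.blocks]
        total = sum(costs)
        chosen = set()
        # Greedily checkpoint the largest blocks until the estimate fits the budget.
        for idx in sorted(range(len(costs)), key=costs.__getitem__, reverse=True):
            if total <= self.memory_budget_bytes:
                break
            total -= costs[idx]
            chosen.add(idx)
        return chosen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chosen = self.plan(batch=x.shape[0])  # re-planned on the fly, per input
        for idx, block in enumerate(self.blocks):
            if self.training and idx in chosen:
                # Recompute this block's activations in backward instead of storing them.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


if __name__ == "__main__":
    model = InputAwareCheckpointedMLP([1024, 4096, 4096, 10], memory_budget_bytes=8 * 2**20)
    model.train()
    # Different input sizes yield different checkpointing plans.
    for batch in (16, 256):
        loss = model(torch.randn(batch, 1024)).sum()
        loss.backward()
```

The point the sketch mirrors is that the plan is a function of the current input: a small batch may need no recomputation at all, while a large batch trades extra forward compute for a lower activation footprint within the same memory budget.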