Optimal Re-Materialization Strategies for Heterogeneous Chains: How to Train Deep Neural Networks with Limited Memory

ACM Transactions on Mathematical Software · Impact Factor 2.7 · CAS Region 1 (Mathematics) · JCR Q2 (Computer Science, Software Engineering) · Publication date: 2024-03-05 · DOI: 10.1145/3648633
Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Alexis Joly, Alena Shilova
{"title":"Optimal Re-Materialization Strategies for Heterogeneous Chains: How to Train Deep Neural Networks with Limited Memory","authors":"Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Alexis Joly, Alena Shilova","doi":"10.1145/3648633","DOIUrl":null,"url":null,"abstract":"<p>Training in Feed Forward Deep Neural Networks is a memory-intensive operation which is usually performed on GPUs with limited memory capacities. This may force data scientists to limit the depth of the models or the resolution of the input data if data does not fit in the GPU memory. The re-materialization technique, whose idea comes from the checkpointing strategies developed in the Automatic Differentiation literature, allows data scientists to limit the memory requirements related to the storage of intermediate data (activations), at the cost of an increase in the computational cost.</p><p>This paper introduces a new strategy of re-materialization of activations that significantly reduces memory usage. It consists in selecting which activations are saved and which activations are deleted during the forward phase, and then recomputing the deleted activations when they are needed during the backward phase.</p><p>We propose an original computation model that combines two types of activation savings: either only storing the layer inputs, or recording the complete history of operations that produced the outputs. This paper focuses on the fully heterogeneous case, where the computation time and the memory requirement of each layer is different. We prove that finding the optimal solution is NP-hard and that classical techniques from Automatic Differentiation literature do not apply. Moreover, the classical assumption of memory persistence of materialized activations, used to simplify the search of optimal solutions, does not hold anymore. Thus, we propose a weak memory persistence property and provide a Dynamic Program to compute the optimal sequence of computations.</p><p>This algorithm is made available through the <span>Rotor</span> software, a PyTorch plug-in dealing with any network consisting of a sequence of layers, each of them having an arbitrarily complex structure. Through extensive experiments, we show that our implementation consistently outperforms existing re-materialization approaches for a large class of networks, image sizes and batch sizes.</p>","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"44 1","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Mathematical Software","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3648633","RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

Training feed-forward deep neural networks is a memory-intensive operation that is usually performed on GPUs with limited memory capacity. When the intermediate data does not fit in GPU memory, data scientists may be forced to limit the depth of their models or the resolution of the input data. The re-materialization technique, whose idea comes from the checkpointing strategies developed in the Automatic Differentiation literature, allows data scientists to limit the memory required to store intermediate data (activations), at the cost of additional computation.

This paper introduces a new re-materialization strategy for activations that significantly reduces memory usage. The strategy selects which activations are saved and which are discarded during the forward phase, and then recomputes the discarded activations when they are needed during the backward phase.
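For orientation only: PyTorch already exposes a basic form of this discard-and-recompute idea through torch.utils.checkpoint. The short sketch below, built on an arbitrary toy block that is not taken from the paper, shows a single block whose internal activations are dropped during the forward pass and recomputed during the backward pass; the paper's contribution is deciding which activations to treat this way along a heterogeneous chain.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A toy block whose intermediate activations we choose not to keep.
block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

x = torch.randn(8, 1024, requires_grad=True)

# Forward pass: only the input of `block` is stored; its internal
# activations are discarded and recomputed during the backward pass.
y = checkpoint(block, x, use_reentrant=False)
loss = y.sum()
loss.backward()   # triggers one extra forward pass through `block`
```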

We propose an original computation model that combines two types of activation savings: either storing only the layer inputs, or recording the complete history of operations that produced the outputs. This paper focuses on the fully heterogeneous case, where the computation time and the memory requirement of each layer differ. We prove that finding the optimal solution is NP-hard and that classical techniques from the Automatic Differentiation literature do not apply. Moreover, the classical assumption of memory persistence of materialized activations, used to simplify the search for optimal solutions, no longer holds. Thus, we propose a weak memory persistence property and provide a Dynamic Program to compute the optimal sequence of computations.
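The dynamic program in the paper accounts for both storage types and the weak memory persistence property, and is considerably more involved. Purely as an illustration of what a chain-structured dynamic program over heterogeneous per-layer costs looks like, here is a minimal memoized recursion for a deliberately simplified model (a single storage type, full memory persistence, and made-up costs); it is not the algorithm from the paper.

```python
import math
from functools import lru_cache

# Illustrative (made-up) costs for a heterogeneous chain of L = 6 layers.
uf = (2, 5, 3, 8, 1, 4)       # forward time of layer i
ub = (4, 9, 6, 15, 2, 7)      # backward time of layer i
w  = (3, 6, 2, 4, 5, 1, 2)    # memory size of activation x_i, i = 0..L
L = len(uf)

@lru_cache(maxsize=None)
def opt(i, l, m):
    """Minimal compute time to backpropagate through layers i..l, given that
    the input activation x_i is already held (outside the budget m)."""
    if i == l:                           # single layer: backward it directly
        return ub[i]
    best = math.inf
    # Pick the next activation x_j to materialize and keep (i < j <= l).
    for j in range(i + 1, l + 1):
        if w[j] > m:                     # not enough memory to keep x_j
            continue
        forward_cost = sum(uf[i:j])      # recompute x_{i+1}, ..., x_j from x_i
        right = opt(j, l, m - w[j])      # backward of the right sub-chain
        left = opt(i, j - 1, m)          # x_j is freed, so the budget is back
        best = min(best, forward_cost + right + left)
    return best

print(opt(0, L - 1, 10))                 # optimal time with 10 memory units
```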

This algorithm is made available through the Rotor software, a PyTorch plug-in that handles any network consisting of a sequence of layers, each of which may have an arbitrarily complex internal structure. Through extensive experiments, we show that our implementation consistently outperforms existing re-materialization approaches for a large class of networks, image sizes, and batch sizes.
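Rotor's own interface is not reproduced here. As a rough point of comparison, PyTorch's built-in checkpoint_sequential applies fixed, evenly spaced re-materialization to a sequence of layers; the sketch below (with an invented toy model and sizes) shows that baseline behaviour, which the per-layer-aware strategy in the paper is designed to outperform.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A chain of layers expressed as nn.Sequential (the structure Rotor targets).
model = nn.Sequential(
    nn.Linear(512, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 512),
)

x = torch.randn(16, 512, requires_grad=True)

# Cut the chain into 3 even segments: only segment boundaries are stored,
# activations inside each segment are recomputed during the backward pass.
y = checkpoint_sequential(model, 3, x, use_reentrant=False)
y.sum().backward()
```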

Source journal: ACM Transactions on Mathematical Software (Engineering & Technology – Computer Science: Software Engineering)
CiteScore: 5.00
Self-citation rate: 3.70%
Articles per year: 50
Review time: >12 weeks
Journal description: As a scientific journal, ACM Transactions on Mathematical Software (TOMS) documents the theoretical underpinnings of numeric, symbolic, algebraic, and geometric computing applications. It focuses on analysis and construction of algorithms and programs, and the interaction of programs and architecture. Algorithms documented in TOMS are available as the Collected Algorithms of the ACM at calgo.acm.org.
Latest articles in this journal:
- Algorithm xxx: A Covariate-Dependent Approach to Gaussian Graphical Modeling in R
- Remark on Algorithm 1012: Computing projections with large data sets
- PyOED: An Extensible Suite for Data Assimilation and Model-Constrained Optimal Design of Experiments
- Avoiding breakdown in incomplete factorizations in low precision arithmetic
- Algorithm xxx: PyGenStability, a multiscale community detection with generalized Markov Stability