EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training

Yulin Wang, Yang Yue, Rui Lu, Yizeng Han, Shiji Song, Gao Huang
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
DOI: 10.1109/TPAMI.2024.3401036
Published: 2024-12-01 (Epub 2024-11-06)
Citations: 0

Abstract

Code: https://github.com/LeapLabTHU/EfficientTrain

The superior performance of modern computer vision backbones (e.g., vision Transformers learned on ImageNet-1K/22K) usually comes with a costly training procedure. This study contributes to this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize some 'easier-to-learn' discriminative patterns in the data. These patterns, when observed through frequency and spatial domains, incorporate lower-frequency components, and the natural image contents without distortion or data augmentation. Motivated by these findings, we propose a curriculum where the model always leverages all the training data at every learning stage, yet the exposure to the 'easier-to-learn' patterns of each example is initiated first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. Then we show that exposing the contents of natural images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these two aspects and design curriculum learning schedules by proposing tailored searching algorithms. Moreover, we present useful techniques for deploying our approach efficiently in challenging practical scenarios, such as large-scale parallel training, and limited input/output or data pre-processing speed. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. As an off-the-shelf approach, it reduces the training time of various popular models (e.g., ResNet, ConvNeXt, DeiT, PVT, Swin, CSWin, and CAFormer) by [Formula: see text] on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).
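The two curriculum ingredients in the abstract, cropping the Fourier spectrum to keep only lower-frequency components and gradually raising the data-augmentation intensity, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the linear ramp, and the default values (`min_bandwidth=0.5`, a RandAugment-style magnitude capped at 9) are assumptions for illustration; the paper derives its actual schedules with tailored search algorithms.

```python
import numpy as np

def low_freq_crop(image, bandwidth):
    """Crop the centred 2-D Fourier spectrum of `image` to the central
    `bandwidth` fraction per axis, then invert on the smaller grid.
    The result is a smaller image containing only the lower-frequency
    components, which is what makes early training epochs cheaper."""
    h, w = image.shape
    bh, bw = int(round(h * bandwidth)), int(round(w * bandwidth))
    # Centre the spectrum so low frequencies sit in the middle.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    top, left = (h - bh) // 2, (w - bw) // 2
    cropped = spectrum[top:top + bh, left:left + bw]
    # Invert on the smaller grid; rescale so intensities stay comparable.
    small = np.fft.ifft2(np.fft.ifftshift(cropped)) * (bh * bw) / (h * w)
    return small.real

def curriculum(step, total_steps, min_bandwidth=0.5, max_magnitude=9.0):
    """Illustrative linear schedule: ramp the retained frequency
    bandwidth and the augmentation intensity up as training progresses
    (easy-to-learn patterns first, harder patterns later)."""
    frac = step / max(total_steps - 1, 1)
    bandwidth = min_bandwidth + (1.0 - min_bandwidth) * frac
    magnitude = max_magnitude * frac  # e.g. a RandAugment-style magnitude
    return bandwidth, magnitude
```

With `bandwidth=1.0` the crop is a no-op and the original image is recovered; with `bandwidth=0.5` a 64x64 input becomes a 32x32 low-passed image, so early-stage forward/backward passes run on smaller tensors.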
