Evolved Hierarchical Masking for Self-Supervised Learning.

IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-11-04. DOI: 10.1109/TPAMI.2024.3490776

Zhanzhou Feng, Shiliang Zhang


Citations: 0

Abstract

Existing Masked Image Modeling (MIM) methods apply fixed mask patterns to guide self-supervised training. Because those patterns rely on different criteria to depict image content, sticking to any single fixed pattern limits the model's capability to model visual cues. This paper introduces an evolved hierarchical masking method that pursues general visual cue modeling in self-supervised learning. The proposed method uses the vision model being trained to parse the input visual cues into a hierarchical structure, which is then used to generate masks. The accuracy of the hierarchy is on par with the capability of the model being trained, so the mask patterns evolve across training stages. Initially, the generated masks focus on low-level visual cues to grasp basic textures; they then gradually evolve to depict higher-level cues, reinforcing the learning of more complicated object semantics and contexts. The method requires no extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks, including partial-duplicate image retrieval, which relies on low-level details, as well as image classification and semantic segmentation, which require semantic parsing capability. Experimental results demonstrate that the method substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1% in ImageNet-1K classification and 1.4% in ADE20K segmentation with the same number of training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantically demanding tasks and enhances the perception of intricate details in tasks requiring low-level feature recognition.
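The abstract gives no algorithmic details, so the following Python sketch is only an illustration of the general idea, not the authors' implementation. It assumes, hypothetically, that per-patch feature vectors from the model being trained are grouped by greedy agglomerative merging into a hierarchy, that fine partitions stand in for low-level cues and coarse partitions for higher-level cues, and that the hierarchy level used for masking advances with training progress. The function names (`build_hierarchy`, `select_level`, `make_mask`) and the linear schedule are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(features, members):
    """Mean feature vector of the patches listed in `members`."""
    dim = len(features[0])
    return [sum(features[i][d] for i in members) / len(members) for d in range(dim)]


def build_hierarchy(features, min_clusters=2):
    """Assumed stand-in for hierarchy parsing: greedily merge the two
    clusters with the closest centroids. Returns a list of partitions,
    ordered from finest (one patch per cluster) to coarsest."""
    clusters = [[i] for i in range(len(features))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > min_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(centroid(features, clusters[i]),
                            centroid(features, clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        levels.append([list(c) for c in clusters])
    return levels


def select_level(progress, num_levels):
    """Hypothetical linear schedule: training progress in [0, 1] maps to a
    hierarchy level, fine partitions early and coarse partitions late."""
    progress = min(max(progress, 0.0), 1.0)
    return min(int(progress * num_levels), num_levels - 1)


def make_mask(partition, mask_ratio, rng):
    """Mask whole clusters in random order until the target number of
    patches is reached; the last cluster may be masked only partially."""
    num_patches = sum(len(c) for c in partition)
    target = round(mask_ratio * num_patches)
    order = list(partition)
    rng.shuffle(order)
    masked = []
    for cluster in order:
        room = target - len(masked)
        if room <= 0:
            break
        masked.extend(cluster[:room])
    mask = [False] * num_patches
    for idx in masked:
        mask[idx] = True
    return mask


if __name__ == "__main__":
    # Toy 1-D "features" for 8 patches: two visually distinct groups.
    feats = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
    levels = build_hierarchy(feats)
    rng = random.Random(0)
    for progress in (0.0, 0.5, 1.0):
        level = select_level(progress, len(levels))
        print(progress, level, make_mask(levels[level], 0.75, rng))
```

In this sketch, early training masks isolated patches, which resembles random masking, while late training masks large coherent groups, which forces the model to infer whole regions from context. How the actual method derives its hierarchy and schedules the evolution is not stated in the abstract.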


