Contrastive Learning With Enhancing Detailed Information for Pre-Training Vision Transformer

IEEE Transactions on Circuits and Systems for Video Technology. Impact Factor: 11.1. CAS Zone 1 (Engineering & Technology). JCR Q1 (Engineering, Electrical & Electronic). Pub Date: 2024-09-11. DOI: 10.1109/TCSVT.2024.3457840

Zhuomin Liang;Liang Bai;Jinyu Fan;Xian Yang;Jiye Liang

Volume 35, No. 1, pp. 219-231. https://ieeexplore.ieee.org/document/10677373/

Citations: 0

Abstract

Contrastive Learning (CL) is an effective self-supervised learning method. It contrasts samples at the instance level using image representations, which enables the model to extract abstract information from images. However, when training data are insufficient, abstract information fails to distinguish samples from different classes. This problem is more severe when pre-training a Vision Transformer (ViT). In general, detailed information is crucial for making representations more discriminative. Patch representations, which focus on the details of images, are often overlooked by existing methods that train ViT through CL, so similar samples are confused. To address this problem, we propose a Contrastive Learning model with Enhancing Detailed Information (CL-EDI) for pre-training ViT. Our model consists of two ViT contrastive modules. The first module is similar to MoCo V3 and learns abstract information about images. The second module enhances detailed information in the data representations by aggregating the patch representations of images. Extensive experiments demonstrate the necessity of learning detailed information. Across several datasets, our model surpasses existing approaches in image classification, transfer learning, and object detection tasks.
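The two-module idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how patch representations are aggregated or how the two objectives are combined, so the mean pooling in `mean_pool`, the InfoNCE form in `info_nce`, the `temperature` value, and the `weight` parameter are all assumptions chosen only for illustration. The inputs stand in for the outputs of a ViT encoder applied to two augmented views of each image.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.2):
    """InfoNCE loss: queries[i] should match keys[i]; other keys are negatives.

    The temperature value is an assumption, not taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[i]
    return total / len(q)


def mean_pool(patch_tokens):
    """Aggregate one image's patch vectors by averaging (assumed aggregation)."""
    n = len(patch_tokens)
    return [sum(col) / n for col in zip(*patch_tokens)]


def two_branch_loss(cls_a, cls_b, patches_a, patches_b, weight=1.0):
    """Sum of an image-level loss and a patch-aggregate loss (hypothetical weighting)."""
    # Branch 1: contrast image-level representations of the two views.
    abstract_loss = info_nce(cls_a, cls_b)
    # Branch 2: contrast representations aggregated from patch tokens.
    detail_a = [mean_pool(p) for p in patches_a]
    detail_b = [mean_pool(p) for p in patches_b]
    detail_loss = info_nce(detail_a, detail_b)
    return abstract_loss + weight * detail_loss


if __name__ == "__main__":
    # Toy batch of two images, 2-dimensional features, two patches per image.
    cls_view = [[1.0, 0.0], [0.0, 1.0]]
    patch_view = [[[1.0, 0.1], [0.9, 0.0]], [[0.0, 1.0], [0.1, 0.9]]]
    print(two_branch_loss(cls_view, cls_view, patch_view, patch_view))
```

In a real pre-training setup the vectors would come from a query encoder and a momentum-updated key encoder, as in MoCo-style training; the sketch omits the encoders and shows only how an image-level term and a patch-aggregate term might be combined.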




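Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80
Self-citation rate: 27.40%
Articles published: 660
Review time: 5 months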

Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.
