Contrastive Learning With Enhancing Detailed Information for Pre-Training Vision Transformer

Impact factor: 11.1; Zone 1, Engineering and Technology; JCR Q1 (Engineering, Electrical & Electronic); IEEE Transactions on Circuits and Systems for Video Technology; Pub Date: 2024-09-11; DOI: 10.1109/TCSVT.2024.3457840
Zhuomin Liang;Liang Bai;Jinyu Fan;Xian Yang;Jiye Liang
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 219-231. Full text: https://ieeexplore.ieee.org/document/10677373/

Abstract

Contrastive Learning (CL) is an effective self-supervised learning method. It contrasts image representations at the instance level, which enables the model to extract abstract information from images. However, when training data is insufficient, abstract information fails to distinguish samples from different classes. This problem is more severe when pre-training a Vision Transformer (ViT). In general, detailed information is crucial for making representations more discriminative. Patch representations, which capture image details, are often overlooked by existing methods that train ViT through CL, so similar samples are confused. To address this problem, we propose a Contrastive Learning model with Enhancing Detailed Information (CL-EDI) for pre-training ViT. Our model consists of two ViT contrastive modules. The first is similar to MoCo V3 and learns abstract information about images. The second enhances detailed information in the data representations by aggregating the patch representations of images. Extensive experiments demonstrate the necessity of learning detailed information. Across several datasets, our model surpasses existing approaches in image classification, transfer learning, and object detection tasks.
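
The two-module idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how patch representations are aggregated or how the two objectives are combined, so everything below is an assumption made for illustration: mean pooling as the aggregation, an InfoNCE loss for both modules, a weighted sum to combine them, and the names `mean_pool`, `info_nce`, `dual_contrastive_loss`, `temperature`, and `weight`. MoCo V3 details such as the momentum encoder and the symmetrized loss are omitted. This is not the authors' implementation.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def mean_pool(patches):
    """Aggregate a list of patch vectors into one vector.

    Assumption: a plain mean; the paper's aggregation may differ.
    """
    n = len(patches)
    return [sum(col) / n for col in zip(*patches)]


def info_nce(queries, keys, temperature=0.2):
    """InfoNCE loss: queries[i] should match keys[i] against all other keys."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of this query to every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def dual_contrastive_loss(cls_a, cls_b, patches_a, patches_b, weight=1.0):
    """Hypothetical combination of an image-level and a patch-level term.

    cls_a, cls_b: per-image global vectors from two augmented views.
    patches_a, patches_b: per-image lists of patch vectors from the same views.
    """
    global_loss = info_nce(cls_a, cls_b)              # abstract information
    detail_a = [mean_pool(p) for p in patches_a]      # aggregated patch details
    detail_b = [mean_pool(p) for p in patches_b]
    detail_loss = info_nce(detail_a, detail_b)
    return global_loss + weight * detail_loss


if __name__ == "__main__":
    random.seed(0)
    batch, num_patches, dim = 4, 16, 8

    def rand_vec():
        return [random.gauss(0.0, 1.0) for _ in range(dim)]

    # Random stand-ins for the outputs of two ViT encoders on two views.
    cls_a = [rand_vec() for _ in range(batch)]
    cls_b = [rand_vec() for _ in range(batch)]
    patches_a = [[rand_vec() for _ in range(num_patches)] for _ in range(batch)]
    patches_b = [[rand_vec() for _ in range(num_patches)] for _ in range(batch)]
    print(dual_contrastive_loss(cls_a, cls_b, patches_a, patches_b))
```

In this sketch the image-level term only sees one global vector per image, while the patch-level term is driven by the pooled patch tokens, which is one simple way to let local detail influence the contrastive objective.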
Source journal
CiteScore: 13.80
Self-citation rate: 27.40%
Publications: 660
Review time: 5 months
Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.