DenseCL: A simple framework for self-supervised dense visual pre-training

Visual Informatics · Impact Factor 3.8 · JCR Q2 (Computer Science, Information Systems) · CAS Region 3 (Computer Science) · Volume 7, Pages 30–40 · Published: 2023-03-01 · DOI: 10.1016/j.visinf.2022.09.003
Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong
Citations: 0

Abstract

Self-supervised learning aims to learn a universal feature representation without labels. To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning framework that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. Specifically, we present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the supervised ImageNet pre-training and other self-supervised learning methods, our self-supervised DenseCL pre-training demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation. Specifically, our approach significantly outperforms the strong MoCo-v2 by 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation. The improvements are up to 3.5% AP and 8.8% mIoU over MoCo-v2, and 6.1% AP and 6.1% mIoU over supervised counterpart with frozen-backbone evaluation protocol.

Code and models are available at: https://git.io/DenseCL
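The abstract describes the core idea: a pairwise contrastive (dis)similarity loss optimized at the pixel level between two views, using correspondences between local features. A minimal sketch of such a dense InfoNCE-style loss is shown below; this is an illustration based on the abstract's description, not the authors' implementation. The argmax-based cross-view correspondence, the negative-feature queue, and the temperature value are assumptions.

```python
import numpy as np

def dense_contrastive_loss(f1, f2, negatives, tau=0.2):
    """Sketch of a pixel-level (dense) contrastive loss.

    f1, f2:     (S, D) L2-normalized local features from two views of one image
                (S = number of spatial locations, D = feature dimension).
    negatives:  (K, D) L2-normalized negative local features (assumed here to
                come from other images, e.g. a memory queue).
    tau:        softmax temperature (assumed value, not from the paper).
    """
    # Cross-view similarity between every pair of local features.
    sim = f1 @ f2.T                        # (S, S)
    # Correspondence heuristic: pair each location in view 1 with its most
    # similar location in view 2.
    match = sim.argmax(axis=1)             # (S,)
    pos = np.sum(f1 * f2[match], axis=1)   # (S,) positive similarity per location
    neg = f1 @ negatives.T                 # (S, K) similarities to negatives
    # InfoNCE: cross-entropy with the positive pair at index 0,
    # computed via a numerically stable log-sum-exp.
    logits = np.concatenate([pos[:, None], neg], axis=1) / tau
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.mean()
```

Averaging this per-location loss over all spatial positions is what distinguishes the dense objective from a global, image-level contrastive loss, which would pool the feature map into a single vector before contrasting.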

Source journal

Visual Informatics (Computer Science: Computer Graphics and Computer-Aided Design)
CiteScore: 6.70
Self-citation rate: 3.30%
Articles published per year: 33
Review time: 79 days