Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong
Visual Informatics, Volume 7, Issue 1, Pages 30-40. Published 2023-03-01. DOI: 10.1016/j.visinf.2022.09.003. Article page: https://www.sciencedirect.com/science/article/pii/S2468502X22000936
DenseCL: A simple framework for self-supervised dense visual pre-training
Self-supervised learning aims to learn a universal feature representation without labels. To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks because of the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning framework that works directly at the level of pixels (or local features) by taking into account the correspondence between local features. Specifically, we present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared with supervised ImageNet pre-training and other self-supervised learning methods, our self-supervised DenseCL pre-training demonstrates consistently superior performance when transferring to downstream dense prediction tasks, including object detection, semantic segmentation and instance segmentation. In particular, our approach significantly outperforms the strong MoCo-v2 baseline by 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation. Under the frozen-backbone evaluation protocol, the improvements are up to 3.5% AP and 8.8% mIoU over MoCo-v2, and 6.1% AP and 6.1% mIoU over the supervised counterpart.
Code and models are available at: https://git.io/DenseCL
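To make the idea of a pixel-level contrastive loss between two views more concrete, the following is a minimal sketch in plain Python, not the released DenseCL code. It assumes an InfoNCE-style loss averaged over local features; the function name dense_contrastive_loss, the argmax-similarity rule for choosing each positive, the explicit list of negatives and the temperature value of 0.2 are all illustrative assumptions that the abstract does not specify.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_contrastive_loss(view_q, view_k, negatives, temperature=0.2):
    """Average an InfoNCE-style loss over the local features of one view.

    view_q, view_k: lists of local feature vectors (one per spatial location)
        taken from two augmented views of the same image.
    negatives: list of feature vectors treated as negatives (hypothetical;
        in practice these would come from other images).
    temperature: softmax temperature (assumed value, not from the abstract).
    """
    q = [l2_normalize(v) for v in view_q]
    k = [l2_normalize(v) for v in view_k]
    neg = [l2_normalize(v) for v in negatives]

    total = 0.0
    for qi in q:
        # Assumed correspondence rule: the most similar local feature in the
        # other view is treated as the positive for this location.
        pos = max(k, key=lambda kj: dot(qi, kj))
        logits = [dot(qi, pos) / temperature]
        logits += [dot(qi, n) / temperature for n in neg]
        # Numerically stable log-sum-exp over positive and negative logits.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the positive at index 0.
        total += log_denominator - logits[0]
    return total / len(q)


if __name__ == "__main__":
    # Toy example: two locations with 2-D features per view.
    view_q = [[1.0, 0.0], [0.0, 1.0]]
    view_k = [[0.9, 0.1], [0.1, 0.9]]
    negatives = [[-1.0, 0.0], [0.0, -1.0]]
    print(dense_contrastive_loss(view_q, view_k, negatives))
```

The loss is small when each local feature in one view has a close match in the other view and is far from the negatives, and it grows when the two views disagree. A real implementation would operate on backbone feature maps with a tensor library rather than Python lists.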