Patch-level Representation Learning for Self-supervised Vision Transformers

Sukmin Yun, Hankook Lee, Jaehyung Kim, Jinwoo Shin
{"title":"自监督视觉变压器的补丁级表示学习","authors":"Sukmin Yun, Hankook Lee, Jaehyung Kim, Jinwoo Shin","doi":"10.1109/CVPR52688.2022.00817","DOIUrl":null,"url":null,"abstract":"Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advan-tages of the underlying neural network, as the current state-of-the-art visual pretext tasks for SSL do not enjoy the ben-efit, i.e., they are architecture-agnostic. In particular, we fo-cus on Vision Transformers (ViTs), which have gained much attention recently as a better architectural choice, often out-performing convolutional networks for various visual tasks. The unique characteristic of ViT is that it takes a sequence of disjoint patches from an image and processes patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined Self Patch, for learning better patch-level representations. To be specific, we enforce invariance against each patch and its neigh-bors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, training ViTs with Self-Patch learns more semantically meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, to downstream tasks of a dense prediction type. Despite its simplicity, we demonstrate that it can significantly improve the performance of existing SSL methods for various visual tasks, including object detection and semantic segmentation. Specifically, Self Patch signif-icantly improves the recent self-supervised ViT, DINO, by achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"290 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"37","resultStr":"{\"title\":\"Patch-level Representation Learning for Self-supervised Vision Transformers\",\"authors\":\"Sukmin Yun, Hankook Lee, Jaehyung Kim, Jinwoo Shin\",\"doi\":\"10.1109/CVPR52688.2022.00817\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advan-tages of the underlying neural network, as the current state-of-the-art visual pretext tasks for SSL do not enjoy the ben-efit, i.e., they are architecture-agnostic. In particular, we fo-cus on Vision Transformers (ViTs), which have gained much attention recently as a better architectural choice, often out-performing convolutional networks for various visual tasks. The unique characteristic of ViT is that it takes a sequence of disjoint patches from an image and processes patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined Self Patch, for learning better patch-level representations. To be specific, we enforce invariance against each patch and its neigh-bors, i.e., each patch treats similar neighboring patches as positive samples. 
Consequently, training ViTs with Self-Patch learns more semantically meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, to downstream tasks of a dense prediction type. Despite its simplicity, we demonstrate that it can significantly improve the performance of existing SSL methods for various visual tasks, including object detection and semantic segmentation. Specifically, Self Patch signif-icantly improves the recent self-supervised ViT, DINO, by achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.\",\"PeriodicalId\":355552,\"journal\":{\"name\":\"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)\",\"volume\":\"290 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"37\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CVPR52688.2022.00817\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR52688.2022.00817","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 37

Abstract

Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advantages of the underlying neural network, as the current state-of-the-art visual pretext tasks for SSL do not enjoy this benefit, i.e., they are architecture-agnostic. In particular, we focus on Vision Transformers (ViTs), which have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks. The unique characteristic of ViT is that it takes a sequence of disjoint patches from an image and processes patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations. To be specific, we enforce invariance against each patch and its neighbors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, training ViTs with SelfPatch learns more semantically meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, to downstream tasks of a dense prediction type. Despite its simplicity, we demonstrate that it can significantly improve the performance of existing SSL methods for various visual tasks, including object detection and semantic segmentation. Specifically, SelfPatch significantly improves the recent self-supervised ViT, DINO, by achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.
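To make the "similar neighbors as positives" idea concrete, the sketch below shows one way positive patches could be selected from a ViT's patch embeddings in PyTorch. The function name, the 8-connected neighborhood, and the top-k selection are illustrative assumptions for exposition only; the paper's full training pipeline (applied on top of DINO, per the abstract) involves additional components not shown here.

import torch
import torch.nn.functional as F

def select_neighbor_positives(patch_emb, grid_size, k=3):
    """For every patch, pick the k most similar spatially adjacent patches
    (by cosine similarity) to serve as its positive samples.

    patch_emb : (B, N, D) patch-level embeddings from a ViT encoder,
                with N == grid_size * grid_size.
    Returns   : (B, N, k) indices of the selected positive patches.
    """
    B, N, D = patch_emb.shape
    assert N == grid_size * grid_size

    # Pairwise cosine similarity between patch embeddings.
    z = F.normalize(patch_emb, dim=-1)
    sim = z @ z.transpose(1, 2)                           # (B, N, N)

    # Adjacency mask over the patch grid (8-connected neighborhood).
    idx = torch.arange(N)
    row, col = idx // grid_size, idx % grid_size
    adj = (torch.abs(row[:, None] - row[None, :]) <= 1) & \
          (torch.abs(col[:, None] - col[None, :]) <= 1)
    adj.fill_diagonal_(False)                             # exclude the patch itself

    # Keep only adjacent patches, then take the k most similar ones.
    # k <= 3 guarantees valid picks even for corner patches (3 neighbors).
    sim = sim.masked_fill(~adj, float("-inf"))
    return sim.topk(k, dim=-1).indices                    # (B, N, k)

# Example: ViT-S/16 on 224x224 images gives a 14x14 patch grid.
emb = torch.randn(2, 14 * 14, 384)
pos = select_neighbor_positives(emb, grid_size=14)        # (2, 196, 3)

In a contrastive or distillation objective, each patch representation would then be pulled toward the representations of its selected positives; the projection heads and the loss term are omitted from this sketch.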