Self-Distilled Vision Transformer for Domain Generalization

M. Sultana, Muzammal Naseer, Muhammad Haris Khan, Salman Khan, F. Khan
{"title":"Self-Distilled Vision Transformer for Domain Generalization","authors":"M. Sultana, Muzammal Naseer, Muhammad Haris Khan, Salman Khan, F. Khan","doi":"10.48550/arXiv.2207.12392","DOIUrl":null,"url":null,"abstract":"In the recent past, several domain generalization (DG) methods have been proposed, showing encouraging performance, however, almost all of them build on convolutional neural networks (CNNs). There is little to no progress on studying the DG performance of vision transformers (ViTs), which are challenging the supremacy of CNNs on standard benchmarks, often built on i.i.d assumption. This renders the real-world deployment of ViTs doubtful. In this paper, we attempt to explore ViTs towards addressing the DG problem. Similar to CNNs, ViTs also struggle in out-of-distribution scenarios and the main culprit is overfitting to source domains. Inspired by the modular architecture of ViTs, we propose a simple DG approach for ViTs, coined as self-distillation for ViTs. It reduces the overfitting of source domains by easing the learning of input-output mapping problem through curating non-zero entropy supervisory signals for intermediate transformer blocks. Further, it does not introduce any new parameters and can be seamlessly plugged into the modular composition of different ViTs. We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones in five challenging datasets. Moreover, we report favorable performance against recent state-of-the-art DG methods. Our code along with pre-trained models are publicly available at: https://github.com/maryam089/SDViT.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2207.12392","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Several domain generalization (DG) methods have recently been proposed and show encouraging performance; however, almost all of them build on convolutional neural networks (CNNs). There has been little to no progress in studying the DG performance of vision transformers (ViTs), which are challenging the supremacy of CNNs on standard benchmarks that are often built on the i.i.d. assumption. This renders the real-world deployment of ViTs doubtful. In this paper, we explore ViTs for addressing the DG problem. Similar to CNNs, ViTs also struggle in out-of-distribution scenarios, and the main culprit is overfitting to the source domains. Inspired by the modular architecture of ViTs, we propose a simple DG approach for ViTs, coined self-distillation for ViTs. It reduces overfitting to the source domains by easing the learning of the input-output mapping: non-zero entropy (soft) supervisory signals are curated for the intermediate transformer blocks. Further, it does not introduce any new parameters and can be seamlessly plugged into the modular composition of different ViTs. We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets. Moreover, we report favorable performance against recent state-of-the-art DG methods. Our code, along with pre-trained models, is publicly available at: https://github.com/maryam089/SDViT.
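Since the abstract describes the mechanism only at a high level, the following is a minimal sketch of how such intermediate-block self-distillation could look in PyTorch, assuming a timm-style ViT with accessible `patch_embed`, `cls_token`, `pos_embed`, `pos_drop`, `blocks`, `norm`, and `head` modules. The function name `sdvit_loss` and the hyperparameters `tau` and `alpha` are illustrative assumptions, not taken from the paper or the released code.

```python
# Hypothetical sketch of the self-distillation idea from the abstract:
# a randomly sampled intermediate block's class token is passed through the
# *shared* final head (so no new parameters) and distilled toward the soft,
# non-zero-entropy predictions of the final block.
import random
import torch
import torch.nn.functional as F

def sdvit_loss(vit, images, labels, tau=3.0, alpha=0.5):
    """Cross-entropy on the final output plus a soft distillation loss
    for one randomly sampled intermediate transformer block."""
    x = vit.patch_embed(images)
    cls = vit.cls_token.expand(x.shape[0], -1, -1)
    x = torch.cat((cls, x), dim=1) + vit.pos_embed
    x = vit.pos_drop(x)

    # Run all transformer blocks, keeping each block's (normalized) class token.
    cls_tokens = []
    for blk in vit.blocks:
        x = blk(x)
        cls_tokens.append(vit.norm(x)[:, 0])

    final_logits = vit.head(cls_tokens[-1])      # usual classification output
    ce = F.cross_entropy(final_logits, labels)

    # Sample one intermediate block and reuse the same head (no new parameters).
    idx = random.randrange(len(cls_tokens) - 1)
    inter_logits = vit.head(cls_tokens[idx])

    # Soft targets from the final block provide the non-zero-entropy signal.
    kl = F.kl_div(
        F.log_softmax(inter_logits / tau, dim=-1),
        F.softmax(final_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau

    return ce + alpha * kl
```

The key design point reflected here is that the intermediate block borrows the final classifier head rather than adding its own, which is what keeps the parameter count unchanged.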