The Role of ViT Design and Training in Robustness to Common Corruptions

Rui Tian; Zuxuan Wu; Qi Dai; Micah Goldblum; Han Hu; Yu-Gang Jiang
IEEE Transactions on Multimedia, vol. 27, pp. 1374-1385
Published: 2024-12-23 | DOI: 10.1109/TMM.2024.3521721
Impact Factor: 9.7 | JCR Q1 (Computer Science, Information Systems) | CAS Tier 1 (Computer Science)
Citations: 0

Abstract

Vision transformer (ViT) variants have made rapid advances on a variety of computer vision tasks. However, their performance on corrupted inputs, which are inevitable in realistic use cases due to variations in lighting and weather, has not been explored comprehensively. In this paper, we probe the robustness gap among ViT variants and ask how these modern architectural developments affect performance under common types of corruption. Through extensive and rigorous benchmarking, we demonstrate that simple architectural designs such as overlapping patch embedding and convolutional feed-forward networks can promote the robustness of ViTs. Moreover, since the de facto training of ViTs relies heavily on data augmentation, exactly which augmentation strategies make ViTs more robust is worth investigating. We survey the efficacy of previous methods and verify that adversarial noise training is powerful. In addition, we introduce a novel conditional method for generating dynamic augmentation parameters conditioned on input images, which offers state-of-the-art robustness to common corruptions.
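The abstract does not spell out the mechanism, but "overlapping patch embedding" generally means tokenizing the image with a stride smaller than the patch size, so neighboring tokens share pixels, in contrast to the vanilla ViT's non-overlapping grid. A minimal NumPy sketch of the difference (the sizes 224, 16, and 8 are illustrative assumptions, not values taken from the paper):

```python
import numpy as np

def extract_patches(img, patch, stride):
    """Extract (patch x patch) windows from img at the given stride.

    img: (H, W, C) array. Returns (num_patches, patch*patch*C) flattened tokens.
    """
    # All dense windows along H and W, then subsample every `stride` pixels.
    windows = np.lib.stride_tricks.sliding_window_view(img, (patch, patch), axis=(0, 1))
    windows = windows[::stride, ::stride]
    n_h, n_w = windows.shape[:2]
    return windows.reshape(n_h * n_w, -1)

img = np.zeros((224, 224, 3))

# Vanilla ViT: stride == patch size, so patches tile the image without overlap.
tokens_vit = extract_patches(img, patch=16, stride=16)
print(tokens_vit.shape)      # (196, 768): a 14 x 14 token grid

# Overlapping embedding: stride < patch size, adjacent tokens share pixels.
tokens_overlap = extract_patches(img, patch=16, stride=8)
print(tokens_overlap.shape)  # (729, 768): a denser 27 x 27 token grid
```

In practice both variants are implemented as a single strided convolution (kernel size = patch size), so "overlapping" simply means choosing stride < kernel size; the shared borders give each token local context that plausibly helps under pixel-level corruptions.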