Demystify Transformers & Convolutions in Modern Image Deep Networks

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6) · Pub Date: 2024-12-20 · DOI: 10.1109/TPAMI.2024.3520508

Xiaowei Hu; Min Shi; Weiyun Wang; Sitong Wu; Linjie Xing; Wenhai Wang; Xizhou Zhu; Lewei Lu; Jie Zhou; Xiaogang Wang; Yu Qiao; Jifeng Dai

Vol. 47, No. 4, pp. 2416-2428 · Journal Article · https://ieeexplore.ieee.org/document/10810738/

Abstract

Vision transformers have gained popularity recently, leading to the development of new vision backbones with improved features and consistent performance gains. However, these advancements are not solely attributable to novel feature transformation designs; certain benefits also arise from advanced network-level and block-level architectures. This paper aims to identify the real gains of popular convolution and attention operators through a detailed study. We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach, known as the “spatial token mixer” (STM). To facilitate an impartial comparison, we introduce a unified architecture to neutralize the impact of divergent network-level and block-level designs. Subsequently, various STMs are integrated into this unified framework for comprehensive comparative analysis. Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs, but performance differences persist among different STMs. Our detailed analysis also reveals various findings about different STMs, including effective receptive fields, invariance, and adversarial robustness tests.
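
As a rough illustration of the idea in the abstract (not the paper's actual architecture), the sketch below keeps one block design fixed and swaps only the spatial token mixer. The names `local_mixer`, `attention_mixer`, and `unified_block` are hypothetical, the mixers carry no learnable weights, and tokens are laid out as a 1-D sequence for brevity; a real backbone would add normalization, projections, and a channel MLP.

```python
import math
from typing import Callable, List

Tokens = List[List[float]]  # N tokens, each a C-dimensional feature vector


def local_mixer(x: Tokens, radius: int = 1) -> Tokens:
    """Convolution-like STM (assumed form): average over a fixed local window."""
    n = len(x)
    out = []
    for i in range(n):
        window = x[max(0, i - radius): min(n, i + radius + 1)]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def attention_mixer(x: Tokens) -> Tokens:
    """Attention-like STM (assumed form): softmax-weighted average of all tokens."""
    dim = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in x]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) / total
                    for c in range(dim)])
    return out


def unified_block(x: Tokens, stm: Callable[[Tokens], Tokens]) -> Tokens:
    """Shared block design: a residual connection around a pluggable STM."""
    mixed = stm(x)
    return [[a + b for a, b in zip(xi, mi)] for xi, mi in zip(x, mixed)]


# Same block, two different mixers: only the aggregation rule changes.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(unified_block(tokens, local_mixer))
print(unified_block(tokens, attention_mixer))
```

Because everything outside `stm` is shared, any difference between the two outputs comes from the aggregation rule alone, which is the kind of controlled comparison the abstract describes.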
