Sparse Double Descent in Vision Transformers: real or phantom threat?

Proceedings of the ... International Conference on Image Analysis and Processing. Pub Date: 2023-07-26. DOI: 10.48550/arXiv.2307.14253

Victor Quétu, Marta Milovanović, Enzo Tartaglione

Pages: 490-502

Citation count: 1

Abstract

Vision transformers (ViTs) have attracted broad interest in recent theoretical and empirical work. They achieve state-of-the-art results thanks to their attention-based approach, which improves the identification of key features and patterns within images by avoiding inductive bias, resulting in highly accurate image analysis. Meanwhile, recent studies have reported a "sparse double descent" phenomenon that can occur in modern deep-learning models, where extremely over-parametrized models can generalize well. This raises practical questions about the optimal model size and opens the search for the best trade-off between sparsity and performance: are vision transformers also prone to sparse double descent? Can we find a way to avoid this phenomenon? Our work investigates the occurrence of sparse double descent in ViTs. Although some works have shown that traditional architectures, like ResNet, are condemned to the sparse double descent phenomenon, for ViTs we observe that an optimally tuned $\ell_2$ regularization relieves it. However, everything comes at a cost: the optimal $\lambda$ sacrifices the potential compression of the ViT.
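To make the two quantities in the abstract concrete, the following is a minimal sketch of an $\ell_2$ penalty weighted by $\lambda$ and of a sparsity level obtained by pruning. The abstract does not say how sparsity is induced, so magnitude pruning is an assumption here, and the function names (`l2_penalty`, `magnitude_prune`, `sparsity_of`) and the toy weight vector are hypothetical, not taken from the paper.

```python
from typing import List


def l2_penalty(weights: List[float], lam: float) -> float:
    """Return (lam / 2) * sum(w^2), an l2 term that would be added to a training loss."""
    return 0.5 * lam * sum(w * w for w in weights)


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Assumption: magnitude pruning stands in for whatever pruning scheme is used.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(sparsity * len(weights))
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def sparsity_of(weights: List[float]) -> float:
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


if __name__ == "__main__":
    # Hypothetical toy weights; a real study would prune a trained ViT.
    toy_weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.4, 0.08]
    for s in (0.0, 0.25, 0.5, 0.75):
        pruned = magnitude_prune(toy_weights, s)
        print(s, sparsity_of(pruned), l2_penalty(pruned, lam=1e-2))
```

In a study of sparse double descent, one would presumably retrain and evaluate the model at each sparsity level and plot test performance against sparsity; the sketch only shows how the sparsity axis and the $\lambda$-weighted penalty are computed.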
