Distillation-free Scaling of Large SSMs for Images and Videos
Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall
arXiv:2409.11867, arXiv - CS - Computer Vision and Pattern Recognition, 2024-09-18
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences.
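For context, the discretized linear SSM recurrence underlying S4, and the data-dependent selective variant introduced with Mamba's S6 scan, can be written as follows; this is the standard formulation from the S4/Mamba literature, not notation specific to this paper:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$

where $\bar{A}$, $\bar{B}$, and $C$ are fixed after training in S4. The selective scan instead makes $B$, $C$, and the discretization step $\Delta$ functions of the current input, so the recurrence can decide per token what to retain in the state:

$$B_t = W_B\,x_t, \qquad C_t = W_C\,x_t, \qquad \Delta_t = \mathrm{softplus}(W_\Delta\,x_t), \qquad \bar{A}_t = \exp(\Delta_t A).$$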
However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation.
We analyze the distinct characteristics of Mamba-based and Attention-based models and propose a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance.
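The interleaving idea can be sketched as a stack that alternates an SSM-style token mixer with standard self-attention. The PyTorch sketch below is a minimal illustration of such an interleaved stack, not the paper's architecture: the block internals, the 1:1 interleaving ratio, and all names (`SimpleSSMBlock`, `d_model`, etc.) are assumptions for illustration, and a gated causal convolution stands in for the S6 selective scan to keep the code short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSSMBlock(nn.Module):
    """Stand-in for a Mamba-style token mixer: a gated causal depthwise
    convolution replaces the S6 selective scan to keep the sketch short."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        u, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        # Depthwise conv over the sequence axis, truncated so it stays causal.
        u = self.conv(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        return x + self.out_proj(F.silu(gate) * u)

class AttentionBlock(nn.Module):
    """Standard pre-norm multi-head self-attention block."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class InterleavedStack(nn.Module):
    """Alternate SSM-style and attention blocks (a 1:1 ratio is assumed here)."""

    def __init__(self, d_model: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            SimpleSSMBlock(d_model) if i % 2 == 0 else AttentionBlock(d_model)
            for i in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x

# Patch tokens from an image (or spatio-temporal tokens from a video) are mixed as:
tokens = torch.randn(2, 196, 384)  # (batch, num_patches, d_model)
print(InterleavedStack(d_model=384, depth=8)(tokens).shape)  # torch.Size([2, 196, 384])
```

The intuition behind alternating the two block types is that the SSM layers provide efficient long-sequence mixing while the attention layers contribute the stable global interactions that pure Mamba stacks reportedly lack at scale.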
We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts such as JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400, and Something-Something-v2 benchmarks shows that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to +1.7.
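Robustness claims of this kind are typically probed by re-encoding inputs at decreasing JPEG quality and tracking the accuracy drop. Below is a minimal sketch of such a corruption sweep using Pillow; it is an illustrative protocol, not the paper's evaluation code, and the model/dataset names in the comments are placeholders.

```python
import io
from PIL import Image

def jpeg_roundtrip(image: Image.Image, quality: int) -> Image.Image:
    """Re-encode an image as JPEG at the given quality (1-95) and decode it
    back, introducing realistic compression artifacts."""
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()

# Sweep quality levels and record accuracy at each one (model/images are placeholders):
# for quality in (90, 70, 50, 30, 10):
#     corrupted = [jpeg_roundtrip(img, quality) for img in images]
#     ...  # run the classifier on `corrupted` and log accuracy vs. quality
```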