Distillation-free Scaling of Large SSMs for Images and Videos

Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall
{"title":"图像和视频大型 SSM 的无蒸馏缩放","authors":"Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall","doi":"arxiv-2409.11867","DOIUrl":null,"url":null,"abstract":"State-space models (SSMs), exemplified by S4, have introduced a novel context\nmodeling method by integrating state-space techniques into deep learning.\nHowever, they struggle with global context modeling due to their\ndata-independent matrices. The Mamba model addressed this with data-dependent\nvariants via the S6 selective-scan algorithm, enhancing context modeling,\nespecially for long sequences. However, Mamba-based architectures are difficult\nto scale with respect to the number of parameters, which is a major limitation\nfor vision applications. This paper addresses the scalability issue of large\nSSMs for image classification and action recognition without requiring\nadditional techniques like knowledge distillation. We analyze the distinct\ncharacteristics of Mamba-based and Attention-based models, proposing a\nMamba-Attention interleaved architecture that enhances scalability, robustness,\nand performance. We demonstrate that the stable and efficient interleaved\narchitecture resolves the scalability issue of Mamba-based architectures for\nimages and videos and increases robustness to common artifacts like JPEG\ncompression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and\nSomething-Something-v2 benchmarks demonstrates that our approach improves the\naccuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Distillation-free Scaling of Large SSMs for Images and Videos\",\"authors\":\"Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall\",\"doi\":\"arxiv-2409.11867\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"State-space models (SSMs), exemplified by S4, have introduced a novel context\\nmodeling method by integrating state-space techniques into deep learning.\\nHowever, they struggle with global context modeling due to their\\ndata-independent matrices. The Mamba model addressed this with data-dependent\\nvariants via the S6 selective-scan algorithm, enhancing context modeling,\\nespecially for long sequences. However, Mamba-based architectures are difficult\\nto scale with respect to the number of parameters, which is a major limitation\\nfor vision applications. This paper addresses the scalability issue of large\\nSSMs for image classification and action recognition without requiring\\nadditional techniques like knowledge distillation. We analyze the distinct\\ncharacteristics of Mamba-based and Attention-based models, proposing a\\nMamba-Attention interleaved architecture that enhances scalability, robustness,\\nand performance. We demonstrate that the stable and efficient interleaved\\narchitecture resolves the scalability issue of Mamba-based architectures for\\nimages and videos and increases robustness to common artifacts like JPEG\\ncompression. 
Our thorough evaluation on the ImageNet-1K, Kinetics-400 and\\nSomething-Something-v2 benchmarks demonstrates that our approach improves the\\naccuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":\"15 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11867\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11867","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.
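The interleaved design described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example, not the authors' implementation: it assumes a toy data-dependent selective-scan block (SimpleSelectiveSSM, using a naive O(L) Python loop instead of Mamba's fused parallel scan) and an attention block inserted at a fixed interval (attn_every). All class names and hyperparameters here are illustrative; the paper's actual block design, interleaving pattern, and model sizes may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSelectiveSSM(nn.Module):
    # Toy S6-style block: B, C and the step size dt are projected from the input,
    # so the state transition becomes data-dependent (unlike S4's fixed matrices).
    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(dim, state_dim))  # log of per-channel decay rates
        self.proj = nn.Linear(dim, 2 * state_dim + 1)           # produces B, C, dt per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:         # x: (batch, seq_len, dim)
        b, l, d = x.shape
        n = self.A_log.shape[1]
        B, C, dt = torch.split(self.proj(x), [n, n, 1], dim=-1)
        dt = F.softplus(dt)                                      # positive step size, (b, l, 1)
        A = -torch.exp(self.A_log)                                # (d, n), negative => stable decay
        h = x.new_zeros(b, d, n)                                  # hidden state
        outputs = []
        for t in range(l):                                        # naive sequential scan, O(L)
            decay = torch.exp(dt[:, t, :, None] * A)              # (b, d, n)
            h = decay * h + dt[:, t, :, None] * B[:, t, None, :] * x[:, t, :, None]
            outputs.append((h * C[:, t, None, :]).sum(dim=-1))    # read-out, (b, d)
        return torch.stack(outputs, dim=1)                        # (b, l, d)


class MambaAttnInterleaved(nn.Module):
    # Pre-norm residual stack that interleaves SSM blocks with self-attention blocks.
    def __init__(self, dim: int = 192, depth: int = 8, num_heads: int = 3, attn_every: int = 2):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(depth))
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            if (i + 1) % attn_every == 0                           # e.g. every 2nd block is attention
            else SimpleSelectiveSSM(dim)
            for i in range(depth)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:      # tokens: (batch, seq_len, dim)
        for norm, block in zip(self.norms, self.blocks):
            h = norm(tokens)
            if isinstance(block, nn.MultiheadAttention):
                h, _ = block(h, h, h, need_weights=False)
            else:
                h = block(h)
            tokens = tokens + h                                    # residual connection
        return tokens


if __name__ == "__main__":
    model = MambaAttnInterleaved()
    patches = torch.randn(2, 196, 192)  # e.g. a 14x14 grid of image patch tokens
    print(model(patches).shape)         # torch.Size([2, 196, 192])

The sketch only conveys the structural idea: data-dependent SSM blocks handle most of the sequence mixing, while periodically inserted attention blocks contribute the global token interactions that the abstract credits with improved scalability and robustness.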