CS-Mixer: A Cross-Scale Vision Multilayer Perceptron With Spatial–Channel Mixing
Jonathan Cui; David A. Araujo; Suman Saha; Md Faisal Kabir
IEEE Transactions on Artificial Intelligence, vol. 5, no. 10, pp. 4915–4927, 2024. DOI: 10.1109/TAI.2024.3415551. https://ieeexplore.ieee.org/document/10562192/
Citations: 0
Abstract
Despite simpler architectural designs compared with vision transformers (ViTs) and convolutional neural networks, vision multilayer perceptrons (MLPs) have demonstrated strong performance and high data efficiency for image classification and semantic segmentation. Following pioneering works such as MLP-Mixers and gMLPs, later research proposed a plethora of vision MLP architectures that achieve token-mixing with specifically engineered convolution- or attention-like mechanisms. However, existing methods such as S²-MLPs and PoolFormers typically model spatial information in equal-sized spatial regions and do not consider cross-scale spatial interactions, thus delivering subpar performance compared with transformer models that employ global token-mixing. Further, these MLP token-mixers, along with most ViTs, only model one- or two-axis correlations among space and channels, avoiding simultaneous three-axis spatial–channel mixing due to its computational cost. We therefore propose CS-Mixer, a hierarchical vision MLP that learns dynamic low-rank transformations for tokens aggregated across scales, both locally and globally. Such aggregation allows for token-mixing that explicitly models spatial–channel interactions, made computationally feasible by a multi-head design that projects to low-dimensional subspaces. The proposed methodology achieves competitive results on popular image recognition benchmarks without incurring substantially more compute. Our largest model, CS-Mixer-L, reaches 83.2% top-1 accuracy on ImageNet-1k with 13.7 GFLOPs and 94M parameters.
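To make the abstract's central mechanism concrete, below is a minimal, self-contained PyTorch sketch of cross-scale token aggregation followed by low-rank, multi-head spatial–channel mixing. This illustrates the general idea only: the module name, the choice of scales, the pooling scheme, and the mixing rule are all assumptions made for exposition, not the authors' reference implementation.

```python
# Illustrative sketch (not the authors' code): tokens are aggregated at
# several spatial scales, projected into small per-head subspaces, mixed
# jointly across space and channels, and projected back to full width.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossScaleSpatialChannelMixer(nn.Module):
    def __init__(self, dim=64, num_heads=4, head_dim=8, scales=(1, 2, 4)):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.scales = scales
        # Low-rank bottleneck: channels are projected into num_heads
        # subspaces of dimension head_dim << dim, which is what makes
        # joint (three-axis) spatial-channel mixing affordable.
        self.to_heads = nn.Linear(dim, num_heads * head_dim)
        self.from_heads = nn.Linear(num_heads * head_dim, dim)

    def forward(self, x):
        # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        # Cross-scale aggregation: average-pool at coarser scales and
        # upsample back, so every scale contributes at full resolution.
        agg = x
        for s in self.scales[1:]:
            pooled = F.avg_pool2d(x, kernel_size=s, stride=s)
            agg = agg + F.interpolate(pooled, size=(H, W), mode="nearest")
        tokens = agg.flatten(2).transpose(1, 2)           # (B, N, C), N = H*W
        h = self.to_heads(tokens)                         # (B, N, heads*head_dim)
        h = h.view(B, -1, self.num_heads, self.head_dim)  # (B, N, heads, d)
        # Dynamic mixing inside each low-rank head: a data-dependent
        # soft aggregation over all N tokens (a stand-in for the paper's
        # learned low-rank transformations).
        weights = torch.softmax(h.mean(dim=-1, keepdim=True), dim=1)  # (B, N, heads, 1)
        mixed = h + (weights * h).sum(dim=1, keepdim=True)            # broadcast over N
        out = self.from_heads(mixed.flatten(2))           # (B, N, C)
        return out.transpose(1, 2).view(B, C, H, W)


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    block = CrossScaleSpatialChannelMixer()
    print(block(x).shape)  # torch.Size([2, 64, 32, 32])
```

The point of the per-head bottleneck is visible in the shapes: the mixing step operates on (N, head_dim) slices rather than full (N, C) tensors, so modeling spatial and channel interactions simultaneously stays tractable, consistent with the multi-head, low-dimensional-subspace design the abstract describes.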