Pub Date: 2024-06-18
DOI: 10.1109/TAI.2024.3415551
Jonathan Cui;David A. Araujo;Suman Saha;Md Faisal Kabir
Despite simpler architectural designs compared with vision transformers (ViTs) and convolutional neural networks, vision multilayer perceptrons (MLPs) have demonstrated strong performance and high data efficiency for image classification and semantic segmentation. Following pioneering works such as MLP-Mixers and gMLPs, later research proposed a plethora of vision MLP architectures that achieve token mixing with specifically engineered convolution- or attention-like mechanisms. However, existing methods such as $\text{S}^{2}$