{"title":"A Pyramid Fusion MLP for Dense Prediction","authors":"Qiuyu Huang;Zequn Jie;Lin Ma;Li Shen;Shenqi Lai","doi":"10.1109/TIP.2025.3526054","DOIUrl":null,"url":null,"abstract":"Recently, MLP-based architectures have achieved competitive performance with convolutional neural networks (CNNs) and vision transformers (ViTs) across various vision tasks. However, most MLP-based methods introduce local feature interactions to facilitate direct adaptation to downstream tasks, thereby lacking the ability to capture global visual dependencies and multi-scale context, ultimately resulting in unsatisfactory performance on dense prediction. This paper proposes a competitive and effective MLP-based architecture called Pyramid Fusion MLP (PFMLP) to address the above limitation. Specifically, each block in PFMLP introduces multi-scale pooling and fully connected layers to generate feature pyramids, which are subsequently fused using up-sample layers and an additional fully connected layer. Employing different down-sample rates allows us to obtain diverse receptive fields, enabling the model to simultaneously capture long-range dependencies and fine-grained cues, thereby exploiting the potential of global context information and enhancing the spatial representation power of the model. Our PFMLP is the first lightweight MLP to obtain comparable results with state-of-the-art CNNs and ViTs on the ImageNet-1K benchmark. With larger FLOPs, it exceeds state-of-the-art CNNs, ViTs, and MLPs under similar computational complexity. Furthermore, experiments in object detection, instance segmentation, and semantic segmentation demonstrate that the visual representation acquired from PFMLP can be seamlessly transferred to downstream tasks, producing competitive results. All materials contain the training codes and logs are released at <uri>https://github.com/huangqiuyu/PFMLP</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"455-467"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10841959/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Recently, MLP-based architectures have achieved performance competitive with convolutional neural networks (CNNs) and vision transformers (ViTs) across various vision tasks. However, most MLP-based methods introduce local feature interactions to facilitate direct adaptation to downstream tasks, and thus lack the ability to capture global visual dependencies and multi-scale context, ultimately resulting in unsatisfactory performance on dense prediction. This paper proposes a competitive and effective MLP-based architecture called Pyramid Fusion MLP (PFMLP) to address the above limitation. Specifically, each block in PFMLP introduces multi-scale pooling and fully connected layers to generate feature pyramids, which are subsequently fused using up-sample layers and an additional fully connected layer. Employing different down-sample rates yields diverse receptive fields, enabling the model to simultaneously capture long-range dependencies and fine-grained cues, thereby exploiting the potential of global context information and enhancing the spatial representation power of the model. Our PFMLP is the first lightweight MLP to obtain results comparable to state-of-the-art CNNs and ViTs on the ImageNet-1K benchmark. At larger FLOPs, it surpasses state-of-the-art CNNs, ViTs, and MLPs of similar computational complexity. Furthermore, experiments on object detection, instance segmentation, and semantic segmentation demonstrate that the visual representation acquired by PFMLP can be seamlessly transferred to downstream tasks, producing competitive results. All materials, including the training code and logs, are released at https://github.com/huangqiuyu/PFMLP.
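To make the block design described above concrete, below is a minimal PyTorch sketch of a pyramid-fusion block as the abstract describes it: multi-scale pooling plus fully connected layers produce a feature pyramid, which is up-sampled and fused by an additional fully connected layer. The class name, pooling rates, use of 1x1 convolutions as per-location fully connected layers, and the residual connection are assumptions for illustration only; the authors' actual implementation is available at the GitHub link above.

```python
# Sketch of a pyramid-fusion block based on the abstract's description.
# All specifics (pool_rates, 1x1-conv-as-FC, residual add) are assumptions,
# not the released PFMLP implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidFusionBlockSketch(nn.Module):
    def __init__(self, channels: int, pool_rates=(1, 2, 4)):
        super().__init__()
        self.pool_rates = pool_rates
        # One fully connected (1x1 conv) layer per pyramid level.
        self.level_fcs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in pool_rates
        )
        # Additional fully connected layer that fuses the up-sampled pyramid.
        self.fuse_fc = nn.Conv2d(channels * len(pool_rates), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        levels = []
        for rate, fc in zip(self.pool_rates, self.level_fcs):
            # Down-sample by the given rate (rate 1 keeps the original resolution),
            # apply the per-level fully connected layer, then up-sample back.
            y = F.avg_pool2d(x, kernel_size=rate) if rate > 1 else x
            y = fc(y)
            if rate > 1:
                y = F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
            levels.append(y)
        # Fuse the pyramid with the extra fully connected layer; residual connection assumed.
        return x + self.fuse_fc(torch.cat(levels, dim=1))


if __name__ == "__main__":
    block = PyramidFusionBlockSketch(channels=64)
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

The different pooling rates give each pyramid level a different effective receptive field, which is how such a block can mix fine-grained local cues with longer-range context before the fusion layer combines them.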