A Pyramid Fusion MLP for Dense Prediction

Qiuyu Huang, Zequn Jie, Lin Ma, Li Shen, Shenqi Lai
{"title":"用于密集预测的金字塔融合 MLP","authors":"Qiuyu Huang;Zequn Jie;Lin Ma;Li Shen;Shenqi Lai","doi":"10.1109/TIP.2025.3526054","DOIUrl":null,"url":null,"abstract":"Recently, MLP-based architectures have achieved competitive performance with convolutional neural networks (CNNs) and vision transformers (ViTs) across various vision tasks. However, most MLP-based methods introduce local feature interactions to facilitate direct adaptation to downstream tasks, thereby lacking the ability to capture global visual dependencies and multi-scale context, ultimately resulting in unsatisfactory performance on dense prediction. This paper proposes a competitive and effective MLP-based architecture called Pyramid Fusion MLP (PFMLP) to address the above limitation. Specifically, each block in PFMLP introduces multi-scale pooling and fully connected layers to generate feature pyramids, which are subsequently fused using up-sample layers and an additional fully connected layer. Employing different down-sample rates allows us to obtain diverse receptive fields, enabling the model to simultaneously capture long-range dependencies and fine-grained cues, thereby exploiting the potential of global context information and enhancing the spatial representation power of the model. Our PFMLP is the first lightweight MLP to obtain comparable results with state-of-the-art CNNs and ViTs on the ImageNet-1K benchmark. With larger FLOPs, it exceeds state-of-the-art CNNs, ViTs, and MLPs under similar computational complexity. Furthermore, experiments in object detection, instance segmentation, and semantic segmentation demonstrate that the visual representation acquired from PFMLP can be seamlessly transferred to downstream tasks, producing competitive results. All materials contain the training codes and logs are released at <uri>https://github.com/huangqiuyu/PFMLP</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"455-467"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Pyramid Fusion MLP for Dense Prediction\",\"authors\":\"Qiuyu Huang;Zequn Jie;Lin Ma;Li Shen;Shenqi Lai\",\"doi\":\"10.1109/TIP.2025.3526054\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, MLP-based architectures have achieved competitive performance with convolutional neural networks (CNNs) and vision transformers (ViTs) across various vision tasks. However, most MLP-based methods introduce local feature interactions to facilitate direct adaptation to downstream tasks, thereby lacking the ability to capture global visual dependencies and multi-scale context, ultimately resulting in unsatisfactory performance on dense prediction. This paper proposes a competitive and effective MLP-based architecture called Pyramid Fusion MLP (PFMLP) to address the above limitation. Specifically, each block in PFMLP introduces multi-scale pooling and fully connected layers to generate feature pyramids, which are subsequently fused using up-sample layers and an additional fully connected layer. Employing different down-sample rates allows us to obtain diverse receptive fields, enabling the model to simultaneously capture long-range dependencies and fine-grained cues, thereby exploiting the potential of global context information and enhancing the spatial representation power of the model. 
Our PFMLP is the first lightweight MLP to obtain comparable results with state-of-the-art CNNs and ViTs on the ImageNet-1K benchmark. With larger FLOPs, it exceeds state-of-the-art CNNs, ViTs, and MLPs under similar computational complexity. Furthermore, experiments in object detection, instance segmentation, and semantic segmentation demonstrate that the visual representation acquired from PFMLP can be seamlessly transferred to downstream tasks, producing competitive results. All materials contain the training codes and logs are released at <uri>https://github.com/huangqiuyu/PFMLP</uri>.\",\"PeriodicalId\":94032,\"journal\":{\"name\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"volume\":\"34 \",\"pages\":\"455-467\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10841959/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10841959/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recently, MLP-based architectures have achieved performance competitive with convolutional neural networks (CNNs) and vision transformers (ViTs) across various vision tasks. However, most MLP-based methods introduce local feature interactions to facilitate direct adaptation to downstream tasks, and therefore lack the ability to capture global visual dependencies and multi-scale context, ultimately resulting in unsatisfactory performance on dense prediction. This paper proposes a competitive and effective MLP-based architecture called Pyramid Fusion MLP (PFMLP) to address this limitation. Specifically, each block in PFMLP introduces multi-scale pooling and fully connected layers to generate feature pyramids, which are subsequently fused using up-sample layers and an additional fully connected layer. Employing different down-sample rates yields diverse receptive fields, enabling the model to capture long-range dependencies and fine-grained cues simultaneously, thereby exploiting global context information and enhancing the spatial representation power of the model. PFMLP is the first lightweight MLP to obtain results comparable with state-of-the-art CNNs and ViTs on the ImageNet-1K benchmark, and at larger FLOPs it surpasses state-of-the-art CNNs, ViTs, and MLPs of similar computational complexity. Furthermore, experiments on object detection, instance segmentation, and semantic segmentation demonstrate that the visual representations learned by PFMLP transfer seamlessly to downstream tasks, producing competitive results. All materials, including the training code and logs, are released at https://github.com/huangqiuyu/PFMLP.
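To make the block structure concrete, below is a minimal PyTorch sketch of the pyramid-fusion idea as the abstract describes it: multi-scale pooling, a fully connected layer per pyramid level, up-sampling back to the input resolution, and an additional fully connected layer for fusion. The pooling rates, the 1x1-convolution stand-ins for fully connected layers, the bilinear up-sampling, and the norm/residual placement are illustrative assumptions, not the authors' exact configuration; the released code linked above is authoritative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidFusionBlock(nn.Module):
    """Sketch of one pyramid-fusion token mixer.

    Pooling rates, the 1x1 convolutions standing in for fully connected
    layers, bilinear up-sampling, and the norm/residual placement are
    illustrative assumptions, not the paper's exact design.
    """

    def __init__(self, dim: int, pool_rates=(1, 2, 4)):
        super().__init__()
        self.pool_rates = pool_rates
        # One fully connected (1x1 conv) layer per pyramid level.
        self.scale_fcs = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=1) for _ in pool_rates
        )
        # Additional fully connected layer that fuses the up-sampled pyramid.
        self.fuse_fc = nn.Conv2d(dim * len(pool_rates), dim, kernel_size=1)
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        h, w = x.shape[-2:]
        levels = []
        for rate, fc in zip(self.pool_rates, self.scale_fcs):
            y = F.avg_pool2d(x, rate) if rate > 1 else x  # multi-scale pooling
            y = fc(y)                                     # per-scale FC
            if rate > 1:                                  # up-sample back to (H, W)
                y = F.interpolate(y, size=(h, w), mode="bilinear",
                                  align_corners=False)
            levels.append(y)
        fused = self.fuse_fc(torch.cat(levels, dim=1))    # fuse the pyramid
        return x + self.norm(fused)                       # residual connection


if __name__ == "__main__":
    block = PyramidFusionBlock(dim=64)
    out = block(torch.randn(2, 64, 56, 56))
    print(out.shape)  # torch.Size([2, 64, 56, 56])
```

The larger pooling rates give coarse levels that summarize long-range context cheaply, while the rate-1 path preserves fine-grained detail, which is how different down-sample rates translate into the diverse receptive fields the abstract refers to.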