MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping

Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh
{"title":"MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping","authors":"Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh","doi":"arxiv-2409.11316","DOIUrl":null,"url":null,"abstract":"Few-shot Semantic Segmentation addresses the challenge of segmenting objects\nin query images with only a handful of annotated examples. However, many\nprevious state-of-the-art methods either have to discard intricate local\nsemantic features or suffer from high computational complexity. To address\nthese challenges, we propose a new Few-shot Semantic Segmentation framework\nbased on the transformer architecture. Our approach introduces the spatial\ntransformer decoder and the contextual mask generation module to improve the\nrelational understanding between support and query images. Moreover, we\nintroduce a multi-scale decoder to refine the segmentation mask by\nincorporating features from different resolutions in a hierarchical manner.\nAdditionally, our approach integrates global features from intermediate encoder\nstages to improve contextual understanding, while maintaining a lightweight\nstructure to reduce complexity. This balance between performance and efficiency\nenables our method to achieve state-of-the-art results on benchmark datasets\nsuch as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings.\nNotably, our model with only 1.5 million parameters demonstrates competitive\nperformance while overcoming limitations of existing methodologies.\nhttps://github.com/amirrezafateh/MSDNet","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11316","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi-scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve state-of-the-art results on benchmark datasets such as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. https://github.com/amirrezafateh/MSDNet
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
MSDNet:通过变压器引导的原型设计实现少镜头语义分割的多尺度解码器
少镜头语义分割(Few-shot Semantic Segmentation)解决了在只有少量注释示例的情况下对查询图像中的物体进行分割的难题。然而,许多先前的先进方法要么不得不放弃错综复杂的局部语义特征,要么存在计算复杂度高的问题。为了应对这些挑战,我们提出了一种基于变换器架构的全新 "少镜头语义分割 "框架。我们的方法引入了空间变换器解码器和上下文掩码生成模块,以提高支持图像和查询图像之间的关联理解。此外,我们还引入了多尺度解码器,以分层的方式整合来自不同分辨率的特征,从而完善分割掩码。此外,我们的方法还整合了来自中间编码器阶段的全局特征,以提高上下文理解能力,同时保持轻量级结构以降低复杂性。这种性能与效率之间的平衡使我们的方法在基准数据集上取得了最先进的结果,例如在 1 次拍摄和 5 次拍摄设置中的 $PASCAL-5^i$ 和 $COCO-20^i$ 。值得注意的是,我们的模型只有 150 万个参数,在克服现有方法局限性的同时,还展示了具有竞争力的性能。https://github.com/amirrezafateh/MSDNet。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Massively Multi-Person 3D Human Motion Forecasting with Scene Context Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Precise Forecasting of Sky Images Using Spatial Warping JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation Applications of Knowledge Distillation in Remote Sensing: A Survey
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1