MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI:arxiv-2409.11316

Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh

{"title":"MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping","authors":"Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh","doi":"arxiv-2409.11316","DOIUrl":null,"url":null,"abstract":"Few-shot Semantic Segmentation addresses the challenge of segmenting objects\nin query images with only a handful of annotated examples. However, many\nprevious state-of-the-art methods either have to discard intricate local\nsemantic features or suffer from high computational complexity. To address\nthese challenges, we propose a new Few-shot Semantic Segmentation framework\nbased on the transformer architecture. Our approach introduces the spatial\ntransformer decoder and the contextual mask generation module to improve the\nrelational understanding between support and query images. Moreover, we\nintroduce a multi-scale decoder to refine the segmentation mask by\nincorporating features from different resolutions in a hierarchical manner.\nAdditionally, our approach integrates global features from intermediate encoder\nstages to improve contextual understanding, while maintaining a lightweight\nstructure to reduce complexity. This balance between performance and efficiency\nenables our method to achieve state-of-the-art results on benchmark datasets\nsuch as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings.\nNotably, our model with only 1.5 million parameters demonstrates competitive\nperformance while overcoming limitations of existing methodologies.\nhttps://github.com/amirrezafateh/MSDNet","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"155 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11316","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi-scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve state-of-the-art results on benchmark datasets such as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. https://github.com/amirrezafateh/MSDNet

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

MSDNet：通过变压器引导的原型设计实现少镜头语义分割的多尺度解码器

少镜头语义分割（Few-shot Semantic Segmentation）解决了在只有少量注释示例的情况下对查询图像中的物体进行分割的难题。然而，许多先前的先进方法要么不得不放弃错综复杂的局部语义特征，要么存在计算复杂度高的问题。为了应对这些挑战，我们提出了一种基于变换器架构的全新 "少镜头语义分割 "框架。我们的方法引入了空间变换器解码器和上下文掩码生成模块，以提高支持图像和查询图像之间的关联理解。此外，我们还引入了多尺度解码器，以分层的方式整合来自不同分辨率的特征，从而完善分割掩码。此外，我们的方法还整合了来自中间编码器阶段的全局特征，以提高上下文理解能力，同时保持轻量级结构以降低复杂性。这种性能与效率之间的平衡使我们的方法在基准数据集上取得了最先进的结果，例如在 1 次拍摄和 5 次拍摄设置中的 $PASCAL-5^i$ 和 $COCO-20^i$ 。值得注意的是，我们的模型只有 150 万个参数，在克服现有方法局限性的同时，还展示了具有竞争力的性能。https://github.com/amirrezafateh/MSDNet。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

arXiv - CS - Computer Vision and Pattern Recognition

自引率

0.00%

发文量

期刊最新文献

Massively Multi-Person 3D Human Motion Forecasting with Scene Context Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Precise Forecasting of Sky Images Using Spatial Warping JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation Applications of Knowledge Distillation in Remote Sensing: A Survey