Multi-scale feature correspondence and pseudo label retraining strategy for weakly supervised semantic segmentation

IF 4.2 · Zone 3, Computer Science · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Image and Vision Computing · Pub Date: 2024-08-10 · DOI: 10.1016/j.imavis.2024.105215

Weizheng Wang, Lei Zhou, Haonan Wang

Image and Vision Computing, volume 150, Article 105215. Journal Article. https://www.sciencedirect.com/science/article/pii/S0262885624003202

Citations: 0

Abstract

Recently, the performance of semantic segmentation using weakly supervised learning has improved significantly. Weakly supervised semantic segmentation (WSSS) that uses only image-level labels has received widespread attention; it employs Class Activation Maps (CAM) to generate pseudo labels. Compared with the traditional use of pixel-level labels, this technique greatly reduces annotation costs by relying on simpler and more readily available image-level annotations. However, because Convolutional Neural Networks (CNN) have only local perceptual ability, the generated CAM cannot activate the entire object area. Researchers have found that this CNN limitation can be compensated for by using a Vision Transformer (ViT). ViT, in turn, introduces an over-smoothing problem. Recent research has made good progress on this issue, but work on CAM and its related segmentation predictions tends to overlook their intrinsic information and the interrelationships between them. In this paper, we propose a Multi-Scale Feature Correspondence (MSFC) method. MSFC obtains the feature correspondence between CAM and segmentation predictions at different scales and re-extracts useful semantic information from them, which enhances the network's learning of feature information and improves the quality of CAM. Moreover, to further improve segmentation precision, we design a Pseudo Label Retraining Strategy (PLRS). This strategy refines accuracy in local regions and raises the quality of pseudo labels. Experimental results on the PASCAL VOC 2012 and MS COCO 2014 datasets show that our method achieves impressive performance among end-to-end WSSS methods.
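To make the phrase "employs Class Activation Maps (CAM) to generate pseudo labels" concrete, the following is a minimal sketch of the generic thresholding step commonly used in WSSS pipelines. It is not the MSFC or PLRS procedure from the paper, whose details the abstract does not give. The function name `cam_to_pseudo_label`, the input layout (one non-negative map per foreground class), and the background threshold of 0.3 are all assumptions chosen for illustration.

```python
import numpy as np


def cam_to_pseudo_label(cam, image_labels, bg_threshold=0.3):
    """Turn per-class activation maps into a pixel-level pseudo label.

    Hypothetical helper, not taken from the paper.
    cam:          array of shape (C, H, W), one map per foreground class.
    image_labels: length-C binary vector of image-level labels.
    Returns an (H, W) integer map: 0 = background, k + 1 = class k.
    """
    # Keep only positive evidence.
    cam = np.maximum(np.asarray(cam, dtype=np.float64), 0.0)

    # Image-level supervision: suppress classes absent from the image.
    present = np.asarray(image_labels, dtype=bool)
    cam = cam * present[:, None, None]

    # Normalize each class map by its own peak so scores lie in [0, 1].
    peak = cam.reshape(cam.shape[0], -1).max(axis=1)
    cam = cam / np.maximum(peak, 1e-8)[:, None, None]

    # Assign each pixel its strongest class, or background if the
    # strongest score falls below the (assumed) threshold.
    score = cam.max(axis=0)
    label = cam.argmax(axis=0) + 1
    label[score < bg_threshold] = 0
    return label


if __name__ == "__main__":
    toy_cam = np.array([
        [[4.0, 1.0], [0.0, 0.0]],   # class 0
        [[2.0, 2.0], [8.0, 0.0]],   # class 1
    ])
    print(cam_to_pseudo_label(toy_cam, image_labels=[1, 1]))
```

A single fixed threshold like this tends to leave labels incomplete or noisy, which is the kind of weakness that CAM-quality and pseudo-label refinement methods aim to address.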


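Source journal

Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months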

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.
