Title: Local optimization cropping and boundary enhancement for end-to-end weakly-supervised segmentation network
Authors: Weizheng Wang, Chao Zeng, Haonan Wang, Lei Zhou
Journal: Computer Vision and Image Understanding, Volume 251, Article 104260
Publication date: 2025-02-01
DOI: 10.1016/j.cviu.2024.104260
URL: https://www.sciencedirect.com/science/article/pii/S1077314224003412
Impact factor: 4.3 (JCR Q2, Computer Science, Artificial Intelligence)
Citations: 0
Abstract
In recent years, the performance of weakly-supervised semantic segmentation (WSSS) has improved significantly. WSSS usually employs image-level labels to generate Class Activation Maps (CAMs) for producing pseudo-labels, which greatly reduces the cost of annotation. Because CNNs cannot fully identify object regions, researchers have found that Vision Transformers (ViT) can compensate for this deficiency by better extracting global contextual information. However, ViT also introduces the problem of over-smoothing. Great progress has been made in recent years on the over-smoothing problem, yet two issues remain. First, the high-confidence regions in the network-generated CAM still contain areas irrelevant to the class. Second, CAM boundaries are inaccurate and include a small portion of background. Segmentation performance depends closely on the precision of label boundaries. In this work, to address the first issue, we propose a local optimized cropping module (LOC). By randomly cropping selected regions, we allow the local class tokens to be contrasted with the global class tokens, which improves consistency between local and global representations. To address the second issue, we design a boundary enhancement module (BE) that uses an erasing strategy to re-train on the image, increasing the network’s extraction of boundary information and greatly improving the accuracy of CAM boundaries, thereby enhancing the quality of the pseudo-labels. Experiments on the PASCAL VOC dataset show that our proposed LOC-BE Net outperforms multi-stage methods and is competitive with end-to-end methods. On the PASCAL VOC dataset, our method achieves a CAM mIoU of 74.2% and a segmentation mIoU of 73.1%. On the COCO2014 dataset, it achieves a CAM mIoU of 43.8% and a segmentation mIoU of 43.4%. Our code is open source: https://github.com/whn786/LOC-BE/tree/main.
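The results above are reported as mIoU (mean intersection-over-union). As a reading aid, the following is a minimal sketch of how that metric is commonly computed for a pair of label maps. It is not taken from the authors' repository; the function name `mean_iou`, the nested-list input format, and the use of 255 as an ignored "void" label are assumptions made for illustration.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
    ignore_index: int = 255,  # assumed "void" label, skipped in scoring
) -> float:
    """Mean IoU over the classes that occur in pred or target.

    pred and target are 2-D label maps of equal shape; every prediction
    must lie in the range [0, num_classes).
    """
    inter = [0] * num_classes  # pixels where pred == target == class c
    union = [0] * num_classes  # pixels where pred == c or target == c

    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1

    # Classes that never occur have an empty union and are left out.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    target = [[0, 1], [1, 1]]
    # Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
    print(round(mean_iou(pred, target, num_classes=2), 4))
```

Benchmark scores are normally obtained by accumulating the intersection and union counts over the whole validation set before dividing, rather than averaging per-image values, so this per-pair version only illustrates the definition.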
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems