UniMatch V2: Pushing the Limit of Semi-Supervised Semantic Segmentation

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6) | Pub Date: 2025-01-13 | DOI: 10.1109/TPAMI.2025.3528453

Lihe Yang;Zhen Zhao;Hengshuang Zhao

Volume 47, Issue 4, pp. 3031-3048. https://ieeexplore.ieee.org/document/10839097/

Citations: 0

Abstract




Semi-supervised semantic segmentation (SSS) aims to learn rich visual knowledge from cheap unlabeled images to enhance semantic segmentation capability. Among recent works, UniMatch (Yang et al. 2023) improves substantially on its predecessors by amplifying the practice of weak-to-strong consistency regularization. Subsequent works typically follow similar pipelines and propose various delicate designs. Despite this progress, and even in this flourishing era of numerous powerful vision models, almost all SSS works still 1) use outdated ResNet encoders with small-scale ImageNet-1K pre-training, and 2) evaluate on the simple Pascal and Cityscapes datasets. In this work, we argue that it is necessary to switch the baseline of SSS from ResNet-based encoders to more capable ViT-based encoders (e.g., DINOv2) that are pre-trained on massive data. A simple update of the encoder (even one using 2× fewer parameters) can bring more significant improvement than careful method designs. Built on this competitive baseline, we present our upgraded and simplified UniMatch V2, which inherits the core spirit of weak-to-strong consistency from V1 but requires less training cost and provides consistently better results. Additionally, given the gradually saturating performance on Pascal and Cityscapes, we call for a focus on more challenging benchmarks with complex taxonomy, such as the ADE20K and COCO datasets.
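The abstract names weak-to-strong consistency regularization without spelling it out. As a rough illustration only, the sketch below shows the generic form of that idea: the prediction on a weakly augmented view supplies a pseudo-label, and the prediction on a strongly augmented view of the same image is trained to match it wherever the pseudo-label is confident. This is not the UniMatch V2 implementation; the function names, the 0.95 threshold, and the toy logits are all assumptions made for the example, and a real pipeline would compute this on model outputs in a deep learning framework, with no gradient flowing through the weak branch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weak_to_strong_loss(weak_logits, strong_logits, threshold=0.95):
    """Generic weak-to-strong consistency loss over unlabeled pixels.

    weak_logits / strong_logits: per-pixel lists of class logits for the
    weakly and strongly augmented views of the same image (hypothetical
    inputs; a real model would produce them).
    threshold: assumed confidence cutoff for keeping a pseudo-label.
    """
    total = 0.0
    for w, s in zip(weak_logits, strong_logits):
        probs_w = softmax(w)
        confidence = max(probs_w)
        if confidence < threshold:
            continue  # low-confidence pixel contributes no loss
        pseudo_label = probs_w.index(confidence)
        # Cross-entropy of the strong-view prediction against the pseudo-label.
        total += -math.log(softmax(s)[pseudo_label])
    # Average over all pixels, so masked-out pixels count as zero loss.
    return total / len(weak_logits) if weak_logits else 0.0


# Toy example: two pixels, three classes.
weak = [[8.0, 0.0, 0.0], [0.2, 0.1, 0.0]]    # pixel 0 confident, pixel 1 not
strong = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
loss = weak_to_strong_loss(weak, strong)
```

In the toy example only the first pixel passes the threshold, so the second pixel's disagreement between views is ignored; the threshold is what keeps noisy pseudo-labels from dominating the unlabeled loss.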

