{"title":"ESDA: Zero-shot semantic segmentation based on an embedding semantic space distribution adjustment strategy","authors":"Jiaguang Li, Ying Wei, Wei Zhang, Chuyuan Wang","doi":"10.1016/j.imavis.2025.105456","DOIUrl":null,"url":null,"abstract":"<div><div>Recently, the CLIP model, which is pre-trained on large-scale vision-language data, has promoted the development of zero-shot recognition tasks. Some researchers apply CLIP to zero-shot semantic segmentation, but they often struggle to achieve satisfactory results. This is because this dense prediction task requires not only a precise understanding of semantics, but also a precise perception of different regions within one image. However, CLIP is trained on image-level vision-language data, resulting in ineffective perception of pixel-level regions. In this paper, we propose a new zero-shot semantic segmentation (ZS3) method based on an embedding semantic space distribution adjustment strategy (ESDA), which enables CLIP to accurately perceive both semantics and regions. This method inserts additional trainable blocks into the CLIP image encoder, enabling it to effectively perceive regions without losing semantic understanding. Besides, we design spatial distribution losses to guide the update of parameters of the trainable blocks, thereby further enhancing the regional characteristics of pixel-level image embeddings. In addition, previous methods only obtain semantic support through a text [CLS] token, which is far from sufficient for the dense prediction task. Therefore, we design a vision-language embedding interactor, which can obtain richer semantic support through the interaction between the entire text embedding and image embedding. It can also further enhance the semantic support and strengthen the image embedding. Plenty of experiments on PASCAL-<span><math><msup><mrow><mn>5</mn></mrow><mrow><mi>i</mi></mrow></msup></math></span> and COCO-<span><math><mrow><mn>2</mn><msup><mrow><mn>0</mn></mrow><mrow><mi>i</mi></mrow></msup></mrow></math></span> prove the effectiveness of our method. Our method achieves new state-of-the-art for zero-shot semantic segmentation and exceeds many few-shot semantic segmentation methods. Codes are available at <span><span>https://github.com/Jiaguang-NEU/ESDA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"155 ","pages":"Article 105456"},"PeriodicalIF":4.2000,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625000447","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Recently, the CLIP model, which is pre-trained on large-scale vision-language data, has promoted the development of zero-shot recognition tasks. Some researchers apply CLIP to zero-shot semantic segmentation, but they often struggle to achieve satisfactory results. This is because this dense prediction task requires not only a precise understanding of semantics, but also a precise perception of different regions within one image. However, CLIP is trained on image-level vision-language data, resulting in ineffective perception of pixel-level regions. In this paper, we propose a new zero-shot semantic segmentation (ZS3) method based on an embedding semantic space distribution adjustment strategy (ESDA), which enables CLIP to accurately perceive both semantics and regions. This method inserts additional trainable blocks into the CLIP image encoder, enabling it to effectively perceive regions without losing semantic understanding. Moreover, we design spatial distribution losses to guide the update of the trainable blocks' parameters, thereby further enhancing the regional characteristics of pixel-level image embeddings. In addition, previous methods only obtain semantic support through a text [CLS] token, which is far from sufficient for the dense prediction task. Therefore, we design a vision-language embedding interactor, which obtains richer semantic support through the interaction between the entire text embedding and the image embedding; it also further enhances the semantic support and strengthens the image embedding. Extensive experiments on PASCAL-5i and COCO-20i demonstrate the effectiveness of our method. Our method achieves a new state-of-the-art for zero-shot semantic segmentation and surpasses many few-shot semantic segmentation methods. Code is available at https://github.com/Jiaguang-NEU/ESDA.
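To make the architectural idea in the abstract concrete, below is a minimal PyTorch-style sketch of its two named ingredients: trainable blocks interleaved with a frozen CLIP-like image encoder, and a vision-language embedding interactor that lets pixel-level image tokens attend to the entire text embedding rather than only the [CLS] token. All module names (AdapterBlock, VLInteractor, ESDAStyleSegmenter), the bottleneck-adapter design, and the cross-attention wiring are illustrative assumptions rather than the authors' released implementation, and the spatial distribution losses are omitted; see the linked repository for the actual code.

```python
# Illustrative sketch only: module names and wiring are assumptions, not the ESDA release.
import torch
import torch.nn as nn


class AdapterBlock(nn.Module):
    """Trainable bottleneck block inserted after a frozen encoder layer (hypothetical design)."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual adapter: frozen features pass through unchanged plus a small learned update.
        return x + self.up(self.act(self.down(self.norm(x))))


class VLInteractor(nn.Module):
    """Cross-attention from image patch tokens to the entire text embedding (hypothetical)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Each pixel-level token gathers semantic support from all text tokens, not just [CLS].
        out, _ = self.attn(query=img_tokens, key=text_tokens, value=text_tokens)
        return self.norm(img_tokens + out)


class ESDAStyleSegmenter(nn.Module):
    """Frozen CLIP-like ViT with trainable blocks interleaved, followed by the interactor.

    The frozen encoder is stood in for by plain TransformerEncoderLayers; in practice the
    pre-trained CLIP image encoder would be used and kept fixed.
    """

    def __init__(self, dim: int = 512, depth: int = 4):
        super().__init__()
        self.frozen_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth)
        )
        for p in self.frozen_layers.parameters():
            p.requires_grad_(False)  # keep the "CLIP" weights fixed
        self.adapters = nn.ModuleList(AdapterBlock(dim) for _ in range(depth))
        self.interactor = VLInteractor(dim)

    def forward(self, img_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        x = img_tokens
        for layer, adapter in zip(self.frozen_layers, self.adapters):
            x = adapter(layer(x))  # trainable block after each frozen layer
        x = self.interactor(x, text_tokens)  # semantic support from the full text embedding
        # Per-patch score: cosine similarity to the text [CLS] token (assumed at index 0).
        patches = nn.functional.normalize(x, dim=-1)
        cls_embed = nn.functional.normalize(text_tokens[:, 0], dim=-1)
        return torch.einsum("bnd,bd->bn", patches, cls_embed)  # (B, num_patches)


if __name__ == "__main__":
    model = ESDAStyleSegmenter(dim=512, depth=4)
    img = torch.randn(2, 196, 512)   # 14x14 patch tokens in a 512-d embedding space
    txt = torch.randn(2, 77, 512)    # CLIP-length text token sequence
    print(model(img, txt).shape)     # torch.Size([2, 196])
```

The toy run at the bottom only checks shapes: the per-patch similarity map would still need to be reshaped to the spatial grid and upsampled to obtain a segmentation mask, and the frozen/trainable split is the point of the design, since only the adapters and the interactor receive gradients.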
Journal description:
The primary aim of Image and Vision Computing is to provide an effective medium for the interchange of results from high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding of the discipline by encouraging the quantitative comparison and performance evaluation of proposed methodology. Coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.