Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models

BMVC: Proceedings of the British Machine Vision Conference. Pub Date: 2022-10-27. DOI: 10.48550/arXiv.2210.15138

Chao Ma, Yu-Hao Yang, Yanfeng Wang, Ya Zhang, Weidi Xie


Citations: 28

Abstract

When trained at sufficient scale, self-supervised learning has shown a notable ability to solve a wide range of visual or language understanding tasks. In this paper, we investigate simple yet effective approaches for adapting pre-trained foundation models to a downstream task of interest, namely open-vocabulary semantic segmentation. To this end, we make the following contributions: (i) we introduce Fusioner, which uses a lightweight, transformer-based fusion module to pair frozen visual representations with language concepts using only a handful of image segmentation data. As a consequence, the model gains the capability of zero-shot transfer to segment novel categories; (ii) without loss of generality, we experiment on a broad range of self-supervised models pre-trained with different schemes, e.g. visual-only models (MoCo v3, DINO), language-only models (BERT), and a visual-language model (CLIP), and show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data; (iii) we conduct thorough ablation studies to analyze the critical components of Fusioner. When evaluated on standard benchmarks, e.g. PASCAL-5i and COCO-20i, it surpasses existing state-of-the-art models by a large margin, despite being trained only on frozen visual and language features; (iv) to measure the model's robustness in learning visual-language correspondence, we further evaluate on a synthetic dataset named Mosaic-4, where images are constructed by mosaicking samples from FSS-1000. Fusioner demonstrates superior performance over previous models.
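To make the idea of pairing visual features with language concepts concrete, here is a minimal sketch of one plausible final matching step. It is not the authors' implementation and it omits the transformer-based fusion module entirely. It assumes that per-pixel visual features and category text embeddings already live in a shared space, and it labels each pixel by thresholded cosine similarity. The function names `cosine` and `segment`, the threshold value, and the toy feature values are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment(pixel_feats, text_embeds, threshold=0.5):
    """Return one binary mask per category name.

    pixel_feats: H x W grid of feature vectors (assumed to come from a
                 frozen visual encoder, already aligned with the text space).
    text_embeds: dict mapping category name -> embedding vector (assumed
                 to come from a frozen language encoder).
    """
    masks = {}
    for name, text_vec in text_embeds.items():
        masks[name] = [
            [1 if cosine(feat, text_vec) > threshold else 0 for feat in row]
            for row in pixel_feats
        ]
    return masks


# Toy 2 x 2 feature map with 2-D features (made-up values for illustration).
feats = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
# "dog" stands in for a category that was not seen during training:
# supporting it only requires adding a new text embedding.
embeds = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
masks = segment(feats, embeds)
```

Under this assumption, segmenting a novel category at test time needs no change to the visual side: a new entry in `text_embeds` is enough, which is the sense in which such a model is open-vocabulary.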

