Language-Aware Vision Transformer for Referring Segmentation

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6) | Pub Date: 2024-09-25 | DOI: 10.1109/TPAMI.2024.3468640

Zhao Yang; Jiaqi Wang; Xubing Ye; Yansong Tang; Kai Chen; Hengshuang Zhao; Philip H. S. Torr

Volume 47, issue 7, pages 5238-5255. Full text: https://ieeexplore.ieee.org/document/10694805/


Abstract

Referring segmentation is a fundamental vision-language task that aims to segment an object from an image or video according to a natural language description. One of the key challenges in this task is using the referring expression to highlight the relevant positions in the image or video frames. A common paradigm for tackling this problem in both the image and the video domains is to use a powerful vision-language (“cross-modal”) decoder to fuse features extracted independently by a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent with the Transformer’s overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder to extract helpful multi-modal context. This way, accurate segmentation results can be obtained with a lightweight mask predictor. One of the key components of the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. For video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in parallel, which can exploit spatio-temporal dependencies at different levels of granularity. We further introduce unified LAVT, a single framework that can handle both image and video inputs, with enhanced segmentation capability on the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
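As a rough illustration of what "a dense attention mechanism for collecting pixel-specific linguistic cues" can look like, the sketch below lets every pixel feature attend over all word features and then adds the attended language vector back to the pixel. This is a minimal sketch and not the authors' implementation: the function names, the scaled dot-product scoring, the absence of learned projections, and the residual-addition fusion are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_word_attention(pixels, words):
    """Hypothetical dense pixel-to-word attention (illustrative only).

    pixels: list of N vectors (lists of C floats), one per spatial position.
    words:  list of T vectors (lists of C floats), one per token.
    Returns (fused, weights), where weights[i] is the attention
    distribution of pixel i over the T words and fused[i] is pixel i
    plus its attended language vector.
    """
    c = len(pixels[0])
    scale = 1.0 / math.sqrt(c)
    fused, weights = [], []
    for p in pixels:
        # Scaled dot-product score between this pixel and every word.
        scores = [scale * sum(a * b for a, b in zip(p, w)) for w in words]
        attn = softmax(scores)
        # Pixel-specific language vector: attention-weighted sum of words.
        lang = [sum(a * w[k] for a, w in zip(attn, words)) for k in range(c)]
        # Assumption: fuse by residual addition; a real model may gate or project.
        fused.append([x + y for x, y in zip(p, lang)])
        weights.append(attn)
    return fused, weights


# Toy example: two pixels and two words in a 2-dimensional feature space.
pixels = [[1.0, 0.0], [0.0, 1.0]]
words = [[4.0, 0.0], [0.0, 4.0]]
fused, weights = pixel_word_attention(pixels, words)
```

In this toy setup each pixel places most of its attention on the word whose feature points in the same direction, which is the sense in which the collected linguistic cue is specific to each pixel. Performing such a fusion inside intermediate encoder layers, rather than in a separate decoder, is the early-fusion idea the abstract describes.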


