{"title":"Visual Grounding with Multi-modal Conditional Adaptation","authors":"Ruilin Yao, Shengwu Xiong, Yichen Zhao, Yi Rong","doi":"arxiv-2409.04999","DOIUrl":null,"url":null,"abstract":"Visual grounding is the task of locating objects specified by natural\nlanguage expressions. Existing methods extend generic object detection\nframeworks to tackle this task. They typically extract visual and textual\nfeatures separately using independent visual and textual encoders, then fuse\nthese features in a multi-modal decoder for final prediction. However, visual\ngrounding presents unique challenges. It often involves locating objects with\ndifferent text descriptions within the same image. Existing methods struggle\nwith this task because the independent visual encoder produces identical visual\nfeatures for the same image, limiting detection performance. Some recently\napproaches propose various language-guided visual encoders to address this\nissue, but they mostly rely solely on textual information and require\nsophisticated designs. In this paper, we introduce Multi-modal Conditional\nAdaptation (MMCA), which enables the visual encoder to adaptively update\nweights, directing its focus towards text-relevant regions. Specifically, we\nfirst integrate information from different modalities to obtain multi-modal\nembeddings. Then we utilize a set of weighting coefficients, which generated\nfrom the multimodal embeddings, to reorganize the weight update matrices and\napply them to the visual encoder of the visual grounding model. Extensive\nexperiments on four widely used datasets demonstrate that MMCA achieves\nsignificant improvements and state-of-the-art results. Ablation experiments\nfurther demonstrate the lightweight and efficiency of our method. Our source\ncode is available at: https://github.com/Mr-Bigworth/MMCA.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.04999","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task: they typically extract visual and textual features separately with independent visual and textual encoders, then fuse these features in a multi-modal decoder for the final prediction. However, visual grounding presents unique challenges. It often involves locating objects described by different text expressions within the same image. Existing methods struggle in this setting because the independent visual encoder produces identical visual features for the same image regardless of the expression, limiting detection performance. Some recent approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update its weights, directing its focus towards text-relevant regions. Specifically, we first integrate information from different modalities to obtain multi-modal embeddings. Then we use a set of weighting coefficients, generated from the multi-modal embeddings, to reorganize the weight update matrices and apply them to the visual encoder of the visual grounding model. Extensive experiments on four widely used datasets demonstrate that MMCA achieves significant improvements and state-of-the-art results. Ablation experiments further demonstrate that our method is lightweight and efficient. Our source code is available at: https://github.com/Mr-Bigworth/MMCA.
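
To make the described mechanism concrete, below is a minimal PyTorch-style sketch of the idea in the abstract: coefficients predicted from a fused multi-modal embedding re-weight a bank of low-rank weight-update factors, which are then applied on top of a frozen visual-encoder layer. This is an illustrative assumption of how such a conditional adaptation could look; all class, parameter, and shape names are hypothetical and are not taken from the authors' implementation (see the linked repository for the actual MMCA code).

```python
# Hypothetical sketch of multi-modal conditional adaptation (not the authors' code).
import torch
import torch.nn as nn


class MMConditionalAdapter(nn.Module):
    """Adapts a frozen linear layer of a visual encoder with a low-rank weight
    update whose rank-1 components are re-weighted by coefficients predicted
    from a fused multi-modal (visual + textual) embedding."""

    def __init__(self, dim: int, rank: int = 8, embed_dim: int = 256):
        super().__init__()
        self.base = nn.Linear(dim, dim)  # stands in for a frozen encoder layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Bank of `rank` low-rank update factors (one rank-1 term per row pair).
        self.down = nn.Parameter(torch.randn(rank, dim) * 0.01)  # (rank, dim_in)
        self.up = nn.Parameter(torch.zeros(rank, dim))           # (rank, dim_out)
        # Predicts one weighting coefficient per update direction from the
        # multi-modal embedding.
        self.coef_head = nn.Linear(embed_dim, rank)

    def forward(self, x: torch.Tensor, mm_embed: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), mm_embed: (batch, embed_dim)
        coef = torch.softmax(self.coef_head(mm_embed), dim=-1)   # (batch, rank)
        # Reorganize the weight-update matrix as a coefficient-weighted sum of
        # rank-1 terms: delta_W[b] = sum_r coef[b, r] * outer(up[r], down[r]).
        delta_w = torch.einsum("br,re,rd->bed", coef, self.up, self.down)
        # Frozen base projection plus the expression-conditioned update.
        return self.base(x) + torch.einsum("btd,bed->bte", x, delta_w)


# Example usage with made-up shapes: ViT-style patch tokens and a fused embedding.
adapter = MMConditionalAdapter(dim=768, rank=8, embed_dim=256)
x = torch.randn(2, 196, 768)       # visual tokens for two images
mm_embed = torch.randn(2, 256)     # fused visual-textual embeddings
out = adapter(x, mm_embed)         # (2, 196, 768)
```

In this sketch the same image tokens produce different adapted features when paired with different expressions, because the weight update depends on the multi-modal embedding rather than on the image alone, which is the behavior the abstract attributes to MMCA.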