Visual-guided hierarchical iterative fusion for multi-modal video action recognition

Bingbing Zhang, Ying Zhang, Jianxin Zhang, Qiule Sun, Rong Wang, Qiang Zhang

Pattern Recognition Letters, Volume 186, Pages 213-220 (October 2024). DOI: 10.1016/j.patrec.2024.10.003
Abstract
Vision-Language Models (VLMs) have shown promising improvements on various visual tasks. Most existing VLMs employ two separate transformer-based encoders, each dedicated to modeling visual or language features independently. Because the visual and language features are unaligned in the feature space, it is challenging for the multi-modal encoder to learn vision-language interactions. In this paper, we propose a Visual-guided Hierarchical Iterative Fusion (VgHIF) method for VLMs in video action recognition, which acquires more discriminative vision and language representations. VgHIF leverages visual features from different levels of the visual encoder to interact with the language representation. The interaction is processed by an attention mechanism that calculates the correlation between the visual features and the language representation. VgHIF learns grounded video-text representations and supports many different pre-trained VLMs in a flexible and efficient manner at a tiny computational cost. We conducted experiments on Kinetics-400, Mini-Kinetics-200, HMDB51, and UCF101 using three VLMs: CLIP, X-CLIP, and ViFi-CLIP. The experiments were conducted under fully supervised and few-shot settings. Compared with the baseline multi-modal models without VgHIF, the proposed method improves Top-1 accuracy to varying degrees, and several groups of results are comparable with state-of-the-art performance, which strongly verifies the effectiveness of the proposed method.
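The abstract only summarizes the method, so the following is a minimal PyTorch sketch of the kind of attention-based interaction it describes: text tokens iteratively attend to visual features taken from several levels of the visual encoder. The module names (`VisualGuidedFusionBlock`, `HierarchicalIterativeFusion`), the residual update, and all dimensions are illustrative assumptions, not the authors' actual VgHIF implementation.

```python
import torch
import torch.nn as nn


class VisualGuidedFusionBlock(nn.Module):
    """One fusion step: text tokens attend to visual features from one encoder
    level. Illustrative sketch only, not the paper's actual VgHIF module."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Attention correlates language queries with visual keys/values.
        attended, _ = self.cross_attn(query=text, key=visual, value=visual)
        # Residual connection keeps the original language representation grounded.
        return self.norm(text + attended)


class HierarchicalIterativeFusion(nn.Module):
    """Iteratively fuses visual features from multiple encoder levels into the
    language representation, one level per fusion block."""

    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [VisualGuidedFusionBlock(dim) for _ in range(num_levels)]
        )

    def forward(self, text: torch.Tensor, visual_levels: list) -> torch.Tensor:
        # Each level of visual features refines the language representation in turn.
        for block, visual in zip(self.blocks, visual_levels):
            text = block(text, visual)
        return text


# Usage: 3 encoder levels, batch of 2, 77 text tokens, 196 visual tokens, dim 512.
fusion = HierarchicalIterativeFusion(dim=512, num_levels=3)
text_feats = torch.randn(2, 77, 512)
visual_levels = [torch.randn(2, 196, 512) for _ in range(3)]
fused = fusion(text_feats, visual_levels)
print(fused.shape)  # torch.Size([2, 77, 512])
```

Because the fusion blocks sit outside the frozen encoders and operate on features the VLM already produces, a design like this could be attached to different pre-trained backbones (e.g., CLIP variants) at modest extra cost, consistent with the flexibility the abstract claims.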
Journal introduction:
Pattern Recognition Letters aims at the rapid publication of concise articles of broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.