Text–video retrieval re-ranking via multi-grained cross attention and frozen image encoders

Pattern Recognition · Impact Factor 7.6 · JCR Q1 (Computer Science, Artificial Intelligence) · Published: 2024-11-01 · DOI: 10.1016/j.patcog.2024.111099
Zuozhuo Dai, Kaihui Cheng, Fangtao Shao, Zilong Dong, Siyu Zhu
Pattern Recognition, Volume 159, Article 111099 (2024). Citations: 0.

Abstract

State-of-the-art methods for text–video retrieval generally leverage CLIP embeddings and cosine similarity for efficient retrieval. Meanwhile, recent advancements in cross-attention techniques introduce transformer decoders to facilitate attention computation between text queries and visual tokens extracted from video frames, enabling a more comprehensive interaction between textual and visual information. In this study, we combine the advantages of both approaches and propose a fine-grained re-ranking approach incorporating a multi-grained text–video cross-attention module. Specifically, the re-ranker enhances the top-K similar candidates identified by the cosine similarity network. To explore video and text interactions efficiently, we introduce frame and video token selectors to obtain salient visual tokens at both frame and video levels. Then, a multi-grained cross-attention mechanism is applied between text and visual tokens at these levels to capture multimodal information. To reduce the training overhead associated with the multi-grained cross-attention module, we freeze the vision backbone and only train the multi-grained cross-attention module. This frozen strategy allows for scalability to larger pre-trained vision models such as ViT-G, leading to enhanced retrieval performance. Experimental evaluations on text–video retrieval datasets showcase the effectiveness and scalability of our proposed re-ranker combined with existing state-of-the-art methodologies.
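The two-stage pipeline the abstract describes — a fast cosine-similarity pass over all videos, followed by cross-attention re-scoring of only the top-K candidates with salient visual tokens — can be sketched as follows. This is a minimal numpy illustration, not the paper's architecture: the function names, the saliency heuristic (similarity of each visual token to the text embedding), and the single-head attention scorer are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_topk(text_emb, video_embs, k):
    """Stage 1: rank all videos by cosine similarity to the text embedding."""
    sims = video_embs @ text_emb
    sims = sims / (np.linalg.norm(video_embs, axis=1) * np.linalg.norm(text_emb) + 1e-8)
    return np.argsort(-sims)[:k]

def select_salient_tokens(tokens, text_emb, m):
    """Toy token selector: keep the m visual tokens most similar to the text query."""
    scores = tokens @ text_emb
    return tokens[np.argsort(-scores)[:m]]

def cross_attention_score(text_tokens, visual_tokens):
    """Single-head cross-attention: text tokens attend to visual tokens; the
    score is the mean cosine similarity between each text token and its
    attended visual feature."""
    d = text_tokens.shape[-1]
    logits = text_tokens @ visual_tokens.T / np.sqrt(d)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)
    attended = attn @ visual_tokens                        # (T, d)
    num = (text_tokens * attended).sum(-1)
    den = (np.linalg.norm(text_tokens, axis=-1)
           * np.linalg.norm(attended, axis=-1) + 1e-8)
    return float((num / den).mean())

def rerank(text_emb, text_tokens, video_embs, video_tokens, k, m):
    """Stage 2: re-score only the top-k candidates with cross-attention."""
    topk = cosine_topk(text_emb, video_embs, k)
    scores = [cross_attention_score(
                  text_tokens,
                  select_salient_tokens(video_tokens[i], text_emb, m))
              for i in topk]
    return topk[np.argsort(-np.asarray(scores))]

# Toy data: 100 videos, 32-dim embeddings, 20 visual tokens per video.
d, n_videos, n_tokens = 32, 100, 20
text_emb = rng.standard_normal(d)
text_tokens = rng.standard_normal((8, d))
video_embs = rng.standard_normal((n_videos, d))
video_tokens = rng.standard_normal((n_videos, n_tokens, d))
order = rerank(text_emb, text_tokens, video_embs, video_tokens, k=10, m=5)
```

In this sketch only the cross-attention scorer would be trained; the backbone producing `video_tokens` stays frozen (in a PyTorch implementation, `requires_grad=False` on the vision encoder), which is what makes the approach scale to large backbones such as ViT-G.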
Source journal: Pattern Recognition (Engineering, Electrical & Electronic)
CiteScore: 14.40
Self-citation rate: 16.20%
Articles per year: 683
Review time: 5.6 months
Journal introduction: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.
Latest articles in this journal:
- Discussion on "Interpretable medical deep framework by logits-constraint attention guiding graph-based multi-scale fusion for Alzheimer's disease analysis" by J. Xu, C. Yuan, X. Ma, H. Shang, X. Shi & X. Zhu (Pattern Recognition, vol. 152, 2024)
- 3D temporal-spatial convolutional LSTM network for assessing drug addiction treatment
- Pairwise joint symmetric uncertainty based on macro-neighborhood entropy for heterogeneous feature selection
- Low-rank fused modality assisted magnetic resonance imaging reconstruction via an anatomical variation adaptive transformer
- LCF3D: A robust and real-time late-cascade fusion framework for 3D object detection in autonomous driving