TC-MGC: Text-conditioned multi-grained contrastive learning for text–video retrieval

IF 15.5 · JCR Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Information Fusion · Pub Date: 2025-09-01 · Epub Date: 2025-04-05 · DOI: 10.1016/j.inffus.2025.103151
Xiaolun Jing, Genke Yang, Jian Chu
Journal: Information Fusion, Volume 121, Article 103151.
Citations: 0

Abstract

Motivated by the success of coarse-grained and fine-grained contrast in text–video retrieval, multi-grained contrastive learning methods have emerged that integrate contrasts of different granularity. However, because videos have a wider semantic range than texts, text-agnostic video representations may encode misleading information not described in the text, impeding the model from capturing precise cross-modal semantic correspondence. To this end, we propose a Text-Conditioned Multi-Grained Contrast framework, dubbed TC-MGC. Specifically, our model employs a language–video attention block to generate aggregated frame and video representations conditioned on each word's and the text's attention weights over frames. To filter unnecessary similarity interactions and reduce trainable parameters in the Interactive Similarity Aggregation (ISA) module, we design a Similarity Reorganization (SR) module that identifies attentive similarities and reorganizes cross-modal similarity vectors and matrices. Next, we argue that imbalance among the multi-grained similarities may cause over- and under-representation issues. We therefore introduce an auxiliary Similarity Decorrelation Regularization (SDR) loss that encourages cooperative use of the similarities by minimizing their variance on matched text–video pairs. Finally, we present a Linear Softmax Aggregation (LSA) module to explicitly encourage interaction between the multiple similarities and promote the use of multi-grained information. Empirically, TC-MGC achieves competitive results on multiple text–video retrieval benchmarks, outperforming the X-CLIP model by +2.8% (+1.3%), +2.2% (+1.0%), and +1.5% (+0.9%) relative (absolute) improvements in text-to-video retrieval R@1 on MSR-VTT, DiDeMo and VATEX, respectively. Our code is publicly available at https://github.com/JingXiaolun/TC-MGC.
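The three core ideas in the abstract — text-conditioned attention pooling over frames, softmax-weighted aggregation of multi-grained similarities (LSA), and variance minimization over those similarities (SDR) — can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the GitHub link above); the function names, the temperature `tau`, and the dot-product attention are illustrative assumptions only:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def text_conditioned_video(frames, text, tau=0.07):
    """Pool frame embeddings into one video vector using the text's
    attention weights over frames (illustrative dot-product attention,
    not the paper's exact language-video block)."""
    scores = [sum(f * t for f, t in zip(frame, text)) / tau for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)]

def lsa_aggregate(similarities, learned_weights):
    """Linear Softmax Aggregation sketch: combine multi-grained
    similarity scores with softmax-normalized learned weights."""
    w = softmax(learned_weights)
    return sum(wi * si for wi, si in zip(w, similarities))

def sdr_loss(similarities):
    """Similarity Decorrelation Regularization sketch: the variance of
    the multi-grained similarities for one matched text-video pair;
    minimizing it pushes the granularities to agree."""
    mu = sum(similarities) / len(similarities)
    return sum((s - mu) ** 2 for s in similarities) / len(similarities)

# Example: with a low temperature, attention concentrates on the frame
# most aligned with the text, so the pooled video vector tracks it.
video = text_conditioned_video([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], tau=0.01)
```

Note that when all granularities report the same similarity, `sdr_loss` is zero and `lsa_aggregate` returns that shared value regardless of the weights, which is the cooperative regime the SDR loss is meant to encourage.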
Source journal: Information Fusion (Engineering/Technology — Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Articles per year: 161
Review time: 7.9 months
About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.