Enhancing Text-Video Retrieval Performance With Low-Salient but Discriminative Objects

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society; IF 13.7; Pub Date: 2025-01-15; DOI: 10.1109/TIP.2025.3527369

Yanwei Zheng; Bowen Huang; Zekai Chen; Dongxiao Yu

Journal Article. IEEE Transactions on Image Processing, vol. 34, pp. 581-593. Publication date: 2025-01-15. Open access: no. URL: https://ieeexplore.ieee.org/document/10841928/

Citations: 0

Abstract

Text-video retrieval aims to establish a matching relationship between a video and its corresponding text. However, previous works have primarily focused on salient video subjects, such as humans or animals, often overlooking Low-Salient but Discriminative Objects (LSDOs) that play a critical role in understanding content. To address this limitation, we propose a novel model that enhances retrieval performance by emphasizing these overlooked elements across video and text modalities. In the video modality, our model first incorporates a feature selection module to gather video-level LSDO features, and applies cross-modal attention to assign frame-specific weights based on relevance, yielding frame-level LSDO features. In the text modality, text-level LSDO features are captured by generating multiple object prototypes in a sparse aggregation manner. Extensive experiments on benchmark datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, demonstrate that our model achieves state-of-the-art results across various evaluation metrics.
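The frame-weighting step mentioned in the abstract (cross-modal attention that assigns frame-specific weights by relevance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dot-product similarity, the softmax normalization, and the toy feature values are all assumptions chosen for illustration, and the paper's actual module may differ.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_weighted_frame_pooling(text_vec, frame_vecs, temperature=1.0):
    """Hypothetical helper: weight each frame by its similarity to the
    text embedding, then pool frame features with those weights."""
    scores = [dot(text_vec, frame) for frame in frame_vecs]
    weights = softmax(scores, temperature)
    dim = len(frame_vecs[0])
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, frame_vecs))
        for i in range(dim)
    ]
    return weights, pooled


# Toy input (made-up numbers): one text embedding and three frame features.
text = [1.0, 0.0]
frames = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, pooled = text_weighted_frame_pooling(text, frames)
print(weights, pooled)
```

In this toy case the first frame is the only one aligned with the text vector, so it receives the largest weight and dominates the pooled feature; lowering `temperature` would sharpen that preference.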




Source journal

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Self-citation rate

0.00%

Number of articles published