Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval

IF 11.1 · CAS Q1 (Engineering & Technology) · JCR Q1 (Engineering, Electrical & Electronic) · IEEE Transactions on Circuits and Systems for Video Technology · Pub Date: 2024-07-03 · DOI: 10.1109/TCSVT.2024.3422869
Huakai Lai, Wenfei Yang, Tianzhu Zhang, Yongdong Zhang
{"title":"Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval","authors":"Huakai Lai;Wenfei Yang;Tianzhu Zhang;Yongdong Zhang","doi":"10.1109/TCSVT.2024.3422869","DOIUrl":null,"url":null,"abstract":"Video-Text Retrieval is a fundamental task in multi-modal understanding and has attracted increasing attention from both academia and industry communities in recent years. Generally, video inherently contains multi-grained semantic and each video corresponds to several different texts, which is challenging. Previous best-performing methods adopt video-sentence, phrase-phrase, and frame-word interactions simultaneously. Different from word/frame features that can be obtained directly, phrase features need to be adaptively aggregated from correlative word/frame features, which makes it very demanding. However, existing method utilizes simple intra-modal self-attention to generate phrase features without considering the following three aspects: cross-modality semantic correlation, phrase generation noise and diversity. In this paper, we propose a novel Reliable Phrase Mining model (RPM) to construct reliable phrase features and conduct hierarchical cross-modal interactions for video-text retrieval. The proposed RPM model enjoys several merits. Firstly, to guarantee the semantic consistency between video phrases and text phrases, we propose a set of modality-shared prototypes as the joint query to aggregate the semantically related frame/word features into adaptive-grained phrase features. Secondly, to deal with the phrase generation noise, the proposed denoised decoder module is responsible for obtaining more reliable similarity between prototypes and frame/word features. Specifically, not only the correlation between frame/word features and prototypes, but also the correlation among prototypes, should be taken into account when calculating the similarity. Furthermore, to encourage different prototypes to focus on different semantic information, we design a prototype contrastive loss whose core idea is enabling phrases produced by the same prototype to be more similar than those produced by different prototypes. Extensive experiment results demonstrate that the proposed method performs favorably on three benchmark datasets, including MSR-VTT, MSVD, and LSMDC.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 11","pages":"12019-12031"},"PeriodicalIF":11.1000,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10583931/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Video-text retrieval is a fundamental task in multi-modal understanding and has attracted increasing attention from both academia and industry in recent years. Videos inherently contain multi-grained semantics, and each video corresponds to several different texts, which makes the task challenging. Previous best-performing methods adopt video-sentence, phrase-phrase, and frame-word interactions simultaneously. Unlike word/frame features, which can be obtained directly, phrase features must be adaptively aggregated from correlated word/frame features, which is demanding. However, existing methods use simple intra-modal self-attention to generate phrase features without considering three aspects: cross-modal semantic correlation, phrase-generation noise, and phrase diversity. In this paper, we propose a novel Reliable Phrase Mining (RPM) model to construct reliable phrase features and conduct hierarchical cross-modal interactions for video-text retrieval. The proposed RPM model has several merits. First, to guarantee semantic consistency between video phrases and text phrases, we propose a set of modality-shared prototypes that serve as the joint query to aggregate semantically related frame/word features into adaptive-grained phrase features. Second, to deal with phrase-generation noise, the proposed denoised decoder module obtains a more reliable similarity between prototypes and frame/word features; specifically, when calculating this similarity, not only the correlation between frame/word features and prototypes but also the correlation among prototypes is taken into account. Third, to encourage different prototypes to focus on different semantic information, we design a prototype contrastive loss whose core idea is to make phrases produced by the same prototype more similar than those produced by different prototypes. Extensive experimental results demonstrate that the proposed method performs favorably on three benchmark datasets: MSR-VTT, MSVD, and LSMDC.
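To make the three components concrete, here is a minimal PyTorch sketch of (i) modality-shared prototypes used as the joint query to aggregate frame/word features into phrase features, (ii) a prototype-feature similarity that also accounts for the correlation among prototypes, and (iii) a prototype contrastive loss. All class names, dimensions, and the exact attention and denoising formulas below are illustrative assumptions based only on the abstract, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypePhraseAggregator(nn.Module):
    """Aggregate frame (video) or word (text) features into phrase features,
    using a set of modality-shared prototypes as the joint query (sketch)."""

    def __init__(self, dim=512, num_prototypes=8):
        super().__init__()
        # Shared between the video and text branches so that the resulting
        # video phrases and text phrases stay semantically aligned.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: (B, N, D) -- frame features or word features.
        q = self.q_proj(self.prototypes)                      # (P, D)
        k = self.k_proj(feats)                                # (B, N, D)
        logits = torch.einsum("pd,bnd->bpn", q, k) / q.shape[-1] ** 0.5

        # Toy "denoised" similarity: combine the per-prototype attention over
        # features with a softmax over prototypes, so each feature commits to
        # the prototypes that explain it best (one plausible reading of the
        # abstract), then renormalize over features.
        attn = logits.softmax(dim=-1) * logits.softmax(dim=1)
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

        return torch.einsum("bpn,bnd->bpd", attn, feats)      # (B, P, D)


def prototype_contrastive_loss(video_phrases, text_phrases, tau=0.07):
    """For a matched video-text pair, phrases produced by the same prototype
    should be more similar than phrases produced by different prototypes."""
    v = F.normalize(video_phrases, dim=-1)                    # (B, P, D)
    t = F.normalize(text_phrases, dim=-1)                     # (B, P, D)
    sim = torch.einsum("bpd,bqd->bpq", v, t) / tau            # (B, P, P)
    # Row b*P + p of the flattened similarity should pick prototype p.
    target = torch.arange(sim.size(1), device=sim.device).repeat(sim.size(0))
    loss_v2t = F.cross_entropy(sim.flatten(0, 1), target)
    loss_t2v = F.cross_entropy(sim.transpose(1, 2).flatten(0, 1), target)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    agg = PrototypePhraseAggregator()
    frames = torch.randn(2, 12, 512)   # 2 videos, 12 frames each
    words = torch.randn(2, 20, 512)    # 2 captions, 20 tokens each
    print(prototype_contrastive_loss(agg(frames), agg(words)).item())
```

In this reading, sharing `self.prototypes` across the video and text branches is what ties video phrases and text phrases to the same semantic slots, and the contrastive loss keeps those slots from collapsing onto the same semantics.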
Source Journal: IEEE Transactions on Circuits and Systems for Video Technology
CiteScore: 13.80
Self-citation rate: 27.40%
Articles per year: 660
Review time: 5 months
About the Journal

The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued.
Latest Articles in This Journal

TinySplat: Feedforward Approach for Generating Compact 3D Scene Representation
GSCodec Studio: A Modular Framework for Gaussian Splat Compression
Syntax Element Encryption for H.265/HEVC Using Chaotic Map-Based Coefficient Scrambling Scheme
Learning Confidence-Aware Prototypes for Weakly-Supervised Video Anomaly Detection
Learned Point Cloud Attribute Compression With Cross-Scale Point Transformer and Geometry-Aware Context Prediction Entropy Model