Phrase-level Prediction for Video Temporal Localization

Proceedings of the 2022 International Conference on Multimedia Retrieval Pub Date : 2022-06-27 DOI:10.1145/3512527.3531382

Sizhe Li, C. Li, Minghang Zheng, Yang Liu

{"title":"Phrase-level Prediction for Video Temporal Localization","authors":"Sizhe Li, C. Li, Minghang Zheng, Yang Liu","doi":"10.1145/3512527.3531382","DOIUrl":null,"url":null,"abstract":"Video temporal localization aims to locate a period that semantically matches a natural language query in a given untrimmed video. We empirically observe that although existing approaches gain steady progress on sentence localization, the performance of phrase localization is far from satisfactory. In principle, the phrase should be easier to localize as fewer combinations of visual concepts need to be considered; such incapability indicates that the existing models only capture the sentence annotation bias in the benchmark but lack sufficient understanding of the intrinsic relationship between simple visual and language concepts, thus the model generalization and interpretability is questioned. This paper proposes a unified framework that can deal with both sentence and phrase-level localization, namely Phrase Level Prediction Net (PLPNet). Specifically, based on the hypothesis that similar phrases tend to focus on similar video cues, while dissimilar ones should not, we build a contrastive mechanism to restrain phrase-level localization without fine-grained phrase boundary annotation required in training. Moreover, considering the sentence's flexibility and wide discrepancy among phrases, we propose a clustering-based batch sampler to ensure that contrastive learning can be conducted efficiently. Extensive experiments demonstrate that our method surpasses state-of-the-art methods of phrase-level temporal localization while maintaining high performance in sentence localization and boosting the model's interpretability and generalization capability. Our code is available at https://github.com/sizhelee/PLPNet.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3512527.3531382","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

Abstract

Video temporal localization aims to locate a period that semantically matches a natural language query in a given untrimmed video. We empirically observe that although existing approaches gain steady progress on sentence localization, the performance of phrase localization is far from satisfactory. In principle, the phrase should be easier to localize as fewer combinations of visual concepts need to be considered; such incapability indicates that the existing models only capture the sentence annotation bias in the benchmark but lack sufficient understanding of the intrinsic relationship between simple visual and language concepts, thus the model generalization and interpretability is questioned. This paper proposes a unified framework that can deal with both sentence and phrase-level localization, namely Phrase Level Prediction Net (PLPNet). Specifically, based on the hypothesis that similar phrases tend to focus on similar video cues, while dissimilar ones should not, we build a contrastive mechanism to restrain phrase-level localization without fine-grained phrase boundary annotation required in training. Moreover, considering the sentence's flexibility and wide discrepancy among phrases, we propose a clustering-based batch sampler to ensure that contrastive learning can be conducted efficiently. Extensive experiments demonstrate that our method surpasses state-of-the-art methods of phrase-level temporal localization while maintaining high performance in sentence localization and boosting the model's interpretability and generalization capability. Our code is available at https://github.com/sizhelee/PLPNet.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

视频时间定位的短语级预测

视频时间定位的目的是在给定的未修剪视频中定位语义上与自然语言查询匹配的时间段。我们通过经验观察到，虽然现有的方法在句子定位方面取得了稳步的进展，但在短语定位方面的表现却不尽人意。原则上，短语应该更容易本地化，因为需要考虑的视觉概念组合更少;这说明现有模型只捕获了基准中的句子标注偏差，而对简单的视觉概念和语言概念之间的内在关系缺乏足够的理解，从而使模型的泛化和可解释性受到质疑。本文提出了一种既能处理句子级定位又能处理短语级定位的统一框架，即短语级预测网(PLPNet)。具体来说，基于相似短语倾向于关注相似的视频线索，而不相似短语不应该关注的假设，我们构建了一种对比机制来约束短语级定位，而不需要训练中需要的细粒度短语边界标注。此外，考虑到句子的灵活性和短语之间的巨大差异，我们提出了一种基于聚类的批处理采样器，以确保有效地进行对比学习。大量的实验表明，我们的方法超越了目前最先进的短语级时间定位方法，同时保持了句子定位的高性能，提高了模型的可解释性和泛化能力。我们的代码可在https://github.com/sizhelee/PLPNet上获得。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proceedings of the 2022 International Conference on Multimedia Retrieval

自引率

0.00%

发文量