Joint Modality Synergy and Spatio-temporal Cue Purification for Moment Localization
Xingyu Shen, L. Lan, Huibin Tan, Xiang Zhang, X. Ma, Zhigang Luo
Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531396
Abstract
Currently, many approaches to the sentence-query-based moment localization (SQML) task emphasize (inter-)modality interaction between the video and the language query via transformer-based cross-attention or contrastive learning. However, they can still face two issues: 1) modality interaction can be unexpectedly friendly to modality-specific learning that merely learns modality-specific patterns, and 2) modality interaction easily confuses spatio-temporal cues and ultimately makes the time cues in the original video ambiguous. In this paper, we propose a modality synergy with spatio-temporal cue purification method (MS2P) for SQML to address these two issues. In particular, a conceptually simple modality synergy strategy is explored to keep features modality-specific while absorbing complementary information from the other modality, using both a carefully designed cross-attention unit and non-contrastive learning. As a result, modality-specific semantics can be calibrated progressively in a safer way. To preserve the time cues of the original video, we further purify the video representation into spatial and temporal parts with two proposed lightweight sentence-aware filtering operations, which enhances localization resolution. Experiments on the Charades-STA, TACoS, and ActivityNet Captions datasets show that our model outperforms state-of-the-art approaches by a large margin.
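To make the two ideas in the abstract concrete, the sketch below illustrates (i) a gated cross-attention unit that injects complementary query information into the video stream while keeping the result close to the original, modality-specific video features, and (ii) a lightweight sentence-aware filtering operation that re-weights time steps by their relevance to the query. This is a minimal PyTorch sketch under assumed module names, dimensions, and gating choices; it is not the paper's actual MS2P implementation.

```python
# Illustrative sketch only: module names (ModalitySynergyUnit, SentenceAwareFilter),
# shapes, and the gating scheme are assumptions, not the authors' code.
import torch
import torch.nn as nn


class ModalitySynergyUnit(nn.Module):
    """Cross-attention that absorbs complementary query information into the
    video features, with a learned gate keeping the output modality-specific."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # video: (B, T, D) clip features; query: (B, L, D) word features
        complement, _ = self.cross_attn(video, query, query)
        g = self.gate(torch.cat([video, complement], dim=-1))
        # Gated residual: take in only as much cross-modal information as needed.
        return self.norm(video + g * complement)


class SentenceAwareFilter(nn.Module):
    """Lightweight filtering that re-weights each time step of the video
    representation by its similarity to a pooled sentence embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, video: torch.Tensor, sentence: torch.Tensor) -> torch.Tensor:
        # video: (B, T, D); sentence: (B, D) pooled query embedding
        scores = torch.sigmoid(
            (self.proj(video) * sentence.unsqueeze(1)).sum(-1, keepdim=True)
        )
        return video * scores  # suppress time steps unrelated to the sentence


if __name__ == "__main__":
    B, T, L, D = 2, 32, 12, 256
    video, words = torch.randn(B, T, D), torch.randn(B, L, D)
    fused = ModalitySynergyUnit(D)(video, words)
    filtered = SentenceAwareFilter(D)(fused, words.mean(dim=1))
    print(filtered.shape)  # torch.Size([2, 32, 256])
```

The gated residual is one plausible way to read "keep features modality specific while absorbing the other modality's complementary information": when the gate saturates near zero, the video features pass through essentially unchanged.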