Joint Modality Synergy and Spatio-temporal Cue Purification for Moment Localization

Xingyu Shen, L. Lan, Huibin Tan, Xiang Zhang, X. Ma, Zhigang Luo
{"title":"Joint Modality Synergy and Spatio-temporal Cue Purification for Moment Localization","authors":"Xingyu Shen, L. Lan, Huibin Tan, Xiang Zhang, X. Ma, Zhigang Luo","doi":"10.1145/3512527.3531396","DOIUrl":null,"url":null,"abstract":"Currently, many approaches to the sentence query based moment location (SQML) task emphasize (inter-)modality interaction between video and language query via transformer-based cross-attention or contrastive learning. However, they could still face two issues: 1) modality interaction could be unexpectedly friendly to modality specific learning that merely learns modality specific patterns, and 2) modality interaction easily confuses spatio-temporal cues and ultimately makes time cues in the original video ambiguous. In this paper, we propose a modality synergy with spatio-temporal cue purification method (MS2P) for SQML to address the above two issues. Particularly, a conceptually simple modality synergy strategy is explored to keep features modality specific while absorbing the other modality complementary information with both carefully designed cross-attention unit and non-contrastive learning. As a result, modality specific semantics can be calibrated progressively in a safer way. To preserve time cues in original video, we further purify video representation into spatial and temporal parts to enhance localization resolution by the proposed two light-weight sentence-aware filtering operations. Experiments on Charades-STA, TACoS, and ActivityNet Caption datasets show our model outperforms the state-of-the-art approaches by a large margin.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"54 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3512527.3531396","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Currently, many approaches to the sentence-query-based moment localization (SQML) task emphasize (inter-)modality interaction between video and language query via transformer-based cross-attention or contrastive learning. However, they can still face two issues: 1) modality interaction can unintentionally favor modality-specific learning that merely captures modality-specific patterns, and 2) modality interaction easily confuses spatio-temporal cues and ultimately renders the time cues of the original video ambiguous. In this paper, we propose a modality synergy with spatio-temporal cue purification method (MS2P) for SQML to address these two issues. In particular, a conceptually simple modality synergy strategy is explored to keep features modality-specific while absorbing complementary information from the other modality, using both a carefully designed cross-attention unit and non-contrastive learning. As a result, modality-specific semantics can be calibrated progressively in a safer way. To preserve the time cues of the original video, we further purify the video representation into spatial and temporal parts with two proposed lightweight sentence-aware filtering operations, enhancing localization resolution. Experiments on the Charades-STA, TACoS, and ActivityNet Captions datasets show that our model outperforms state-of-the-art approaches by a large margin.
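The abstract gives only a high-level picture of the two components. As a rough illustration of the ideas described, the PyTorch sketch below shows one plausible reading: a gated cross-attention unit that keeps the residual stream modality-specific while absorbing complementary information from the other modality, and a sentence-aware filter that re-weights frame features by the query. All class names, the sigmoid gating design, and the channel re-weighting are illustrative assumptions, not the paper's actual MS2P architecture; consult the paper (DOI above) for the real design.

```python
import torch
import torch.nn as nn


class ModalitySynergyUnit(nn.Module):
    """Gated cross-attention (illustrative): the residual stream stays
    dominated by the modality's own features, while a learned gate bounds
    how much complementary information from the other modality is absorbed."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, own, other):
        # own:   (B, N, D) features of the modality being calibrated
        # other: (B, M, D) features of the complementary modality
        absorbed, _ = self.cross_attn(own, other, other)
        g = self.gate(torch.cat([own, absorbed], dim=-1))  # per-token gate in (0, 1)
        return self.norm(own + g * absorbed)               # gated residual update


class SentenceAwareFilter(nn.Module):
    """Sentence-conditioned channel re-weighting of frame features
    (illustrative); one instance per branch would yield separate
    'spatial' and 'temporal' purified streams."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_weights = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, video, sentence):
        # video:    (B, T, D) frame-level features
        # sentence: (B, D)    pooled query embedding
        w = self.to_weights(sentence).unsqueeze(1)         # (B, 1, D)
        return video * w


# Toy usage with random features (shapes are hypothetical)
B, T, L, D = 2, 64, 12, 256
video, query = torch.randn(B, T, D), torch.randn(B, L, D)
video = ModalitySynergyUnit(D)(video, query)               # video absorbs query cues
q_pooled = query.mean(dim=1)
spatial = SentenceAwareFilter(D)(video, q_pooled)          # one filter per branch
temporal = SentenceAwareFilter(D)(video, q_pooled)
```

In this reading, the sigmoid gate keeps the cross-modal contribution bounded so the updated features remain modality-specific, which is one way to realize the "safer", progressive calibration the abstract describes.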