Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention

Bin Jiang, Xin Huang, Chao Yang, Junsong Yuan
{"title":"基于空间和语言时间注意的跨模态视频时刻检索","authors":"Bin Jiang, Xin Huang, Chao Yang, Junsong Yuan","doi":"10.1145/3323873.3325019","DOIUrl":null,"url":null,"abstract":"Given an untrimmed video and a description query, temporal moment retrieval aims to localize the temporal segment within the video that best describes the textual query. Existing studies predominantly employ coarse frame-level features as the visual representation, obfuscating the specific details which may provide critical cues for localizing the desired moment. We propose a SLTA (short for \"Spatial and Language-Temporal Attention\") method to address the detail missing issue. Specifically, the SLTA method takes advantage of object-level local features and attends to the most relevant local features (e.g., the local features \"girl\", \"cup\") by spatial attention. Then we encode the sequence of local features on consecutive frames to capture the interaction information among these objects (e.g., the interaction \"pour\" involving these two objects). Meanwhile, a language-temporal attention is utilized to emphasize the keywords based on moment context information. Therefore, our proposed two attention sub-networks can recognize the most relevant objects and interactions in the video, and simultaneously highlight the keywords in the query. Extensive experiments on TACOS, Charades-STA and DiDeMo datasets demonstrate the effectiveness of our model as compared to state-of-the-art methods.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"61","resultStr":"{\"title\":\"Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention\",\"authors\":\"Bin Jiang, Xin Huang, Chao Yang, Junsong Yuan\",\"doi\":\"10.1145/3323873.3325019\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Given an untrimmed video and a description query, temporal moment retrieval aims to localize the temporal segment within the video that best describes the textual query. Existing studies predominantly employ coarse frame-level features as the visual representation, obfuscating the specific details which may provide critical cues for localizing the desired moment. We propose a SLTA (short for \\\"Spatial and Language-Temporal Attention\\\") method to address the detail missing issue. Specifically, the SLTA method takes advantage of object-level local features and attends to the most relevant local features (e.g., the local features \\\"girl\\\", \\\"cup\\\") by spatial attention. Then we encode the sequence of local features on consecutive frames to capture the interaction information among these objects (e.g., the interaction \\\"pour\\\" involving these two objects). Meanwhile, a language-temporal attention is utilized to emphasize the keywords based on moment context information. Therefore, our proposed two attention sub-networks can recognize the most relevant objects and interactions in the video, and simultaneously highlight the keywords in the query. 
Extensive experiments on TACOS, Charades-STA and DiDeMo datasets demonstrate the effectiveness of our model as compared to state-of-the-art methods.\",\"PeriodicalId\":149041,\"journal\":{\"name\":\"Proceedings of the 2019 on International Conference on Multimedia Retrieval\",\"volume\":\"44 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-06-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"61\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2019 on International Conference on Multimedia Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3323873.3325019\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3323873.3325019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 61

Abstract

Given an untrimmed video and a description query, temporal moment retrieval aims to localize the temporal segment within the video that best matches the textual query. Existing studies predominantly employ coarse frame-level features as the visual representation, obscuring the specific details that may provide critical cues for localizing the desired moment. We propose an SLTA ("Spatial and Language-Temporal Attention") method to address this missing-detail issue. Specifically, SLTA exploits object-level local features and attends to the most relevant ones (e.g., the local features "girl" and "cup") through spatial attention. The sequence of attended local features on consecutive frames is then encoded to capture the interaction information among these objects (e.g., the interaction "pour" involving the two objects). Meanwhile, a language-temporal attention module emphasizes the keywords in the query based on moment context information. The two attention sub-networks therefore recognize the most relevant objects and interactions in the video while simultaneously highlighting the keywords in the query. Extensive experiments on the TACoS, Charades-STA and DiDeMo datasets demonstrate the effectiveness of our model compared to state-of-the-art methods.
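The abstract describes the architecture only at a high level, so the PyTorch sketch below is an assumption-laden illustration of how the two attention sub-networks could be wired together, not the authors' implementation. The additive-attention formulation, the GRU interaction encoder, and every module name and dimension (SpatialAttention, LanguageTemporalAttention, interaction_encoder, the 2048/512/300-d features) are hypothetical choices made for the sake of a runnable example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttention(nn.Module):
    """Attend over object-level local features within a frame, conditioned on the query."""

    def __init__(self, obj_dim: int, query_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.proj_obj = nn.Linear(obj_dim, hidden_dim)
        self.proj_query = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, obj_feats: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, obj_dim); query_feat: (batch, query_dim)
        h = torch.tanh(self.proj_obj(obj_feats) + self.proj_query(query_feat).unsqueeze(1))
        weights = F.softmax(self.score(h), dim=1)   # which objects matter for this query
        return (weights * obj_feats).sum(dim=1)     # (batch, obj_dim)


class LanguageTemporalAttention(nn.Module):
    """Re-weight the query words using the candidate moment's visual context."""

    def __init__(self, word_dim: int, moment_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.proj_word = nn.Linear(word_dim, hidden_dim)
        self.proj_moment = nn.Linear(moment_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, word_feats: torch.Tensor, moment_feat: torch.Tensor) -> torch.Tensor:
        # word_feats: (batch, num_words, word_dim); moment_feat: (batch, moment_dim)
        h = torch.tanh(self.proj_word(word_feats) + self.proj_moment(moment_feat).unsqueeze(1))
        weights = F.softmax(self.score(h), dim=1)   # which keywords matter for this moment
        return (weights * word_feats).sum(dim=1)    # (batch, word_dim)


# Interactions among attended objects across consecutive frames (e.g., "pour") can be
# captured by running a recurrent encoder over the per-frame attended features.
interaction_encoder = nn.GRU(input_size=2048, hidden_size=512, batch_first=True)

spatial_att = SpatialAttention(obj_dim=2048, query_dim=512)
lang_att = LanguageTemporalAttention(word_dim=300, moment_dim=512)

objs = torch.randn(2, 16, 8, 2048)     # 2 candidate moments, 16 frames, 8 objects per frame
query = torch.randn(2, 512)            # sentence-level query embedding
words = torch.randn(2, 12, 300)        # word-level query embeddings

per_frame = torch.stack(
    [spatial_att(objs[:, t], query) for t in range(objs.size(1))], dim=1
)                                      # (2, 16, 2048)
_, moment_feat = interaction_encoder(per_frame)        # final hidden state summarizes the moment
weighted_query = lang_att(words, moment_feat.squeeze(0))
```

In a full model the moment representation and the keyword-weighted query would then be fused and scored against each candidate temporal segment; the abstract does not specify that matching function, so it is omitted from this sketch.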