{"title":"用于视频中时态句子接地的多级交互网络","authors":"Guangli Wu, Zhijun Yang, Jing Zhang","doi":"10.3233/jifs-234800","DOIUrl":null,"url":null,"abstract":"Temporal sentence grounding in videos (TSGV), which aims to retrieve video segments from an untrimmed videos that semantically match a given query. Most previous methods focused on learning either local or global query features and then performed cross-modal interaction, but ignore the complementarity between local and global features. In this paper, we propose a novel Multi-Level Interaction Network for Temporal Sentence Grounding in Videos. This network explores the semantics of queries at both phrase and sentence levels, interacting phrase-level features with video features to highlight video segments relevant to the query phrase and sentence-level features with video features to learn more about global localization information. A stacked fusion gate module is designed, which effectively captures the temporal relationships and semantic information among video segments. This module also introduces a gating mechanism to enable the model to adaptively regulate the fusion degree of video features and query features, further improving the accuracy of predicting the target segments. Extensive experiments on the ActivityNet Captions and Charades-STA benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art methods.","PeriodicalId":509313,"journal":{"name":"Journal of Intelligent & Fuzzy Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Level interaction network for temporal sentence grounding in videos\",\"authors\":\"Guangli Wu, Zhijun Yang, Jing Zhang\",\"doi\":\"10.3233/jifs-234800\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Temporal sentence grounding in videos (TSGV), which aims to retrieve video segments from an untrimmed videos that semantically match a given query. Most previous methods focused on learning either local or global query features and then performed cross-modal interaction, but ignore the complementarity between local and global features. In this paper, we propose a novel Multi-Level Interaction Network for Temporal Sentence Grounding in Videos. This network explores the semantics of queries at both phrase and sentence levels, interacting phrase-level features with video features to highlight video segments relevant to the query phrase and sentence-level features with video features to learn more about global localization information. A stacked fusion gate module is designed, which effectively captures the temporal relationships and semantic information among video segments. This module also introduces a gating mechanism to enable the model to adaptively regulate the fusion degree of video features and query features, further improving the accuracy of predicting the target segments. 
Extensive experiments on the ActivityNet Captions and Charades-STA benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art methods.\",\"PeriodicalId\":509313,\"journal\":{\"name\":\"Journal of Intelligent & Fuzzy Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-03-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Intelligent & Fuzzy Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3233/jifs-234800\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Intelligent & Fuzzy Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3233/jifs-234800","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Temporal sentence grounding in videos (TSGV) aims to retrieve the segments of an untrimmed video that semantically match a given natural-language query. Most previous methods learn either local or global query features before performing cross-modal interaction, ignoring the complementarity between the two. In this paper, we propose a novel Multi-Level Interaction Network for temporal sentence grounding in videos. The network explores query semantics at both the phrase and sentence levels: phrase-level features interact with video features to highlight the video segments relevant to each query phrase, while sentence-level features interact with video features to capture global localization information. We also design a stacked fusion gate module that effectively captures the temporal relationships and semantic information among video segments. The module introduces a gating mechanism that lets the model adaptively regulate how strongly video and query features are fused, further improving the accuracy of target-segment prediction. Extensive experiments on the ActivityNet Captions and Charades-STA benchmark datasets demonstrate that the proposed method outperforms state-of-the-art methods.
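The abstract gives no implementation details, but the gated fusion idea it describes can be illustrated with a minimal sketch. The PyTorch module below is a hypothetical rendition of one fusion gate: the names (`FusionGate`, `d_model`), the concatenate-then-sigmoid gating, and the two-layer stacking in the usage example are all our assumptions for illustration, not the authors' actual architecture.

```python
# A minimal, hypothetical sketch of a gated cross-modal fusion step,
# loosely inspired by the "stacked fusion gate" described in the abstract.
# All names and design choices here are illustrative assumptions.
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Gate and fusion projections computed from the concatenated
        # video-segment and query features.
        self.gate = nn.Linear(2 * d_model, d_model)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # video: (batch, n_segments, d_model) segment features
        # query: (batch, d_model) sentence-level query feature
        q = query.unsqueeze(1).expand_as(video)      # broadcast over segments
        joint = torch.cat([video, q], dim=-1)        # (batch, n_segments, 2*d_model)
        g = torch.sigmoid(self.gate(joint))          # per-dimension fusion degree
        fused = torch.tanh(self.proj(joint))
        # The gate adaptively mixes cross-modal features with the original
        # video features, regulating the degree of fusion.
        return g * fused + (1.0 - g) * video

# Usage sketch: two stacked gates over 8 video segments.
if __name__ == "__main__":
    gates = nn.ModuleList([FusionGate(256) for _ in range(2)])
    v = torch.randn(4, 8, 256)   # (batch, segments, dim) video features
    s = torch.randn(4, 256)      # sentence-level query feature
    for gate in gates:
        v = gate(v, s)
    print(v.shape)               # torch.Size([4, 8, 256])
```

Stacking such gates lets later layers re-weight segments conditioned on already-fused features, which is one plausible reading of how a stacked design could capture temporal relationships among segments.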