{"title":"Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention","authors":"Zeyu Xiong;Daizong Liu;Xiang Fang;Xiaoye Qu;Jianfeng Dong;Jiahao Zhu;Keke Tang;Pan Zhou","doi":"10.1109/TMM.2024.3453062","DOIUrl":null,"url":null,"abstract":"Video sentence grounding (VSG) is the task of identifying the segment of an untrimmed video that semantically corresponds to a given natural language query. While many existing methods extract frame-grained features using pre-trained 2D or 3D convolution networks, often fail to capture subtle differences between ambiguous adjacent frames. Although some recent approaches incorporate object-grained features using Faster R-CNN to capture more fine-grained details, they are still primarily based on feature enhancement and lack spatio-temporal modeling to explore the semantics of the core persons/objects. To solve the problem of modeling the core target's behavior, in this paper, we propose a new perspective for addressing the VSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal features. Specifically, we introduce the Video Sentence Tracker with Memory Network and Masked Attention (VSTMM), which comprises a cross-modal targets generator for producing multi-modal templates and search space, a memory-based tracker for dynamically tracking multi-modal targets using a memory network to record targets' behaviors, a masked attention localizer which learns local shared features between frames and eliminates interference from long-term dependencies, resulting in improved accuracy when localizing the moment. To evaluate the performance of our VSTMM, we conducted extensive experiments and comparisons with state-of-the-art methods on three challenging benchmarks, including Charades-STA, ActivityNet Captions, and TACoS. Without bells and whistles, our VSTMM achieves leading performance with a considerable real-time speed.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11204-11218"},"PeriodicalIF":8.4000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10663234/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Video sentence grounding (VSG) is the task of identifying the segment of an untrimmed video that semantically corresponds to a given natural language query. While many existing methods extract frame-grained features using pre-trained 2D or 3D convolutional networks, they often fail to capture subtle differences between ambiguous adjacent frames. Although some recent approaches incorporate object-grained features extracted with Faster R-CNN to capture more fine-grained details, they are still primarily based on feature enhancement and lack the spatio-temporal modeling needed to explore the semantics of the core persons/objects. To address the problem of modeling the core target's behavior, in this paper we propose a new perspective on the VSG task: tracking pivotal objects and activities to learn more fine-grained spatio-temporal features. Specifically, we introduce the Video Sentence Tracker with Memory Network and Masked Attention (VSTMM), which comprises a cross-modal targets generator that produces multi-modal templates and the search space, a memory-based tracker that dynamically tracks multi-modal targets using a memory network to record targets' behaviors, and a masked attention localizer that learns local shared features between frames and eliminates interference from long-term dependencies, improving accuracy when localizing the moment. To evaluate the performance of our VSTMM, we conducted extensive experiments and comparisons with state-of-the-art methods on three challenging benchmarks: Charades-STA, ActivityNet Captions, and TACoS. Without bells and whistles, our VSTMM achieves leading performance at considerable real-time speed.
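The masked attention localizer described in the abstract restricts attention to nearby frames so that long-term dependencies do not interfere with moment localization. The sketch below shows one plausible way such masking could be realized with a banded attention mask in PyTorch; the module structure, window size, and dimensions are illustrative assumptions, not the authors' actual implementation.

```python
# A minimal, hypothetical sketch of local "masked attention" over frame features:
# each frame attends only to frames within a fixed temporal window, blocking
# long-range interference. All names and hyperparameters here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def local_attention_mask(num_frames: int, window: int) -> torch.Tensor:
    """Boolean mask that is True where attention is blocked (|i - j| > window)."""
    idx = torch.arange(num_frames)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist > window


class LocalMaskedAttention(nn.Module):
    """Single-head self-attention over frame features with a banded (local) mask."""

    def __init__(self, dim: int, window: int):
        super().__init__()
        self.window = window
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim)
        q, k, v = self.q(frames), self.k(frames), self.v(frames)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        mask = local_attention_mask(frames.size(1), self.window).to(frames.device)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, v)


if __name__ == "__main__":
    frames = torch.randn(2, 64, 256)            # 2 clips, 64 frames, 256-d features
    layer = LocalMaskedAttention(dim=256, window=4)
    out = layer(frames)
    print(out.shape)                            # torch.Size([2, 64, 256])
```

The banded mask is only one way to emphasize local shared features between frames; the paper's localizer may combine this with the cross-modal templates and memory-based tracking it describes.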
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.