{"title":"Dual Semantic Reconstruction Network for Weakly Supervised Temporal Sentence Grounding","authors":"Kefan Tang;Lihuo He;Nannan Wang;Xinbo Gao","doi":"10.1109/TMM.2024.3521676","DOIUrl":null,"url":null,"abstract":"Weakly supervised temporal sentence grounding aims to identify semantically relevant video moments in an untrimmed video corresponding to a given sentence query without exact timestamps. Neuropsychology research indicates that the way the human brain handles information varies based on the grammatical categories of words, highlighting the importance of separately considering nouns and verbs. However, current methodologies primarily utilize pre-extracted video features to reconstruct randomly masked queries, neglecting the distinction between grammatical classes. This oversight could hinder forming meaningful connections between linguistic elements and the corresponding components in the video. To address this limitation, this paper introduces the dual semantic reconstruction network (DSRN) model. DSRN processes video features by distinctly correlating object features with nouns and motion features with verbs, thereby mimicking the human brain's parsing mechanism. It begins with a feature disentanglement module that separately extracts object-aware and motion-aware features from video content. Then, in a dual-branch structure, these disentangled features are used to generate separate proposals for objects and motions through two dedicated proposal generation modules. A consistency constraint is proposed to ensure a high level of agreement between the boundaries of object-related and motion-related proposals. Subsequently, the DSRN independently reconstructs masked nouns and verbs from the sentence queries using the generated proposals. Finally, an integration block is applied to synthesize the two types of proposals, distinguishing between positive and negative instances through contrastive learning. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate that the proposed method achieves state-of-the-art performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"95-107"},"PeriodicalIF":8.4000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10820011/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Weakly supervised temporal sentence grounding aims to identify semantically relevant video moments in an untrimmed video corresponding to a given sentence query without exact timestamps. Neuropsychology research indicates that the way the human brain handles information varies based on the grammatical categories of words, highlighting the importance of separately considering nouns and verbs. However, current methodologies primarily utilize pre-extracted video features to reconstruct randomly masked queries, neglecting the distinction between grammatical classes. This oversight could hinder forming meaningful connections between linguistic elements and the corresponding components in the video. To address this limitation, this paper introduces the dual semantic reconstruction network (DSRN) model. DSRN processes video features by distinctly correlating object features with nouns and motion features with verbs, thereby mimicking the human brain's parsing mechanism. It begins with a feature disentanglement module that separately extracts object-aware and motion-aware features from video content. Then, in a dual-branch structure, these disentangled features are used to generate separate proposals for objects and motions through two dedicated proposal generation modules. A consistency constraint is proposed to ensure a high level of agreement between the boundaries of object-related and motion-related proposals. Subsequently, the DSRN independently reconstructs masked nouns and verbs from the sentence queries using the generated proposals. Finally, an integration block is applied to synthesize the two types of proposals, distinguishing between positive and negative instances through contrastive learning. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate that the proposed method achieves state-of-the-art performance.
期刊介绍:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.