Shu Luo, Shijie Jiang, Da Cao, Huangxiao Deng, Jiawei Wang, Zheng Qin
{"title":"基于单帧时空注释的弱监督时空视频接地","authors":"Shu Luo, Shijie Jiang, Da Cao, Huangxiao Deng, Jiawei Wang, Zheng Qin","doi":"10.1016/j.knosys.2025.113200","DOIUrl":null,"url":null,"abstract":"<div><div>The task of weakly-supervised spatial–temporal video grounding, where model training only relies on video-sentence pairs, has garnered considerable attention. Its objective is to identify and localize spatial–temporal regions within a video that correspond to objects or events described in a query sentence. Existing approaches frame this task as a multiple instance learning (MIL) problem, where a bag is constructed for each frame and the same sentence is assigned to all frame bags. However, this approach can lead to false-positive frames as not all frames necessarily correspond to the query sentence. Additionally, region proposals in each frame are typically generated by pre-trained object detection models, which primarily focus on core regions and may result in inaccurate object or event localization. To address these issues, we propose annotating a spatial–temporal region in a single frame, which provides a simple yet effective means to enhance grounding performance without incurring significant additional cost. Specifically, we innovatively contribute a spatial–temporal MIL framework. In the temporal-level MIL, by applying Gaussian weighting to the frames of a video, we assign higher weights to the frames that are close to the annotated frame, while lower weights are assigned to frames that are further away. In the spatial-level MIL, we propose regions in the each frame and compute their similarity with the annotated bounding box, selecting regions with higher similarity scores for training. Ultimately, temporal-level and spatial-level MILs are integrated to jointly optimize the accuracy of both types of grounding. Through experimental evaluations on two re-annotated datasets, our proposed framework has been demonstrated to exhibit superiority in terms of both overall performance comparison and detailed micro-level analyses. Compared to the latest weakly-supervised methods on the VidSTG dataset, our method improves the temporal localization performance by at least 10.35% and the spatial localization performance by at least 11.89%.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"314 ","pages":"Article 113200"},"PeriodicalIF":7.6000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Weakly-supervised spatial–temporal video grounding via spatial–temporal annotation on a single frame\",\"authors\":\"Shu Luo, Shijie Jiang, Da Cao, Huangxiao Deng, Jiawei Wang, Zheng Qin\",\"doi\":\"10.1016/j.knosys.2025.113200\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The task of weakly-supervised spatial–temporal video grounding, where model training only relies on video-sentence pairs, has garnered considerable attention. Its objective is to identify and localize spatial–temporal regions within a video that correspond to objects or events described in a query sentence. Existing approaches frame this task as a multiple instance learning (MIL) problem, where a bag is constructed for each frame and the same sentence is assigned to all frame bags. However, this approach can lead to false-positive frames as not all frames necessarily correspond to the query sentence. 
Additionally, region proposals in each frame are typically generated by pre-trained object detection models, which primarily focus on core regions and may result in inaccurate object or event localization. To address these issues, we propose annotating a spatial–temporal region in a single frame, which provides a simple yet effective means to enhance grounding performance without incurring significant additional cost. Specifically, we innovatively contribute a spatial–temporal MIL framework. In the temporal-level MIL, by applying Gaussian weighting to the frames of a video, we assign higher weights to the frames that are close to the annotated frame, while lower weights are assigned to frames that are further away. In the spatial-level MIL, we propose regions in the each frame and compute their similarity with the annotated bounding box, selecting regions with higher similarity scores for training. Ultimately, temporal-level and spatial-level MILs are integrated to jointly optimize the accuracy of both types of grounding. Through experimental evaluations on two re-annotated datasets, our proposed framework has been demonstrated to exhibit superiority in terms of both overall performance comparison and detailed micro-level analyses. Compared to the latest weakly-supervised methods on the VidSTG dataset, our method improves the temporal localization performance by at least 10.35% and the spatial localization performance by at least 11.89%.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"314 \",\"pages\":\"Article 113200\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705125002473\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/2/27 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125002473","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/2/27 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Weakly-supervised spatial–temporal video grounding via spatial–temporal annotation on a single frame
The task of weakly-supervised spatial–temporal video grounding, where model training relies only on video–sentence pairs, has garnered considerable attention. Its objective is to identify and localize spatial–temporal regions within a video that correspond to objects or events described in a query sentence. Existing approaches frame this task as a multiple instance learning (MIL) problem, where a bag is constructed for each frame and the same sentence is assigned to all frame bags. However, this approach can lead to false-positive frames, as not all frames necessarily correspond to the query sentence. Additionally, region proposals in each frame are typically generated by pre-trained object detection models, which focus primarily on core regions and may result in inaccurate object or event localization. To address these issues, we propose annotating a spatial–temporal region in a single frame, which provides a simple yet effective means of enhancing grounding performance without incurring significant additional cost. Specifically, we contribute a novel spatial–temporal MIL framework. In the temporal-level MIL, we apply Gaussian weighting to the frames of a video, assigning higher weights to frames close to the annotated frame and lower weights to frames further away. In the spatial-level MIL, we generate region proposals in each frame, compute their similarity with the annotated bounding box, and select regions with higher similarity scores for training. Finally, the temporal-level and spatial-level MILs are integrated to jointly optimize the accuracy of both types of grounding. Experimental evaluations on two re-annotated datasets demonstrate the superiority of the proposed framework in both overall performance comparisons and detailed micro-level analyses. Compared with the latest weakly-supervised methods on the VidSTG dataset, our method improves temporal localization performance by at least 10.35% and spatial localization performance by at least 11.89%.
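The abstract does not give the exact formulations of the two selection steps. As a minimal sketch, assuming the temporal weighting uses a standard Gaussian kernel over frame indices and that region similarity is measured by IoU with the annotated box (both are assumptions, not stated in the abstract; the sigma and threshold values are likewise hypothetical), the two MIL selection steps might look like:

```python
import numpy as np

def temporal_gaussian_weights(num_frames: int, annotated_idx: int,
                              sigma: float = 5.0) -> np.ndarray:
    """Weight each frame by its temporal distance to the single annotated frame.

    Frames near the annotation receive weights close to 1; distant frames
    decay toward 0 along a Gaussian curve. sigma is a hypothetical
    hyperparameter, not taken from the paper.
    """
    t = np.arange(num_frames)
    return np.exp(-((t - annotated_idx) ** 2) / (2.0 * sigma ** 2))

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def select_positive_regions(proposals: np.ndarray, annotated_box: np.ndarray,
                            iou_threshold: float = 0.5) -> np.ndarray:
    """Keep proposals whose overlap with the annotated box exceeds the threshold
    (IoU is assumed here as the similarity measure)."""
    scores = np.array([iou(p, annotated_box) for p in proposals])
    return proposals[scores > iou_threshold]

# Hypothetical example: a 32-frame clip annotated at frame 10.
weights = temporal_gaussian_weights(num_frames=32, annotated_idx=10)
proposals = np.array([[10, 10, 50, 50], [12, 8, 52, 48],
                      [200, 200, 240, 240]], dtype=float)
positives = select_positive_regions(proposals,
                                    annotated_box=np.array([11, 9, 51, 49],
                                                           dtype=float))
```

In this sketch, the first two proposals overlap the annotated box heavily and are kept as positives, while the third does not and is discarded; the Gaussian weights would then down-weight the MIL loss on frames far from the annotation.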
Journal overview:
Knowledge-Based Systems is an international and interdisciplinary journal in artificial intelligence that publishes original, innovative, and creative research in the field. It focuses on systems built on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computational techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligent models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.