Shu Luo, Shijie Jiang, Da Cao, Huangxiao Deng, Jiawei Wang, Zheng Qin
{"title":"基于单帧时空注释的弱监督时空视频接地","authors":"Shu Luo, Shijie Jiang, Da Cao, Huangxiao Deng, Jiawei Wang, Zheng Qin","doi":"10.1016/j.knosys.2025.113200","DOIUrl":null,"url":null,"abstract":"<div><div>The task of weakly-supervised spatial–temporal video grounding, where model training only relies on video-sentence pairs, has garnered considerable attention. Its objective is to identify and localize spatial–temporal regions within a video that correspond to objects or events described in a query sentence. Existing approaches frame this task as a multiple instance learning (MIL) problem, where a bag is constructed for each frame and the same sentence is assigned to all frame bags. However, this approach can lead to false-positive frames as not all frames necessarily correspond to the query sentence. Additionally, region proposals in each frame are typically generated by pre-trained object detection models, which primarily focus on core regions and may result in inaccurate object or event localization. To address these issues, we propose annotating a spatial–temporal region in a single frame, which provides a simple yet effective means to enhance grounding performance without incurring significant additional cost. Specifically, we innovatively contribute a spatial–temporal MIL framework. In the temporal-level MIL, by applying Gaussian weighting to the frames of a video, we assign higher weights to the frames that are close to the annotated frame, while lower weights are assigned to frames that are further away. In the spatial-level MIL, we propose regions in the each frame and compute their similarity with the annotated bounding box, selecting regions with higher similarity scores for training. Ultimately, temporal-level and spatial-level MILs are integrated to jointly optimize the accuracy of both types of grounding. Through experimental evaluations on two re-annotated datasets, our proposed framework has been demonstrated to exhibit superiority in terms of both overall performance comparison and detailed micro-level analyses. Compared to the latest weakly-supervised methods on the VidSTG dataset, our method improves the temporal localization performance by at least 10.35% and the spatial localization performance by at least 11.89%.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"314 ","pages":"Article 113200"},"PeriodicalIF":7.6000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Weakly-supervised spatial–temporal video grounding via spatial–temporal annotation on a single frame\",\"authors\":\"Shu Luo, Shijie Jiang, Da Cao, Huangxiao Deng, Jiawei Wang, Zheng Qin\",\"doi\":\"10.1016/j.knosys.2025.113200\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The task of weakly-supervised spatial–temporal video grounding, where model training only relies on video-sentence pairs, has garnered considerable attention. Its objective is to identify and localize spatial–temporal regions within a video that correspond to objects or events described in a query sentence. Existing approaches frame this task as a multiple instance learning (MIL) problem, where a bag is constructed for each frame and the same sentence is assigned to all frame bags. However, this approach can lead to false-positive frames as not all frames necessarily correspond to the query sentence. 
Additionally, region proposals in each frame are typically generated by pre-trained object detection models, which primarily focus on core regions and may result in inaccurate object or event localization. To address these issues, we propose annotating a spatial–temporal region in a single frame, which provides a simple yet effective means to enhance grounding performance without incurring significant additional cost. Specifically, we innovatively contribute a spatial–temporal MIL framework. In the temporal-level MIL, by applying Gaussian weighting to the frames of a video, we assign higher weights to the frames that are close to the annotated frame, while lower weights are assigned to frames that are further away. In the spatial-level MIL, we propose regions in the each frame and compute their similarity with the annotated bounding box, selecting regions with higher similarity scores for training. Ultimately, temporal-level and spatial-level MILs are integrated to jointly optimize the accuracy of both types of grounding. Through experimental evaluations on two re-annotated datasets, our proposed framework has been demonstrated to exhibit superiority in terms of both overall performance comparison and detailed micro-level analyses. Compared to the latest weakly-supervised methods on the VidSTG dataset, our method improves the temporal localization performance by at least 10.35% and the spatial localization performance by at least 11.89%.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"314 \",\"pages\":\"Article 113200\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705125002473\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/2/27 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125002473","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/2/27 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Weakly-supervised spatial–temporal video grounding via spatial–temporal annotation on a single frame
The task of weakly-supervised spatial–temporal video grounding, where model training relies only on video–sentence pairs, has garnered considerable attention. Its objective is to identify and localize spatial–temporal regions within a video that correspond to objects or events described in a query sentence. Existing approaches frame this task as a multiple instance learning (MIL) problem, where a bag is constructed for each frame and the same sentence is assigned to all frame bags. However, this approach can lead to false-positive frames, as not all frames necessarily correspond to the query sentence. Additionally, region proposals in each frame are typically generated by pre-trained object detection models, which focus primarily on core regions and may result in inaccurate object or event localization. To address these issues, we propose annotating a spatial–temporal region in a single frame, which provides a simple yet effective means of enhancing grounding performance without incurring significant additional cost. Specifically, we contribute a novel spatial–temporal MIL framework. In the temporal-level MIL, we apply Gaussian weighting to the frames of a video, assigning higher weights to frames close to the annotated frame and lower weights to frames further away. In the spatial-level MIL, we generate region proposals in each frame, compute their similarity with the annotated bounding box, and select regions with higher similarity scores for training. Finally, the temporal-level and spatial-level MILs are integrated to jointly optimize the accuracy of both types of grounding. Experimental evaluations on two re-annotated datasets demonstrate the superiority of the proposed framework in both overall performance comparisons and detailed micro-level analyses. Compared with the latest weakly-supervised methods on the VidSTG dataset, our method improves temporal localization performance by at least 10.35% and spatial localization performance by at least 11.89%.
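The abstract does not give the exact formulations of the two selection steps. As a minimal sketch, assuming the temporal weighting uses a standard Gaussian kernel over frame indices and that region similarity is measured by IoU with the annotated box (both are assumptions, not stated in the abstract; the sigma and threshold values are likewise hypothetical), the two MIL selection steps might look like:

```python
import numpy as np

def temporal_gaussian_weights(num_frames: int, annotated_idx: int,
                              sigma: float = 5.0) -> np.ndarray:
    """Weight each frame by its temporal distance to the single annotated frame.

    Frames near the annotation receive weights close to 1; distant frames
    decay toward 0 along a Gaussian curve. sigma is a hypothetical
    hyperparameter, not taken from the paper.
    """
    t = np.arange(num_frames)
    return np.exp(-((t - annotated_idx) ** 2) / (2.0 * sigma ** 2))

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def select_positive_regions(proposals: np.ndarray, annotated_box: np.ndarray,
                            iou_threshold: float = 0.5) -> np.ndarray:
    """Keep proposals whose overlap with the annotated box exceeds the threshold
    (IoU is assumed here as the similarity measure)."""
    scores = np.array([iou(p, annotated_box) for p in proposals])
    return proposals[scores > iou_threshold]

# Hypothetical example: a 32-frame clip annotated at frame 10.
weights = temporal_gaussian_weights(num_frames=32, annotated_idx=10)
proposals = np.array([[10, 10, 50, 50], [12, 8, 52, 48],
                      [200, 200, 240, 240]], dtype=float)
positives = select_positive_regions(proposals,
                                    annotated_box=np.array([11, 9, 51, 49],
                                                           dtype=float))
```

In this sketch, the first two proposals overlap the annotated box heavily and are kept as positives, while the third does not and is discarded; the Gaussian weights would then down-weight the MIL loss on frames far from the annotation.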
Journal overview:
Knowledge-Based Systems is an international and interdisciplinary journal in artificial intelligence that publishes original, innovative, and creative research in the field. It focuses on systems built on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computational techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligent models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.