A fine-tuning approach based on spatio-temporal features for few-shot video object detection
Daniel Cores, Lorenzo Seidenari, Alberto Del Bimbo, Víctor M. Brea, Manuel Mucientes
Engineering Applications of Artificial Intelligence, vol. 146, Article 110198, published 2025-02-17. DOI: 10.1016/j.engappai.2025.110198
https://www.sciencedirect.com/science/article/pii/S0952197625001988
Citations: 0
Abstract
This paper describes a new Fine-Tuning approach for Few-Shot object detection in Videos (FTFSVid) that exploits spatio-temporal information to boost detection precision. Despite the progress made in the single-image domain in recent years, the few-shot video object detection problem remains almost unexplored. A few-shot detector must quickly adapt to a new domain with a limited number of annotations per category. Therefore, it is not possible to include videos in the training set, hindering the spatio-temporal learning process. We propose augmenting each training image with synthetic frames to train the spatio-temporal module of our method. This module employs attention mechanisms to mine relationships between proposals across frames, effectively leveraging spatio-temporal information. A spatio-temporal double head then localizes objects in the current frame while classifying them using both context from nearby frames and information from the current frame. Finally, the predicted scores are fed into a long-term object-linking method that generates object tubes across the video. By optimizing the classification score based on these tubes, our approach ensures spatio-temporal consistency. Classification is the primary challenge in few-shot object detection. Our results show that spatio-temporal information helps to mitigate this issue, paving the way for future research in this direction. FTFSVid achieves 41.9 AP50 on the Few-Shot Video Object Detection (FSVOD-500) dataset and 42.9 AP50 on the Few-Shot YouTube Video (FSYTV-40) dataset, surpassing our spatial baseline by 4.3 and 2.5 points. Additionally, FTFSVid outperforms previous few-shot video object detectors by 3.2 points on FSVOD-500 and 14.5 points on FSYTV-40, setting a new state-of-the-art.
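The long-term object-linking step described above can be illustrated with a minimal greedy sketch: per-frame detections are chained into tubes by box overlap, and each detection's classification score is then replaced by its tube's average, enforcing score consistency along the tube. The abstract does not detail FTFSVid's actual linking algorithm, so the IoU threshold of 0.5, the greedy frame-by-frame matching, and the simple score averaging below are illustrative assumptions, not the paper's method.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def link_tubes(frames, iou_thr=0.5):
    """Greedily link per-frame detections into object tubes (a sketch,
    not FTFSVid's actual algorithm).

    frames: list (over time) of lists of (box, score) detections.
    Returns tubes as dicts {"dets": [(box, score), ...], "end": last frame index}.
    """
    tubes = []
    for t, dets in enumerate(frames):
        matched = set()
        for tube in tubes:
            if tube["end"] != t - 1:
                continue  # only extend tubes that reached the previous frame
            last_box = tube["dets"][-1][0]
            best_iou, best_i = iou_thr, None
            for i, (box, _) in enumerate(dets):
                if i not in matched and iou(last_box, box) >= best_iou:
                    best_iou, best_i = iou(last_box, box), i
            if best_i is not None:
                tube["dets"].append(dets[best_i])
                tube["end"] = t
                matched.add(best_i)
        # unmatched detections start new tubes
        tubes.extend({"dets": [d], "end": t}
                     for i, d in enumerate(dets) if i not in matched)
    return tubes


def refine_scores(tubes):
    """Replace each detection's score with its tube's average score,
    enforcing classification consistency along the whole tube."""
    out = []
    for tube in tubes:
        avg = sum(s for _, s in tube["dets"]) / len(tube["dets"])
        out.append([(box, avg) for box, _ in tube["dets"]])
    return out
```

For two consecutive frames containing a detection of the same object at a slightly shifted position, `link_tubes` produces a single two-detection tube, and `refine_scores` assigns both detections the mean of their original scores.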
Journal description
Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.