Joint image-instance spatial–temporal attention for few-shot action recognition

IF 4.3 | Zone 3, Computer Science | Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Computer Vision and Image Understanding | Pub Date: 2025-03-01 | DOI: 10.1016/j.cviu.2025.104322

Zefeng Qian , Chongyang Zhang , Yifei Huang , Gang Wang , Jiangyong Ying

Computer Vision and Image Understanding, volume 254, Article 104322. Journal Article. Not open access. https://www.sciencedirect.com/science/article/pii/S1077314225000451

Citations: 0

Abstract

Few-shot Action Recognition (FSAR) constitutes a crucial challenge in computer vision, entailing the recognition of actions from a limited set of examples. Recent approaches mainly focus on employing image-level features to construct temporal dependencies and generate prototypes for each action category. However, a considerable number of these methods rely mainly on image-level features that incorporate background noise and focus insufficiently on the real foreground (action-related instances), thereby compromising the recognition capability, particularly in the few-shot scenario. To tackle this issue, we propose a novel joint Image-Instance level Spatial–temporal attention approach (I²ST) for Few-shot Action Recognition. The core concept of I²ST is to perceive the action-related instances and integrate them with image features via spatial–temporal attention. Specifically, I²ST consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial–temporal Attention. Given the basic representations from the feature extractor, the Action-related Instance Perception is introduced to perceive action-related instances under the guidance of a text-guided segmentation model. Subsequently, the Joint Image-Instance Spatial–temporal Attention is used to construct the feature dependency between instances and images. To enhance the prototype representations of different categories of videos, a pair of spatial–temporal attention sub-modules is introduced to combine image features and instance embeddings across both temporal and spatial dimensions, and a global fusion sub-module is utilized to aggregate global contextual information, so that robust action video prototypes can be formed. Finally, based on the video prototype, a Global–Local Prototype Matching is performed for reliable few-shot video matching. In this manner, our proposed I²ST can effectively exploit the foreground instance-level cues and model more accurate spatial–temporal relationships for the complex few-shot video recognition scenarios. Extensive experiments across standard few-shot benchmarks demonstrate that the proposed framework outperforms existing methods and achieves state-of-the-art performance under various few-shot settings.
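
The abstract gives no equations or layer details, so the following is only a minimal NumPy sketch of the general idea it describes, not the authors' implementation. It assumes one image-level feature per frame and a fixed number of instance embeddings per frame, uses plain scaled dot-product attention with no learned projections, and combines a video-level and a frame-level cosine similarity with equal weights. Every function name (`cross_attention`, `fuse_video`, `global_local_score`, `classify`), the residual sum, and the 0.5 weighting are hypothetical choices made for illustration; random vectors stand in for the outputs of the feature extractor and the text-guided segmentation model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product attention; each query token attends to all key tokens."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (Nq, Nk)
    return softmax(scores, axis=-1) @ keys_values   # (Nq, d)

def fuse_video(image_feats, instance_feats):
    """image_feats: (T, d), instance_feats: (T, K, d). Returns fused (T, d) features."""
    T, K, d = instance_feats.shape
    # Spatial step (assumed): a frame's image feature attends to its own instances.
    spatial = np.stack([
        cross_attention(image_feats[t:t + 1], instance_feats[t])[0] for t in range(T)
    ])
    # Temporal step (assumed): each frame attends to the instances of all frames.
    temporal = cross_attention(image_feats, instance_feats.reshape(T * K, d))
    # Residual sum as a stand-in for the paper's fusion sub-module.
    return image_feats + spatial + temporal

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_local_score(query, prototype):
    """Equal-weight mix of video-level and frame-aligned cosine similarity."""
    g = float(l2_normalize(query.mean(0)) @ l2_normalize(prototype.mean(0)))
    l = float((l2_normalize(query) * l2_normalize(prototype)).sum(-1).mean())
    return 0.5 * (g + l)

def classify(query_fused, support_fused, support_labels):
    """Prototype per class = mean of that class's fused support videos."""
    classes = np.unique(support_labels)
    protos = [support_fused[support_labels == c].mean(0) for c in classes]
    scores = np.array([global_local_score(query_fused, p) for p in protos])
    return classes[int(scores.argmax())], scores

# Toy 5-way 1-shot episode with random stand-in features.
rng = np.random.default_rng(0)
T, K, d = 8, 3, 16
support_img = rng.normal(size=(5, T, d))
support_inst = rng.normal(size=(5, T, K, d))
support_labels = np.arange(5)
support_fused = np.stack([fuse_video(i, s) for i, s in zip(support_img, support_inst)])
# Reuse the class-2 support video as the query, so it should match its own prototype.
query_fused = fuse_video(support_img[2], support_inst[2])
pred, scores = classify(query_fused, support_fused, support_labels)
print(pred, np.round(scores, 3))
```

In this toy episode the query is a copy of one support video, so the match with its own prototype is near-perfect by construction; with real features the scores would depend on the learned attention and matching modules described in the paper.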


Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems

Latest articles from this journal

• Editorial Board
• Incremental few-shot instance segmentation without fine-tuning on novel classes
• Navigating social contexts: A transformer approach to relationship recognition
• View-to-label: Multi-view consistency for self-supervised monocular 3D object detection
• Joint image-instance spatial–temporal attention for few-shot action recognition