Yucheng Zhan, Yijun Pan, Siying Wu, Yueyi Zhang, Xiaoyan Sun
Neurocomputing, vol. 624, Article 129467. Published 2025-04-01 (Epub 2025-01-27).
DOI: 10.1016/j.neucom.2025.129467
https://www.sciencedirect.com/science/article/pii/S0925231225001390
Hierarchical Task-aware Temporal Modeling and Matching for few-shot action recognition
Few-shot action recognition seeks to classify new action categories using only a few labeled video samples as reference. Due to the lack of sufficient training samples and the complex structure of video data, it is difficult to extract global features that can be directly used for classification. Therefore, most previous works adopt image encoders to extract the features of each frame individually, and then perform temporal fusion and alignment for the query and support features. However, they neglect the importance of sufficient spatiotemporal modeling and relationships with other categories in the few-shot task when extracting features, rendering them less effective at distinguishing between the given action classes, especially those that require the perception of local motion. In this paper, we present Hierarchical Task-aware Temporal Modeling and Matching (HTTMM) to better perceive critical motion patterns and extract task-relevant discriminative features for matching. Specifically, we propose a task guidance generator and a hierarchical task-aware temporal module. The former collects samples across all categories to get task contextual information and generate task guidance. The latter performs hierarchical spatiotemporal modeling in a task-aware manner by incorporating the task guidance and leveraging the multi-stage visual features of the image encoder. This design makes the extracted features rich in motion cues and more discriminative among the new classes in the task. Based on the global and local features obtained, we further propose a simple matching strategy that takes multi-level relationships into consideration to improve the robustness of matching. To demonstrate the effectiveness of our method, we evaluate it on four commonly used datasets, i.e., Kinetics, UCF101, HMDB51, and Something–Something v2. The experimental results show that our method outperforms the existing state-of-the-art methods.
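The abstract does not give implementation details, so the following is only a rough sketch of the general idea of matching a query video against support videos using both a global (video-level) and a local (frame-level) similarity. It is not the HTTMM method itself: the function names, the mixing weight `alpha`, the best-match local score, and the toy two-dimensional frame features are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def mean_vector(frames):
    """Average per-frame features into one global video-level feature."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def local_score(query_frames, support_frames):
    """Average, over query frames, of the best cosine match among support frames."""
    best = [max(cosine(q, s) for s in support_frames) for q in query_frames]
    return sum(best) / len(best)


def match_score(query_frames, support_frames, alpha=0.5):
    """Blend global and local similarity; alpha is a hypothetical mixing weight."""
    global_sim = cosine(mean_vector(query_frames), mean_vector(support_frames))
    local_sim = local_score(query_frames, support_frames)
    return alpha * global_sim + (1 - alpha) * local_sim


def classify(query_frames, support_set, alpha=0.5):
    """Return (best_label, scores) for one few-shot episode.

    support_set maps each class label to a list of videos,
    where each video is a list of per-frame feature vectors.
    """
    scores = {}
    for label, videos in support_set.items():
        per_video = [match_score(query_frames, v, alpha) for v in videos]
        scores[label] = sum(per_video) / len(per_video)
    return max(scores, key=scores.get), scores


# Toy 2-way 1-shot episode with made-up frame features.
support = {
    "wave": [[[1.0, 0.0], [0.9, 0.1]]],
    "jump": [[[0.0, 1.0], [0.1, 0.9]]],
}
query = [[1.0, 0.05], [0.8, 0.2]]
label, scores = classify(query, support)
print(label, scores)
```

In a real system the frame features would come from an image encoder rather than hand-written vectors, and the paper's task guidance and hierarchical temporal modeling would shape those features before any matching takes place.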
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.