Hierarchical Task-aware Temporal Modeling and Matching for few-shot action recognition

Neurocomputing; IF 6.7; Zone 2 (Computer Science); Q1 (Computer Science, Artificial Intelligence); Pub Date: 2025-04-01; Epub Date: 2025-01-27; DOI: 10.1016/j.neucom.2025.129467

Yucheng Zhan, Yijun Pan, Siying Wu, Yueyi Zhang, Xiaoyan Sun

Neurocomputing, Volume 624, Article 129467. Publication type: Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S0925231225001390

Citations: 0

Abstract


Hierarchical Task-aware Temporal Modeling and Matching for few-shot action recognition

Few-shot action recognition seeks to classify new action categories using only a few labeled video samples as reference. Due to the lack of sufficient training samples and the complex structure of video data, it is difficult to extract global features that can be directly used for classification. Therefore, most previous works adopt image encoders to extract the features of each frame individually, and then perform temporal fusion and alignment for the query and support features. However, they neglect the importance of sufficient spatiotemporal modeling and relationships with other categories in the few-shot task when extracting features, rendering them less effective at distinguishing between the given action classes, especially those that require the perception of local motion. In this paper, we present Hierarchical Task-aware Temporal Modeling and Matching (HTTMM) to better perceive critical motion patterns and extract task-relevant discriminative features for matching. Specifically, we propose a task guidance generator and a hierarchical task-aware temporal module. The former collects samples across all categories to get task contextual information and generate task guidance. The latter performs hierarchical spatiotemporal modeling in a task-aware manner by incorporating the task guidance and leveraging the multi-stage visual features of the image encoder. This design makes the extracted features rich in motion cues and more discriminative among the new classes in the task. Based on the global and local features obtained, we further propose a simple matching strategy that takes multi-level relationships into consideration to improve the robustness of matching. To demonstrate the effectiveness of our method, we evaluate it on four commonly used datasets, i.e., Kinetics, UCF101, HMDB51, and Something–Something v2. The experimental results show that our method outperforms the existing state-of-the-art methods.
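The abstract does not give the matching formula, so the following is only a minimal sketch of the general idea of scoring a query video against a few labeled support videos using both a global (video-level) and a local (frame-level) similarity. It is not the HTTMM method: there is no task guidance generator and no learned temporal module here. The function names, the cosine similarity, the best-match frame alignment, and the weight `alpha` are all assumptions made for illustration, and the per-frame features are toy 2-D vectors standing in for image-encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_pool(frames):
    """Average T per-frame vectors into one global (video-level) vector."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def local_score(query, support):
    """Frame-level score: match each query frame to its most similar support frame."""
    return sum(max(cosine(q, s) for s in support) for q in query) / len(query)

def video_score(query, support, alpha=0.5):
    """Blend global and local similarity; alpha is an assumed, hand-set weight."""
    global_sim = cosine(mean_pool(query), mean_pool(support))
    return alpha * global_sim + (1 - alpha) * local_score(query, support)

def classify(query, support_set, alpha=0.5):
    """Return the best label and the per-class mean score over its support videos."""
    scores = {
        label: sum(video_score(query, v, alpha) for v in videos) / len(videos)
        for label, videos in support_set.items()
    }
    return max(scores, key=scores.get), scores

# Toy 2-way 1-shot episode: each video is a list of per-frame feature vectors.
support_set = {
    "wave": [[[1.0, 0.0], [0.9, 0.1]]],
    "jump": [[[0.0, 1.0], [0.1, 0.9]]],
}
query = [[0.8, 0.2], [1.0, 0.0]]

label, scores = classify(query, support_set)
print(label, scores)
```

In this toy episode the query frames point in nearly the same direction as the "wave" support frames, so both the global and the local term favor that class. A learned model would replace the fixed cosine scores and the hand-set `alpha` with trained components.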


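Source journal

Neurocomputing (Engineering Technology - Computer Science: Artificial Intelligence)

CiteScore: 13.10

Self-citation rate: 10.00%

Articles published: 1382

Review time: 70 days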

Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.