Hierarchical Task-aware Temporal Modeling and Matching for few-shot action recognition

IF 6.7; Zone 2, Computer Science; Q1 Computer Science, Artificial Intelligence; Neurocomputing; Pub Date: 2025-04-01; Epub Date: 2025-01-27; DOI: 10.1016/j.neucom.2025.129467
Yucheng Zhan, Yijun Pan, Siying Wu, Yueyi Zhang, Xiaoyan Sun
Neurocomputing, vol. 624, Article 129467. https://www.sciencedirect.com/science/article/pii/S0925231225001390
Citations: 0

Abstract

Few-shot action recognition seeks to classify new action categories using only a few labeled video samples as reference. Due to the lack of sufficient training samples and the complex structure of video data, it is difficult to extract global features that can be directly used for classification. Therefore, most previous works adopt image encoders to extract the features of each frame individually, and then perform temporal fusion and alignment for the query and support features. However, they neglect the importance of sufficient spatiotemporal modeling and relationships with other categories in the few-shot task when extracting features, rendering them less effective at distinguishing between the given action classes, especially those that require the perception of local motion. In this paper, we present Hierarchical Task-aware Temporal Modeling and Matching (HTTMM) to better perceive critical motion patterns and extract task-relevant discriminative features for matching. Specifically, we propose a task guidance generator and a hierarchical task-aware temporal module. The former collects samples across all categories to get task contextual information and generate task guidance. The latter performs hierarchical spatiotemporal modeling in a task-aware manner by incorporating the task guidance and leveraging the multi-stage visual features of the image encoder. This design makes the extracted features rich in motion cues and more discriminative among the new classes in the task. Based on the global and local features obtained, we further propose a simple matching strategy that takes multi-level relationships into consideration to improve the robustness of matching. To demonstrate the effectiveness of our method, we evaluate it on four commonly used datasets, i.e., Kinetics, UCF101, HMDB51, and Something–Something v2. The experimental results show that our method outperforms the existing state-of-the-art methods.
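The abstract gives no equations, so the following is only a toy sketch of the general idea it describes: gather context from every support sample in the task, condition features on that context, then match a query using both a global (whole-video) score and a local (frame-level) score. It is not the HTTMM implementation. Every name here (`task_context`, `condition_on_task`, `global_score`, `local_score`, `classify`, `alpha`) is hypothetical, mean subtraction is a crude stand-in for the learned task guidance generator, and the hierarchical use of multi-stage encoder features is omitted entirely.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def task_context(support):
    """Assumed stand-in for task guidance: mean frame over ALL support classes."""
    frames = [f for videos in support.values() for video in videos for f in video]
    return mean_vector(frames)


def condition_on_task(video, context):
    """Subtract the shared task context from every frame feature."""
    return [[x - c for x, c in zip(frame, context)] for frame in video]


def global_score(query, candidate):
    """Compare temporally averaged (whole-video) features."""
    return cosine(mean_vector(query), mean_vector(candidate))


def local_score(query, candidate):
    """For each query frame, take its best-matching candidate frame, then average."""
    best = [max(cosine(q, c) for c in candidate) for q in query]
    return sum(best) / len(best)


def classify(query, support, alpha=0.5):
    """Score each class by a weighted mix of global and local similarity."""
    context = task_context(support)
    q = condition_on_task(query, context)
    scores = {}
    for label, videos in support.items():
        per_video = []
        for video in videos:
            s = condition_on_task(video, context)
            per_video.append(alpha * global_score(q, s) + (1 - alpha) * local_score(q, s))
        scores[label] = sum(per_video) / len(per_video)
    return max(scores, key=scores.get), scores


def toy_video(direction, rng, frames=3):
    """Synthetic per-frame features: a large shared offset plus a class direction."""
    shared = [5.0, 5.0, 5.0, 5.0]
    return [[s + d + rng.uniform(-0.05, 0.05) for s, d in zip(shared, direction)]
            for _ in range(frames)]


# A 2-way 2-shot toy episode with made-up class names and features.
rng = random.Random(0)
support = {
    "wave": [toy_video([1.0, 0.0, 0.0, 0.0], rng) for _ in range(2)],
    "jump": [toy_video([0.0, 1.0, 0.0, 0.0], rng) for _ in range(2)],
}
query = toy_video([0.0, 1.0, 0.0, 0.0], rng)
predicted, scores = classify(query, support)
print(predicted)  # prints "jump"
```

In this toy data every frame shares a large common offset, so raw cosine similarities between the two classes are nearly identical (about 0.99); subtracting the task-wide context is what separates them. The weight `alpha` is an arbitrary choice for balancing the two scores and would need tuning in any real setting.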
Source journal
Neurocomputing
Category: Engineering & Technology, Computer Science: Artificial Intelligence
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days
About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics are neurocomputing theory, practice, and applications.