FTAN: Frame-to-frame temporal alignment network with contrastive learning for few-shot action recognition

Impact Factor 4.2 · CAS Tier 3 (Computer Science) · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Image and Vision Computing · Publication date: 2024-07-04 · DOI: 10.1016/j.imavis.2024.105159
Bin Yu, Yonghong Hou, Zihui Guo, Zhiyi Gao, Yueyang Li
{"title":"FTAN: Frame-to-frame temporal alignment network with contrastive learning for few-shot action recognition","authors":"Bin Yu ,&nbsp;Yonghong Hou ,&nbsp;Zihui Guo ,&nbsp;Zhiyi Gao ,&nbsp;Yueyang Li","doi":"10.1016/j.imavis.2024.105159","DOIUrl":null,"url":null,"abstract":"<div><p>Most current few-shot action recognition approaches follow the metric learning paradigm, measuring the distance of any sub-sequences (frames, any frame combinations or clips) between different actions for classification. However, this disordered distance metric between action sub-sequences ignores the long-term temporal relations of actions, which may result in significant metric deviations. What's more, the distance metric suffers from the distinctive temporal distribution of different actions, including intra-class temporal offsets and inter-class local similarity. In this paper, a novel few-shot action recognition framework, Frame-to-frame Temporal Alignment Network (<strong>FTAN</strong>), is proposed to address the above challenges. Specifically, an attention-based temporal alignment (<strong>ATA</strong>) module is devised to calculate the distance between corresponding frames of different actions along the temporal dimension to achieve frame-to-frame temporal alignment. Meanwhile, the Temporal Context module (<strong>TCM</strong>) is proposed to increase inter-class diversity by enriching the frame-level feature representation, and the Frames Cyclic Shift Module (<strong>FCSM</strong>) performs frame-level temporal cyclic shift to reduce intra-class inconsistency. In addition, we present temporal and global contrastive objectives to assist in learning discriminative and class-agnostic visual features. Experimental results show that the proposed architecture achieves state-of-the-art on HMDB51, UCF101, Something-Something V2 and Kinetics-100 datasets.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2000,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624002646","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Most current few-shot action recognition approaches follow the metric learning paradigm, measuring the distance between arbitrary sub-sequences (frames, frame combinations, or clips) of different actions for classification. However, this disordered distance metric between action sub-sequences ignores the long-term temporal relations of actions, which may result in significant metric deviations. Moreover, the distance metric suffers from the distinctive temporal distribution of different actions, including intra-class temporal offsets and inter-class local similarity. In this paper, a novel few-shot action recognition framework, the Frame-to-frame Temporal Alignment Network (FTAN), is proposed to address these challenges. Specifically, an attention-based temporal alignment (ATA) module is devised to calculate the distance between corresponding frames of different actions along the temporal dimension, achieving frame-to-frame temporal alignment. Meanwhile, the Temporal Context Module (TCM) is proposed to increase inter-class diversity by enriching the frame-level feature representation, and the Frames Cyclic Shift Module (FCSM) performs frame-level temporal cyclic shift to reduce intra-class inconsistency. In addition, we present temporal and global contrastive objectives to assist in learning discriminative and class-agnostic visual features. Experimental results show that the proposed architecture achieves state-of-the-art performance on the HMDB51, UCF101, Something-Something V2 and Kinetics-100 datasets.
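To make the frame-to-frame distance idea concrete, the sketch below gives a minimal, illustrative interpretation rather than the authors' implementation: it compares frame t of a query video only with frame t of a support video (in contrast to disordered sub-sequence matching), and a second helper searches over temporal cyclic shifts of the support, in the spirit of the FCSM, to tolerate intra-class temporal offsets. The function names, the use of cosine distance, and the shift search are assumptions made here for illustration only; the actual ATA and TCM modules involve attention and context enrichment that this toy code omits.

```python
import torch
import torch.nn.functional as F


def frame_aligned_distance(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """Frame-to-frame distance between two videos with T frames each.

    query, support: (T, D) frame-level features.
    Frame t of the query is compared only with frame t of the support,
    and the per-frame cosine distances are averaged. (Illustrative choice,
    not the paper's exact metric.)
    """
    q = F.normalize(query, dim=-1)
    s = F.normalize(support, dim=-1)
    return (1.0 - (q * s).sum(dim=-1)).mean()


def shift_invariant_distance(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """Cyclic-shift variant (hypothetical): take the best frame-aligned
    distance over all temporal cyclic shifts of the support, to tolerate
    intra-class temporal offsets."""
    T = support.shape[0]
    dists = torch.stack([
        frame_aligned_distance(query, torch.roll(support, shifts=k, dims=0))
        for k in range(T)
    ])
    return dists.min()


# Toy usage: 8 frames with 512-dimensional features per frame.
q = torch.randn(8, 512)
s = torch.randn(8, 512)
print(frame_aligned_distance(q, s).item(), shift_invariant_distance(q, s).item())
```

In a metric-learning episode, the query would then be assigned to the class whose support representation yields the smallest such distance; FTAN's contrastive objectives would additionally shape the frame-level features before this comparison.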

Source journal: Image and Vision Computing
Category: Engineering & Technology - Engineering: Electrical & Electronic
CiteScore: 8.50
Self-citation rate: 8.50%
Articles per year: 143
Average review time: 7.8 months
Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.
Latest articles in this journal:
A dictionary learning based unsupervised neural network for single image compressed sensing
Unbiased scene graph generation via head-tail cooperative network with self-supervised learning
UIR-ES: An unsupervised underwater image restoration framework with equivariance and stein unbiased risk estimator
A new deepfake detection model for responding to perception attacks in embodied artificial intelligence
Ground4Act: Leveraging visual-language model for collaborative pushing and grasping in clutter