Transformer-based multiview spatiotemporal feature interactive fusion for human action recognition in depth videos

IF 3.4 3区 工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Signal Processing-Image Communication Pub Date : 2024-11-23 DOI:10.1016/j.image.2024.117244
Hanbo Wu, Xin Ma, Yibin Li
{"title":"Transformer-based multiview spatiotemporal feature interactive fusion for human action recognition in depth videos","authors":"Hanbo Wu,&nbsp;Xin Ma,&nbsp;Yibin Li","doi":"10.1016/j.image.2024.117244","DOIUrl":null,"url":null,"abstract":"<div><div>Spatiotemporal feature modeling is the key to human action recognition task. Multiview data is helpful in acquiring numerous clues to improve the robustness and accuracy of feature description. However, multiview action recognition has not been well explored yet. Most existing methods perform action recognition only from a single view, which leads to the limited performance. Depth data is insensitive to illumination and color variations and offers significant advantages by providing reliable 3D geometric information of the human body. In this study, we concentrate on action recognition from depth videos and introduce a transformer-based framework for the interactive fusion of multiview spatiotemporal features, facilitating effective action recognition through deep integration of multiview information. Specifically, the proposed framework consists of intra-view spatiotemporal feature modeling (ISTFM) and cross-view feature interactive fusion (CFIF). Firstly, we project a depth video into three orthogonal views to construct multiview depth dynamic volumes that describe the 3D spatiotemporal evolution of human actions. ISTFM takes multiview depth dynamic volumes as input to extract spatiotemporal features of three views with 3D CNN, then applies self-attention mechanism in transformer to model global context dependency within each view. CFIF subsequently extends self-attention into cross-attention to conduct deep interaction between different views, and further integrates cross-view features together to generate a multiview joint feature representation. Our proposed method is tested on two large-scale RGBD datasets by extensive experiments to demonstrate the remarkable improvement for enhancing the recognition performance.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"131 ","pages":"Article 117244"},"PeriodicalIF":3.4000,"publicationDate":"2024-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Signal Processing-Image Communication","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0923596524001450","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Spatiotemporal feature modeling is the key to human action recognition task. Multiview data is helpful in acquiring numerous clues to improve the robustness and accuracy of feature description. However, multiview action recognition has not been well explored yet. Most existing methods perform action recognition only from a single view, which leads to the limited performance. Depth data is insensitive to illumination and color variations and offers significant advantages by providing reliable 3D geometric information of the human body. In this study, we concentrate on action recognition from depth videos and introduce a transformer-based framework for the interactive fusion of multiview spatiotemporal features, facilitating effective action recognition through deep integration of multiview information. Specifically, the proposed framework consists of intra-view spatiotemporal feature modeling (ISTFM) and cross-view feature interactive fusion (CFIF). Firstly, we project a depth video into three orthogonal views to construct multiview depth dynamic volumes that describe the 3D spatiotemporal evolution of human actions. ISTFM takes multiview depth dynamic volumes as input to extract spatiotemporal features of three views with 3D CNN, then applies self-attention mechanism in transformer to model global context dependency within each view. CFIF subsequently extends self-attention into cross-attention to conduct deep interaction between different views, and further integrates cross-view features together to generate a multiview joint feature representation. Our proposed method is tested on two large-scale RGBD datasets by extensive experiments to demonstrate the remarkable improvement for enhancing the recognition performance.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于变换的多视点时空特征交互融合深度视频人体动作识别
时空特征建模是人体动作识别任务的关键。多视图数据有助于获取大量线索,提高特征描述的鲁棒性和准确性。然而,多视点动作识别还没有得到很好的探索。现有的方法大多只从单个视图进行动作识别,这导致了性能的限制。深度数据对光照和颜色变化不敏感,提供可靠的人体三维几何信息,具有显著的优势。在本研究中,我们将重点放在深度视频的动作识别上,引入一种基于transform的多视点时空特征交互融合框架,通过深度融合多视点信息实现有效的动作识别。具体而言,该框架包括视图内时空特征建模(ISTFM)和跨视图特征交互融合(CFIF)。首先,我们将深度视频投影到三个正交视图中,构建描述人类行为三维时空演变的多视图深度动态体。ISTFM以多视图深度动态体为输入,利用3D CNN提取三个视图的时空特征,然后利用transformer中的自关注机制来模拟每个视图内的全局上下文依赖关系。CFIF随后将自注意扩展为交叉注意,在不同视图之间进行深度交互,并进一步将交叉视图特征整合在一起,生成多视图联合特征表示。在两个大规模的RGBD数据集上进行了大量的实验,证明了该方法在提高识别性能方面的显著改进。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Signal Processing-Image Communication
Signal Processing-Image Communication 工程技术-工程:电子与电气
CiteScore
8.40
自引率
2.90%
发文量
138
审稿时长
5.2 months
期刊介绍: Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following: To present a forum for the advancement of theory and practice of image communication. To stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems. To contribute to a rapid information exchange between the industrial and academic environments. The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world. Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments. Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.
期刊最新文献
Graph-based image captioning with semantic and spatial features Visual information fidelity based frame level rate control for H.265/HEVC Editorial Board Transformer-based multiview spatiotemporal feature interactive fusion for human action recognition in depth videos Vocal cord anomaly detection based on Local Fine-Grained Contour Features
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1