Spatio-Temporal Dynamic Interlaced Network for 3D human pose estimation in video
Feiyi Xu, Jifan Wang, Ying Sun, Jin Qi, Zhenjiang Dong, Yanfei Sun
Computer Vision and Image Understanding, Volume 251, Article 104258, February 2025. DOI: 10.1016/j.cviu.2024.104258
Citations: 0
Abstract
Recent transformer-based methods have achieved excellent performance in 3D human pose estimation. The distinguishing characteristic of the transformer lies in its equitable treatment of tokens, encoding each one independently. When applied to the human skeleton, the transformer regards every joint as an equally significant token. This can blur the extraction of connection relationships between joints, reducing the accuracy of the relational information. In addition, the transformer also treats every frame of a temporal sequence equally. This design can introduce substantial redundant information in short clips with frequent action changes, which harms the learning of temporal correlations. To alleviate these issues, we propose an end-to-end framework, the Spatio-Temporal Dynamic Interlaced Network (S-TDINet), comprising a dynamic spatial GCN encoder (DSGCE) and an interlaced temporal transformer encoder (ITTE). In the DSGCE module, we design three adaptive adjacency matrices to model spatial correlation from both static and dynamic perspectives. In the ITTE module, we introduce a global–local interlaced mechanism to mitigate interference from redundant information in fast-motion scenarios, thereby achieving more accurate temporal correlation modeling. Finally, we conduct extensive experiments and validate the effectiveness of our approach on two widely recognized benchmark datasets: Human3.6M and MPI-INF-3DHP.
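The abstract does not give DSGCE's equations, but the core idea it names (combining a fixed skeleton graph with data-dependent, adaptive adjacency matrices in a GCN layer) can be sketched as follows. This is an illustrative toy in NumPy, not the paper's implementation: the `adaptive_gcn_layer` function, the feature-similarity construction of the dynamic adjacency, and the identity placeholder for the skeleton graph are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable row-wise softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_gcn_layer(X, A_static, W):
    """One graph-convolution step blending a fixed skeleton adjacency
    with a data-dependent adjacency inferred from joint-feature
    similarity. Illustrative sketch only -- DSGCE's actual three
    adaptive matrices are not specified in the abstract.

    X: (J, C_in) joint features; A_static: (J, J); W: (C_in, C_out).
    """
    # Data-dependent adjacency: normalized pairwise feature similarity.
    A_dynamic = softmax(X @ X.T)              # (J, J), rows sum to 1
    A = A_static + A_dynamic                  # blend static and dynamic graphs
    # Row-normalize the combined adjacency so the update stays well-scaled.
    A = A / A.sum(axis=1, keepdims=True)
    return np.maximum(A @ X @ W, 0.0)         # aggregate, project, ReLU

# Toy example: a 17-joint skeleton (Human3.6M convention) with random features.
J, C_in, C_out = 17, 8, 16
rng = np.random.default_rng(0)
X = rng.standard_normal((J, C_in))
A_static = np.eye(J)  # placeholder; a real model would add bone edges here
W = rng.standard_normal((C_in, C_out)) * 0.1
Y = adaptive_gcn_layer(X, A_static, W)
print(Y.shape)  # (17, 16)
```

The point of the blend is that the static term preserves the known bone topology while the dynamic term lets the model discover pose-dependent joint relations that a uniform, token-equal transformer encoding would miss.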
Journal Introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems