Exploring multi-level transformers with feature frame padding network for 3D human pose estimation
{"title":"利用特征帧填充网络探索用于三维人体姿态估计的多级变换器","authors":"Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo","doi":"10.1007/s00530-024-01451-4","DOIUrl":null,"url":null,"abstract":"<p>Recently, transformer-based architecture achieved remarkable performance in 2D to 3D lifting pose estimation. Despite advancements in transformer-based architecture they still struggle to handle depth ambiguity, limited temporal information, lacking edge frame details, and short-term temporal features. Consequently, transformer architecture encounters challenges in preciously estimating the 3D human position. To address these problems, we proposed Multi-Level Transformers with a Feature Frame Padding Network (MLTFFPN). To do this, we first propose the frame-padding network, which allows the network to capture longer temporal dependencies and effectively address the lacking edge frame information, enabling a better understanding of the sequential nature of human motion and improving the accuracy of pose estimation. Furthermore, we employ a multi-level transformer to extract temporal information from 3D human poses, which aims to improve the short-range temporal dependencies among keypoints of the human pose skeleton. Specifically, we introduce the Refined Temporal Constriction and Proliferation Transformer (RTCPT), which incorporates spatio-temporal encoders and a Temporal Constriction and Proliferation (TCP) structure to reveal multi-scale attention information and effectively addresses the depth ambiguity problem. Moreover, we incorporate the Feature Aggregation Refinement (FAR) module into the TCP block in a cross-layer manner, which facilitates semantic representation through the persistent interaction of queries, keys, and values. We extensively evaluate the efficiency of our method through experiments on two well-known benchmark datasets: Human3.6M and MPI-INF-3DHP.</p>","PeriodicalId":3,"journal":{"name":"ACS Applied Electronic Materials","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploring multi-level transformers with feature frame padding network for 3D human pose estimation\",\"authors\":\"Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo\",\"doi\":\"10.1007/s00530-024-01451-4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Recently, transformer-based architecture achieved remarkable performance in 2D to 3D lifting pose estimation. Despite advancements in transformer-based architecture they still struggle to handle depth ambiguity, limited temporal information, lacking edge frame details, and short-term temporal features. Consequently, transformer architecture encounters challenges in preciously estimating the 3D human position. To address these problems, we proposed Multi-Level Transformers with a Feature Frame Padding Network (MLTFFPN). To do this, we first propose the frame-padding network, which allows the network to capture longer temporal dependencies and effectively address the lacking edge frame information, enabling a better understanding of the sequential nature of human motion and improving the accuracy of pose estimation. Furthermore, we employ a multi-level transformer to extract temporal information from 3D human poses, which aims to improve the short-range temporal dependencies among keypoints of the human pose skeleton. 
Specifically, we introduce the Refined Temporal Constriction and Proliferation Transformer (RTCPT), which incorporates spatio-temporal encoders and a Temporal Constriction and Proliferation (TCP) structure to reveal multi-scale attention information and effectively addresses the depth ambiguity problem. Moreover, we incorporate the Feature Aggregation Refinement (FAR) module into the TCP block in a cross-layer manner, which facilitates semantic representation through the persistent interaction of queries, keys, and values. We extensively evaluate the efficiency of our method through experiments on two well-known benchmark datasets: Human3.6M and MPI-INF-3DHP.</p>\",\"PeriodicalId\":3,\"journal\":{\"name\":\"ACS Applied Electronic Materials\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2024-08-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACS Applied Electronic Materials\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s00530-024-01451-4\",\"RegionNum\":3,\"RegionCategory\":\"材料科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Electronic Materials","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00530-024-01451-4","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo
Recently, transformer-based architectures have achieved remarkable performance in 2D-to-3D lifting for human pose estimation. Despite these advances, they still struggle with depth ambiguity, limited temporal information, missing edge-frame details, and short-term temporal features. Consequently, transformer architectures have difficulty estimating 3D human poses precisely. To address these problems, we propose Multi-Level Transformers with a Feature Frame Padding Network (MLTFFPN). We first propose a frame-padding network that allows the model to capture longer temporal dependencies and effectively compensates for missing edge-frame information, enabling a better understanding of the sequential nature of human motion and improving the accuracy of pose estimation. Furthermore, we employ a multi-level transformer to extract temporal information from 3D human poses, aiming to strengthen the short-range temporal dependencies among keypoints of the human pose skeleton. Specifically, we introduce the Refined Temporal Constriction and Proliferation Transformer (RTCPT), which incorporates spatio-temporal encoders and a Temporal Constriction and Proliferation (TCP) structure to reveal multi-scale attention information and effectively address the depth ambiguity problem. Moreover, we incorporate a Feature Aggregation Refinement (FAR) module into the TCP block in a cross-layer manner, which facilitates semantic representation through the persistent interaction of queries, keys, and values. We extensively evaluate the effectiveness of our method through experiments on two well-known benchmark datasets: Human3.6M and MPI-INF-3DHP.
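The abstract does not specify how the frame-padding network or the temporal transformer are implemented, so the following is only a minimal illustrative sketch in PyTorch, not the authors' code. It assumes edge-frame replication as the padding scheme, and the names (pad_edge_frames, TemporalBlock), tensor shapes, and hyperparameters are hypothetical.

```python
# Minimal sketch of the frame-padding idea, assuming replication of edge
# frames so every original frame sees a full temporal window. All names,
# shapes, and sizes here are illustrative assumptions.
import torch
import torch.nn as nn


def pad_edge_frames(poses_2d: torch.Tensor, pad: int) -> torch.Tensor:
    """Pad a 2D pose sequence at both temporal ends.

    poses_2d: (batch, frames, joints, 2) sequence of 2D keypoints.
    pad: number of frames added on each side.
    """
    first = poses_2d[:, :1].expand(-1, pad, -1, -1)   # repeat first frame
    last = poses_2d[:, -1:].expand(-1, pad, -1, -1)   # repeat last frame
    return torch.cat([first, poses_2d, last], dim=1)  # (B, T + 2*pad, J, 2)


class TemporalBlock(nn.Module):
    """One self-attention block over the padded frame axis (illustrative)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) -- pre-norm residual attention.
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out


# Usage: pad a 243-frame clip by 13 frames per side, embed joints, attend.
B, T, J = 2, 243, 17
poses = torch.randn(B, T, J, 2)
padded = pad_edge_frames(poses, pad=13)            # (2, 269, 17, 2)
tokens = nn.Linear(J * 2, 256)(padded.flatten(2))  # one token per frame
feats = TemporalBlock()(tokens)                    # (2, 269, 256)
```

Under this reading, padding simply extends the clip so that attention windows centered on the first and last frames are fully populated; the paper's multi-level RTCPT, TCP, and FAR components would then operate on such padded feature sequences.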