Exploiting multi-transformer encoder with multiple-hypothesis aggregation via diffusion model for 3D human pose estimation

IF 3 4区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Multimedia Tools and Applications Pub Date : 2024-09-10 DOI:10.1007/s11042-024-20179-x

Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo

{"title":"Exploiting multi-transformer encoder with multiple-hypothesis aggregation via diffusion model for 3D human pose estimation","authors":"Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo","doi":"10.1007/s11042-024-20179-x","DOIUrl":null,"url":null,"abstract":"<p>The transformer architecture has consistently achieved cutting-edge performance in the task of 2D to 3D lifting human pose estimation. Despite advances in transformer-based methods they still suffer from issues related to sequential data processing, addressing depth ambiguity, and effective handling of sensitive noisy data. As a result, transformer encoders encounter difficulties in precisely estimating human positions. To solve this problem, a novel multi-transformer encoder with a multiple-hypothesis aggregation (MHAFormer) module is proposed in this study. To do this, a diffusion module is first introduced that generates multiple 3D pose hypotheses and gradually distributes Gaussian noise to ground truth 3D poses. Subsequently, the denoiser is employed within the diffusion module to restore the feasible 3D poses by leveraging the information from the 2D keypoints. Moreover, we propose the multiple-hypothesis aggregation with a join-level reprojection (MHAJR) approach that redesigns the 3D hypotheses into the 2D position and selects the optimal hypothesis by considering reprojection errors. In particular, the multiple-hypothesis aggregation approach tackles depth ambiguity and sequential data processing by considering various possible poses and combining their strengths for a more accurate final estimation. Next, we present the improved spatial-temporal transformers encoder that can help to improve the accuracy and reduce the ambiguity of 3D pose estimation by explicitly modeling the spatial and temporal relationships between different body joints. Specifically, the temporal-transformer encoder introduces the temporal constriction & proliferation (TCP) attention mechanism and the feature aggregation refinement module (FAR) into the refined temporal constriction & proliferation (RTCP) transformer, which enhances intra-block temporal modeling and further refines inter-block feature interaction. Finally, the superiority of the proposed approach is demonstrated through comparison with existing methods using the Human3.6M and MPI-INF-3DHP benchmark datasets.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":"47 1","pages":""},"PeriodicalIF":3.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Multimedia Tools and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11042-024-20179-x","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

The transformer architecture has consistently achieved cutting-edge performance in the task of 2D to 3D lifting human pose estimation. Despite advances in transformer-based methods they still suffer from issues related to sequential data processing, addressing depth ambiguity, and effective handling of sensitive noisy data. As a result, transformer encoders encounter difficulties in precisely estimating human positions. To solve this problem, a novel multi-transformer encoder with a multiple-hypothesis aggregation (MHAFormer) module is proposed in this study. To do this, a diffusion module is first introduced that generates multiple 3D pose hypotheses and gradually distributes Gaussian noise to ground truth 3D poses. Subsequently, the denoiser is employed within the diffusion module to restore the feasible 3D poses by leveraging the information from the 2D keypoints. Moreover, we propose the multiple-hypothesis aggregation with a join-level reprojection (MHAJR) approach that redesigns the 3D hypotheses into the 2D position and selects the optimal hypothesis by considering reprojection errors. In particular, the multiple-hypothesis aggregation approach tackles depth ambiguity and sequential data processing by considering various possible poses and combining their strengths for a more accurate final estimation. Next, we present the improved spatial-temporal transformers encoder that can help to improve the accuracy and reduce the ambiguity of 3D pose estimation by explicitly modeling the spatial and temporal relationships between different body joints. Specifically, the temporal-transformer encoder introduces the temporal constriction & proliferation (TCP) attention mechanism and the feature aggregation refinement module (FAR) into the refined temporal constriction & proliferation (RTCP) transformer, which enhances intra-block temporal modeling and further refines inter-block feature interaction. Finally, the superiority of the proposed approach is demonstrated through comparison with existing methods using the Human3.6M and MPI-INF-3DHP benchmark datasets.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

通过扩散模型利用多变换器编码器和多假设聚合进行三维人体姿态估计

在从二维到三维的升降式人体姿态估计任务中，变换器架构一直保持着最先进的性能。尽管基于变压器的方法取得了进步，但仍存在与顺序数据处理、解决深度模糊性和有效处理敏感噪声数据相关的问题。因此，变压器编码器在精确估计人体位置方面遇到了困难。为解决这一问题，本研究提出了一种带有多重假设聚合（MHAFormer）模块的新型多变换器编码器。为此，首先引入一个扩散模块，生成多个三维姿态假设，并逐渐将高斯噪声分布到地面真实三维姿态上。然后，在扩散模块中使用去噪器，利用二维关键点的信息恢复可行的三维姿势。此外，我们还提出了带有连接级重投（MHAJR）的多假设聚合方法，将三维假设重新设计为二维位置，并通过考虑重投误差来选择最优假设。特别是，多假设聚合方法通过考虑各种可能的姿势并结合其优势以获得更准确的最终估计，从而解决了深度模糊性和顺序数据处理问题。接下来，我们介绍了改进的时空变换器编码器，它可以通过明确模拟不同身体关节之间的时空关系，帮助提高三维姿势估计的准确性并减少模糊性。具体来说，时空变换器编码器将时空收缩与扩散（TCP）注意机制和特征聚合细化模块（FAR）引入到细化时空收缩与扩散（RTCP）变换器中，从而增强了块内时空建模，并进一步细化了块间特征交互。最后，通过使用 Human3.6M 和 MPI-INF-3DHP 基准数据集与现有方法进行比较，证明了所提出方法的优越性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Multimedia Tools and Applications 工程技术-工程：电子与电气

CiteScore

7.20

自引率

16.70%

发文量

2439

审稿时长

9.2 months

期刊介绍： Multimedia Tools and Applications publishes original research articles on multimedia development and system support tools as well as case studies of multimedia applications. It also features experimental and survey articles. The journal is intended for academics, practitioners, scientists and engineers who are involved in multimedia system research, design and applications. All papers are peer reviewed. Specific areas of interest include: - Multimedia Tools: - Multimedia Applications: - Prototype multimedia systems and platforms