3D Human Mesh Reconstruction by Learning to Sample Joint Adaptive Tokens for Transformers

Proceedings of the 30th ACM International Conference on Multimedia | Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3548133

Youze Xue, Jiansheng Chen, Yudong Zhang, Cheng Yu, Huimin Ma, Hongbing Ma


Citations: 2

Abstract

Reconstructing a 3D human mesh from a single RGB image is challenging because of the inherent depth ambiguity. Researchers commonly use convolutional neural networks to extract features and then apply spatial aggregation to the feature maps to exploit the 3D cues embedded in the 2D image. Recently, two spatial aggregation methods, transformers and spatial attention, have been adopted to achieve state-of-the-art performance, but both have limitations. Transformers help model long-range dependencies across different joints, but their grid tokens do not adapt to the positions and shapes of human joints in different images. Spatial attention, by contrast, focuses on joint-specific features, but its concentrated attention maps ignore non-local information about the body. To address these issues, we propose a Learnable Sampling module that generates joint-adaptive tokens, and then use transformers to aggregate global information. Feature vectors are sampled from the feature maps to form the tokens of different joints. The sampling weights are predicted by a learnable network, so the model learns to sample joint-related features adaptively. Because our adaptive tokens are explicitly correlated with human joints, they allow more effective modeling of global dependencies among joints. To validate the effectiveness of our method, we conduct experiments on several popular datasets, including Human3.6M and 3DPW. Our method achieves lower reconstruction errors than previous state-of-the-art methods on both the vertex-based and the joint-based metrics. The code and trained models are released at https://github.com/thuxyz19/Learnable-Sampling.
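
The sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the released repository for that). It assumes that the "learnable network" is reduced to one linear scoring vector per joint, that the sampling weights are normalised with a softmax over spatial positions, and that each joint token is the weighted sum of the feature vectors. The names `sample_joint_tokens` and `joint_weights` and the toy sizes are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_joint_tokens(features, joint_weights):
    """Pool one token per joint from a flattened feature map.

    features:      N spatial positions, each a C-dim feature vector.
    joint_weights: J rows of C coefficients; a stand-in (assumption) for
                   the learnable network that predicts sampling weights.
    Returns (tokens, weights): J tokens of C dims and J x N sampling weights.
    """
    channels = len(features[0])
    tokens, all_weights = [], []
    for w in joint_weights:
        # Score every position for this joint, then normalise over positions.
        scores = [sum(wc * fc for wc, fc in zip(w, feat)) for feat in features]
        weights = softmax(scores)
        # The joint token is the weighted sum of the feature vectors.
        token = [sum(a * feat[c] for a, feat in zip(weights, features))
                 for c in range(channels)]
        tokens.append(token)
        all_weights.append(weights)
    return tokens, all_weights


random.seed(0)
H, W, C, J = 4, 4, 8, 3  # toy sizes, chosen only for illustration
features = [[random.gauss(0, 1) for _ in range(C)] for _ in range(H * W)]
joint_weights = [[random.gauss(0, 1) for _ in range(C)] for _ in range(J)]
tokens, weights = sample_joint_tokens(features, joint_weights)
```

In this sketch the resulting `tokens` (one per joint) would take the place of fixed grid tokens as the input sequence to a transformer, which then aggregates information across joints. In a trained model the scoring parameters would be learned end to end instead of drawn at random.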


