3D Human Mesh Reconstruction by Learning to Sample Joint Adaptive Tokens for Transformers

Youze Xue, Jiansheng Chen, Yudong Zhang, Cheng Yu, Huimin Ma, Hongbing Ma
{"title":"3D Human Mesh Reconstruction by Learning to Sample Joint Adaptive Tokens for Transformers","authors":"Youze Xue, Jiansheng Chen, Yudong Zhang, Cheng Yu, Huimin Ma, Hongbing Ma","doi":"10.1145/3503161.3548133","DOIUrl":null,"url":null,"abstract":"Reconstructing 3D human mesh from a single RGB image is a challenging task due to the inherent depth ambiguity. Researchers commonly use convolutional neural networks to extract features and then apply spatial aggregation on the feature maps to explore the embedded 3D cues in the 2D image. Recently, two methods of spatial aggregation, the transformers and the spatial attention, are adopted to achieve the state-of-the-art performance, whereas they both have limitations. The use of transformers helps modelling long-term dependency across different joints whereas the grid tokens are not adaptive for the positions and shapes of human joints in different images. On the contrary, the spatial attention focuses on joint-specific features. However, the non-local information of the body is ignored by the concentrated attention maps. To address these issues, we propose a Learnable Sampling module to generate joint adaptive tokens and then use transformers to aggregate global information. Feature vectors are sampled accordingly from the feature maps to form the tokens of different joints. The sampling weights are predicted by a learnable network so that the model can learn to sample joint-related features adaptively. Our adaptive tokens are explicitly correlated with human joints, so that more effective modeling of global dependency among different human joints can be achieved. To validate the effectiveness of our method, we conduct experiments on several popular datasets including Human3.6M and 3DPW. Our method achieves lower reconstruction errors in terms of both the vertex-based metric and the joint-based metric compared to previous state of the arts. The codes and the trained models are released at https://github.com/thuxyz19/Learnable-Sampling.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"94 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th ACM International Conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3503161.3548133","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Reconstructing 3D human mesh from a single RGB image is a challenging task due to the inherent depth ambiguity. Researchers commonly use convolutional neural networks to extract features and then apply spatial aggregation on the feature maps to explore the embedded 3D cues in the 2D image. Recently, two methods of spatial aggregation, the transformers and the spatial attention, are adopted to achieve the state-of-the-art performance, whereas they both have limitations. The use of transformers helps modelling long-term dependency across different joints whereas the grid tokens are not adaptive for the positions and shapes of human joints in different images. On the contrary, the spatial attention focuses on joint-specific features. However, the non-local information of the body is ignored by the concentrated attention maps. To address these issues, we propose a Learnable Sampling module to generate joint adaptive tokens and then use transformers to aggregate global information. Feature vectors are sampled accordingly from the feature maps to form the tokens of different joints. The sampling weights are predicted by a learnable network so that the model can learn to sample joint-related features adaptively. Our adaptive tokens are explicitly correlated with human joints, so that more effective modeling of global dependency among different human joints can be achieved. To validate the effectiveness of our method, we conduct experiments on several popular datasets including Human3.6M and 3DPW. Our method achieves lower reconstruction errors in terms of both the vertex-based metric and the joint-based metric compared to previous state of the arts. The codes and the trained models are released at https://github.com/thuxyz19/Learnable-Sampling.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
通过学习采样变形金刚关节自适应令牌进行三维人体网格重建
由于固有的深度模糊性,从单个RGB图像重建三维人体网格是一项具有挑战性的任务。研究人员通常使用卷积神经网络提取特征,然后在特征映射上应用空间聚合来探索2D图像中嵌入的3D线索。近年来,为了达到最先进的性能,人们采用了两种空间聚集方法:变压器和空间关注,但这两种方法都有局限性。变压器的使用有助于模拟不同关节之间的长期依赖关系,而网格标记不能适应不同图像中人体关节的位置和形状。相反,空间注意力集中在关节特征上。然而,身体的非局部信息被集中的注意图所忽略。为了解决这些问题,我们提出了一个可学习的采样模块来生成联合自适应令牌,然后使用变压器来聚合全局信息。从特征映射中抽取相应的特征向量,形成不同节点的标记。通过可学习网络预测采样权值,使模型能够自适应学习对关节相关特征进行采样。我们的自适应标记与人体关节显式相关,因此可以更有效地建模不同人体关节之间的全局依赖关系。为了验证我们方法的有效性,我们在几个流行的数据集上进行了实验,包括Human3.6M和3DPW。与以前的技术相比,我们的方法在基于顶点的度量和基于关节的度量方面都实现了更低的重建误差。代码和训练过的模型发布在https://github.com/thuxyz19/Learnable-Sampling。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Adaptive Anti-Bottleneck Multi-Modal Graph Learning Network for Personalized Micro-video Recommendation Composite Photograph Harmonization with Complete Background Cues Domain-Specific Conditional Jigsaw Adaptation for Enhancing transferability and Discriminability Enabling Effective Low-Light Perception using Ubiquitous Low-Cost Visible-Light Cameras Restoration of Analog Videos Using Swin-UNet
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1