{"title":"MaskRecon: High-quality human reconstruction via masked autoencoders using a single RGB-D image","authors":"","doi":"10.1016/j.neucom.2024.128487","DOIUrl":null,"url":null,"abstract":"<div><p>In this paper, we explore reconstructing high-quality clothed 3D humans from a single RGB-D image, assuming that virtual humans can be represented by front-view and back-view depths. Due to the scarcity of captured real RGB-D human images, we employ rendered images to train our method. However, rendered images lack background with significant depth variation in silhouettes, leading to shape prediction inaccuracies and noise. To mitigate this issue, we introduce a pseudo-multi-task framework, which incorporates a Conditional Generative Adversarial Network (CGAN) to infer back-view RGB-D images and a self-supervised Masked Autoencoder (MAE) to capture latent structural information of the human body. Additionally, we propose a Multi-scale Feature Fusion (MFF) module to effectively merge structural information and conditional features at various scales. Our method surpasses many existing techniques, as demonstrated through evaluations on the Thuman, RenderPeople, and BUFF datasets. Notably, our approach excels in reconstructing high-quality human models, even under challenging conditions such as complex poses and loose clothing, both on rendered and real-world images. Codes are available at <span><span>https://github.com/Archaic-Atom/MaskRecon</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":null,"pages":null},"PeriodicalIF":5.5000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S092523122401258X","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
In this paper, we explore reconstructing high-quality clothed 3D humans from a single RGB-D image, assuming that virtual humans can be represented by front-view and back-view depths. Due to the scarcity of captured real RGB-D human images, we employ rendered images to train our method. However, rendered images lack background with significant depth variation in silhouettes, leading to shape prediction inaccuracies and noise. To mitigate this issue, we introduce a pseudo-multi-task framework, which incorporates a Conditional Generative Adversarial Network (CGAN) to infer back-view RGB-D images and a self-supervised Masked Autoencoder (MAE) to capture latent structural information of the human body. Additionally, we propose a Multi-scale Feature Fusion (MFF) module to effectively merge structural information and conditional features at various scales. Our method surpasses many existing techniques, as demonstrated through evaluations on the Thuman, RenderPeople, and BUFF datasets. Notably, our approach excels in reconstructing high-quality human models, even under challenging conditions such as complex poses and loose clothing, both on rendered and real-world images. Codes are available at https://github.com/Archaic-Atom/MaskRecon.
期刊介绍:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.