StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset
Chaofan Huo, Ye Shi, Yuexin Ma, Lan Xu, Jingyi Yu, Jingya Wang
arXiv - CS - Graphics, 2024-07-30. DOI: arxiv-2407.20545
Abstract
Modeling and capturing the 3D spatial arrangement of the human and the object
is key to perceiving 3D human-object interaction from monocular images. In
this work, we represent the human-object spatial relation with the Human-Object
Offset, computed between anchors that are densely sampled from the surfaces of
the human mesh and the object mesh. Compared with previous works that use a
contact map or an implicit distance field to encode 3D human-object spatial
relations, our method is a simple and efficient way to encode the highly
detailed spatial correlation between the human and the object. Based on this
representation, we
propose Stacked Normalizing Flow (StackFLOW) to infer the posterior
distribution of human-object spatial relations from the image. During the
optimization stage, we fine-tune the human body pose and the object 6D pose by
maximizing the likelihood of samples under this posterior distribution and
minimizing the reprojection loss over 2D-3D correspondences. Extensive
experiments show that our method achieves impressive results on two challenging
benchmarks, the BEHAVE and InterCap datasets.
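
To make the offset representation concrete, the following is a minimal sketch of one plausible reading of it, not the authors' implementation. It assumes the offsets are taken between every pair of one human anchor and one object anchor and then flattened into a single vector; the function name `human_object_offsets`, the pairing scheme, and the toy anchor coordinates are illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def human_object_offsets(human_anchors: np.ndarray,
                         object_anchors: np.ndarray) -> np.ndarray:
    """Flattened offsets between all human/object anchor pairs.

    Assumption (not stated in the abstract): every human anchor is paired
    with every object anchor.

    human_anchors:  (m, 3) points sampled on the human mesh surface.
    object_anchors: (n, 3) points sampled on the object mesh surface.
    Returns a vector of length m * n * 3.
    """
    # Broadcasting gives offsets[i, j] = object_anchors[j] - human_anchors[i].
    offsets = object_anchors[None, :, :] - human_anchors[:, None, :]
    return offsets.reshape(-1)


# Toy anchors standing in for points sampled from real meshes.
human = np.array([[0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
obj = np.array([[1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [1.0, 2.0, 0.0]])

x = human_object_offsets(human, obj)
print(x.shape)  # (18,) = 2 human anchors * 3 object anchors * 3 coordinates

# Moving the human and the object together leaves the offsets unchanged.
shift = np.array([5.0, -2.0, 3.0])
x_shifted = human_object_offsets(human + shift, obj + shift)
print(np.allclose(x, x_shifted))  # True
```

Under this reading, the vector depends only on the relative arrangement of the two anchor sets, which is why translating the human and the object together does not change it.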