{"title":"DreamHOI:利用扩散先验条件生成受试者驱动的三维人-物互动效果","authors":"Thomas Hanwen Zhu, Ruining Li, Tomas Jakab","doi":"arxiv-2409.08278","DOIUrl":null,"url":null,"abstract":"We present DreamHOI, a novel method for zero-shot synthesis of human-object\ninteractions (HOIs), enabling a 3D human model to realistically interact with\nany given object based on a textual description. This task is complicated by\nthe varying categories and geometries of real-world objects and the scarcity of\ndatasets encompassing diverse HOIs. To circumvent the need for extensive data,\nwe leverage text-to-image diffusion models trained on billions of image-caption\npairs. We optimize the articulation of a skinned human mesh using Score\nDistillation Sampling (SDS) gradients obtained from these models, which predict\nimage-space edits. However, directly backpropagating image-space gradients into\ncomplex articulation parameters is ineffective due to the local nature of such\ngradients. To overcome this, we introduce a dual implicit-explicit\nrepresentation of a skinned mesh, combining (implicit) neural radiance fields\n(NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization,\nwe transition between implicit and explicit forms, grounding the NeRF\ngeneration while refining the mesh articulation. We validate our approach\nthrough extensive experiments, demonstrating its effectiveness in generating\nrealistic HOIs.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"24 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors\",\"authors\":\"Thomas Hanwen Zhu, Ruining Li, Tomas Jakab\",\"doi\":\"arxiv-2409.08278\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present DreamHOI, a novel method for zero-shot synthesis of human-object\\ninteractions (HOIs), enabling a 3D human model to realistically interact with\\nany given object based on a textual description. This task is complicated by\\nthe varying categories and geometries of real-world objects and the scarcity of\\ndatasets encompassing diverse HOIs. To circumvent the need for extensive data,\\nwe leverage text-to-image diffusion models trained on billions of image-caption\\npairs. We optimize the articulation of a skinned human mesh using Score\\nDistillation Sampling (SDS) gradients obtained from these models, which predict\\nimage-space edits. However, directly backpropagating image-space gradients into\\ncomplex articulation parameters is ineffective due to the local nature of such\\ngradients. To overcome this, we introduce a dual implicit-explicit\\nrepresentation of a skinned mesh, combining (implicit) neural radiance fields\\n(NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization,\\nwe transition between implicit and explicit forms, grounding the NeRF\\ngeneration while refining the mesh articulation. 
We validate our approach\\nthrough extensive experiments, demonstrating its effectiveness in generating\\nrealistic HOIs.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":\"24 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.08278\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08278","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors
We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients.
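
For reference, the SDS gradient being pushed into the articulation parameters has, in its standard DreamFusion form (the abstract does not restate it, so the notation below is the usual one rather than DreamHOI's own):

    \nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
      = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
          \big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,
          \frac{\partial x}{\partial \theta} \right],
    \qquad x = g(\theta),\quad x_t = \alpha_t x + \sigma_t \epsilon,

where g is a differentiable renderer, y is the text prompt, and \hat{\epsilon}_\phi is the frozen diffusion model's noise prediction. The locality the abstract points to is visible here: the residual \hat{\epsilon}_\phi - \epsilon acts per pixel, while a joint rotation influences the image only through \partial x / \partial \theta, so image-space edits far from a limb's current silhouette yield little useful gradient on the pose.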
To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.
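
A minimal sketch of what the explicit phase of this alternation could look like, assuming an outer loop that first optimizes a person NeRF with SDS and then re-fits the skeleton pose to match it. All names here (refine_articulation, render_posed_mesh, the MSE loss choice, step counts) are illustrative assumptions, not the authors' actual implementation:

    # Hypothetical explicit-phase step: fit articulation parameters so the
    # skinned-mesh render matches a target render from the NeRF phase.
    from typing import Callable
    import torch

    def refine_articulation(
        pose: torch.Tensor,            # e.g. per-joint rotations + root transform
        render_posed_mesh: Callable[[torch.Tensor], torch.Tensor],
                                       # differentiable renderer: pose -> image
        nerf_image: torch.Tensor,      # target image rendered from the optimized NeRF
        steps: int = 500,
        lr: float = 1e-2,
    ) -> torch.Tensor:
        """One plausible realization of 'refining the mesh articulation':
        gradient-descend the pose against the grounded NeRF render."""
        pose = pose.clone().requires_grad_(True)
        opt = torch.optim.Adam([pose], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(render_posed_mesh(pose), nerf_image)
            loss.backward()
            opt.step()
        return pose.detach()

    # Outer loop (schematic): per round, (1) optimize the NeRF with SDS,
    # grounded by the currently posed mesh; (2) call refine_articulation on a
    # render of that NeRF; repeat until the pose stabilizes.

The intuition behind such a scheme, as the abstract frames it, is that the NeRF absorbs the local image-space SDS gradients, and the pose is then fit against a dense rendered target rather than against per-pixel edits directly.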