DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors

Thomas Hanwen Zhu, Ruining Li, Tomas Jakab
{"title":"DreamHOI:利用扩散先验条件生成受试者驱动的三维人-物互动效果","authors":"Thomas Hanwen Zhu, Ruining Li, Tomas Jakab","doi":"arxiv-2409.08278","DOIUrl":null,"url":null,"abstract":"We present DreamHOI, a novel method for zero-shot synthesis of human-object\ninteractions (HOIs), enabling a 3D human model to realistically interact with\nany given object based on a textual description. This task is complicated by\nthe varying categories and geometries of real-world objects and the scarcity of\ndatasets encompassing diverse HOIs. To circumvent the need for extensive data,\nwe leverage text-to-image diffusion models trained on billions of image-caption\npairs. We optimize the articulation of a skinned human mesh using Score\nDistillation Sampling (SDS) gradients obtained from these models, which predict\nimage-space edits. However, directly backpropagating image-space gradients into\ncomplex articulation parameters is ineffective due to the local nature of such\ngradients. To overcome this, we introduce a dual implicit-explicit\nrepresentation of a skinned mesh, combining (implicit) neural radiance fields\n(NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization,\nwe transition between implicit and explicit forms, grounding the NeRF\ngeneration while refining the mesh articulation. We validate our approach\nthrough extensive experiments, demonstrating its effectiveness in generating\nrealistic HOIs.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors\",\"authors\":\"Thomas Hanwen Zhu, Ruining Li, Tomas Jakab\",\"doi\":\"arxiv-2409.08278\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present DreamHOI, a novel method for zero-shot synthesis of human-object\\ninteractions (HOIs), enabling a 3D human model to realistically interact with\\nany given object based on a textual description. This task is complicated by\\nthe varying categories and geometries of real-world objects and the scarcity of\\ndatasets encompassing diverse HOIs. To circumvent the need for extensive data,\\nwe leverage text-to-image diffusion models trained on billions of image-caption\\npairs. We optimize the articulation of a skinned human mesh using Score\\nDistillation Sampling (SDS) gradients obtained from these models, which predict\\nimage-space edits. However, directly backpropagating image-space gradients into\\ncomplex articulation parameters is ineffective due to the local nature of such\\ngradients. To overcome this, we introduce a dual implicit-explicit\\nrepresentation of a skinned mesh, combining (implicit) neural radiance fields\\n(NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization,\\nwe transition between implicit and explicit forms, grounding the NeRF\\ngeneration while refining the mesh articulation. 
We validate our approach\\nthrough extensive experiments, demonstrating its effectiveness in generating\\nrealistic HOIs.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.08278\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08278","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.
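For context, the Score Distillation Sampling (SDS) gradients mentioned above follow the formulation introduced by DreamFusion: a frozen text-to-image diffusion model scores a differentiable rendering, and the resulting image-space signal is pushed back into the scene parameters. In the standard form (the abstract does not spell out the exact weighting or conditioning DreamHOI uses), the gradient with respect to parameters θ of a renderer x = g(θ) is

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}(\theta)
  \;=\;
  \mathbb{E}_{t,\,\epsilon}\!\left[
    w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\,y,\,t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta}
  \right],
  \qquad x = g(\theta),
```

where x_t is the noised rendering at diffusion time t, y is the text prompt, \hat{\epsilon}_{\phi} is the frozen model's noise prediction, and w(t) is a time-dependent weight. The abstract's point is that when θ consists of skeleton articulation parameters, this image-space signal is too local to drive large pose changes on its own, which is what motivates the dual implicit (NeRF) / explicit (skinned-mesh) representation.

The following is a minimal, hypothetical PyTorch sketch of a single SDS update on pose parameters, included only to make the gradient flow concrete; every name in it (sds_step, render, denoiser, the simplified noising and weighting) is an illustrative placeholder and not taken from the DreamHOI implementation.

```python
# Hypothetical sketch of one SDS-style update on articulation (pose) parameters.
# All names are illustrative placeholders, not DreamHOI code.
import torch

def sds_step(theta, render, denoiser, text_emb, lr=1e-2):
    """One Score Distillation Sampling step on pose parameters `theta`.

    render(theta)               -> image tensor, differentiable w.r.t. theta
    denoiser(x_t, t, text_emb)  -> predicted noise from a frozen diffusion model
    """
    image = render(theta)                     # differentiable rendering
    t = torch.rand(())                        # random diffusion time in (0, 1)
    eps = torch.randn_like(image)             # injected noise
    x_t = (1.0 - t) * image + t * eps         # simplified stand-in for forward noising
    with torch.no_grad():
        eps_hat = denoiser(x_t, t, text_emb)  # frozen score network, no grad through it
    w = 1.0 - t                               # one simple choice of weighting w(t)
    # SDS: feed w(t) * (eps_hat - eps) in as the gradient of the rendered image,
    # so it back-propagates through the renderer into theta only.
    image.backward(gradient=w * (eps_hat - eps))
    with torch.no_grad():
        theta -= lr * theta.grad              # plain gradient step on the pose
        theta.grad = None
    return theta

# Toy usage with stand-in components, purely to shape-check the sketch.
theta = torch.zeros(72, requires_grad=True)             # e.g. 24 joints x 3 axis-angle values
render = lambda th: torch.tanh(th).reshape(1, 3, 4, 6)  # fake differentiable "renderer"
denoiser = lambda x_t, t, emb: torch.zeros_like(x_t)    # fake frozen diffusion model
theta = sds_step(theta, render, denoiser, text_emb=None)
```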