DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors

Thomas Hanwen Zhu, Ruining Li, Tomas Jakab
{"title":"DreamHOI:利用扩散先验条件生成受试者驱动的三维人-物互动效果","authors":"Thomas Hanwen Zhu, Ruining Li, Tomas Jakab","doi":"arxiv-2409.08278","DOIUrl":null,"url":null,"abstract":"We present DreamHOI, a novel method for zero-shot synthesis of human-object\ninteractions (HOIs), enabling a 3D human model to realistically interact with\nany given object based on a textual description. This task is complicated by\nthe varying categories and geometries of real-world objects and the scarcity of\ndatasets encompassing diverse HOIs. To circumvent the need for extensive data,\nwe leverage text-to-image diffusion models trained on billions of image-caption\npairs. We optimize the articulation of a skinned human mesh using Score\nDistillation Sampling (SDS) gradients obtained from these models, which predict\nimage-space edits. However, directly backpropagating image-space gradients into\ncomplex articulation parameters is ineffective due to the local nature of such\ngradients. To overcome this, we introduce a dual implicit-explicit\nrepresentation of a skinned mesh, combining (implicit) neural radiance fields\n(NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization,\nwe transition between implicit and explicit forms, grounding the NeRF\ngeneration while refining the mesh articulation. We validate our approach\nthrough extensive experiments, demonstrating its effectiveness in generating\nrealistic HOIs.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors\",\"authors\":\"Thomas Hanwen Zhu, Ruining Li, Tomas Jakab\",\"doi\":\"arxiv-2409.08278\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present DreamHOI, a novel method for zero-shot synthesis of human-object\\ninteractions (HOIs), enabling a 3D human model to realistically interact with\\nany given object based on a textual description. This task is complicated by\\nthe varying categories and geometries of real-world objects and the scarcity of\\ndatasets encompassing diverse HOIs. To circumvent the need for extensive data,\\nwe leverage text-to-image diffusion models trained on billions of image-caption\\npairs. We optimize the articulation of a skinned human mesh using Score\\nDistillation Sampling (SDS) gradients obtained from these models, which predict\\nimage-space edits. However, directly backpropagating image-space gradients into\\ncomplex articulation parameters is ineffective due to the local nature of such\\ngradients. To overcome this, we introduce a dual implicit-explicit\\nrepresentation of a skinned mesh, combining (implicit) neural radiance fields\\n(NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization,\\nwe transition between implicit and explicit forms, grounding the NeRF\\ngeneration while refining the mesh articulation. 
We validate our approach\\nthrough extensive experiments, demonstrating its effectiveness in generating\\nrealistic HOIs.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.08278\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08278","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.
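For context, the Score Distillation Sampling (SDS) gradients mentioned above follow the formulation introduced by DreamFusion: a frozen text-to-image diffusion model scores a differentiable rendering, and the resulting image-space signal is pushed back into the scene parameters. In the standard form (the abstract does not spell out the exact weighting or conditioning DreamHOI uses), the gradient with respect to parameters θ of a renderer x = g(θ) is

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}(\theta)
  \;=\;
  \mathbb{E}_{t,\,\epsilon}\!\left[
    w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\,y,\,t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta}
  \right],
  \qquad x = g(\theta),
```

where x_t is the noised rendering at diffusion time t, y is the text prompt, \hat{\epsilon}_{\phi} is the frozen model's noise prediction, and w(t) is a time-dependent weight. The abstract's point is that when θ consists of skeleton articulation parameters, this image-space signal is too local to drive large pose changes on its own, which is what motivates the dual implicit (NeRF) / explicit (skinned-mesh) representation.

The following is a minimal, hypothetical PyTorch sketch of a single SDS update on pose parameters, included only to make the gradient flow concrete; every name in it (sds_step, render, denoiser, the simplified noising and weighting) is an illustrative placeholder and not taken from the DreamHOI implementation.

```python
# Hypothetical sketch of one SDS-style update on articulation (pose) parameters.
# All names are illustrative placeholders, not DreamHOI code.
import torch

def sds_step(theta, render, denoiser, text_emb, lr=1e-2):
    """One Score Distillation Sampling step on pose parameters `theta`.

    render(theta)               -> image tensor, differentiable w.r.t. theta
    denoiser(x_t, t, text_emb)  -> predicted noise from a frozen diffusion model
    """
    image = render(theta)                     # differentiable rendering
    t = torch.rand(())                        # random diffusion time in (0, 1)
    eps = torch.randn_like(image)             # injected noise
    x_t = (1.0 - t) * image + t * eps         # simplified stand-in for forward noising
    with torch.no_grad():
        eps_hat = denoiser(x_t, t, text_emb)  # frozen score network, no grad through it
    w = 1.0 - t                               # one simple choice of weighting w(t)
    # SDS: feed w(t) * (eps_hat - eps) in as the gradient of the rendered image,
    # so it back-propagates through the renderer into theta only.
    image.backward(gradient=w * (eps_hat - eps))
    with torch.no_grad():
        theta -= lr * theta.grad              # plain gradient step on the pose
        theta.grad = None
    return theta

# Toy usage with stand-in components, purely to shape-check the sketch.
theta = torch.zeros(72, requires_grad=True)             # e.g. 24 joints x 3 axis-angle values
render = lambda th: torch.tanh(th).reshape(1, 3, 4, 6)  # fake differentiable "renderer"
denoiser = lambda x_t, t, emb: torch.zeros_like(x_t)    # fake frozen diffusion model
theta = sds_step(theta, render, denoiser, text_emb=None)
```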